Software Monitoring: The Things that Might Interest You
Editorial note: I originally wrote this post for the Monitis blog. You can check out the original here, at their site. While you’re there, have a look at the different sorts of production concerns that you can keep an eye on with their offering, some of which I address in this post.
If you have responsibility for software in production, I bet you’d like to know more about it. I don’t mean that you’d like an extra peek into the bowels of the source code or to understand its philosophical place in the universe. Rather, I bet you’d like to know more about how it behaves in the wild.
After all, from this opaque vantage point comes the overwhelming majority of maddening defects. “But it doesn’t do that in our environment,” you cry. “How can we even begin to track down a user report of, ‘sometimes that button doesn’t work right?’”
To combat this situation we have, since programmer time immemorial, turned to the log file. In that file, we find answers. Except, we find them the way an archaeologist finds answers about ancient civilizations. We assemble cryptic, incomplete fragments and try to use them to deduce what happened long after the fact. Better than nothing, but not great.
Because of the incompleteness and the lag, we seek other solutions. With the rise in sophistication of tooling and the growth of the DevOps movement, we close the timing gap via monitoring. Rather than waiting for a user to report an error and then asking for a log file, we get out in front of the matter. When something flies off the rails, our monitoring tools quickly alert us, and we begin triage immediately.
Common Monitoring Use Cases
Later in this post, I will get imaginative. In writing this, I intend to expose you to some less common monitoring ideas that you might at least contemplate, if not outright implement. But for now, let’s consider some relatively blue-chip monitoring scenarios. These will transcend even the basic nature of the application and apply equally well to web, mobile, or desktop apps.
Monitis offers a huge variety of monitoring services, as the name implies. You can get your bearings about the full offering here. This means that if you want to do it, you can probably find an offering of theirs to do it, unless you’re really out there. Then you might want to supplement their offering with some customized functionality for your own situation.
But let’s say you’d just signed up for the service and wanted to test drive it. I can think of nothing simpler than “is this thing on?” Wherever it runs, you’d love some information about whether it runs when it should. On top of that, you’d probably also like to know whether it dies unexpectedly and ignobly. When your app crashes embarrassingly, you want to know about it.
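As a minimal sketch of the “is this thing on?” check, assume the application records a heartbeat timestamp at regular intervals and a separate watcher reads it. The function name `is_alive` and the 60-second silence limit are illustrative assumptions, not part of any particular monitoring product.

```python
from datetime import datetime, timedelta


def is_alive(last_heartbeat: datetime, now: datetime,
             max_silence: timedelta = timedelta(seconds=60)) -> bool:
    """Return True if the app reported in recently enough to count as running.

    `max_silence` is a hypothetical threshold; tune it to how often
    your application actually emits heartbeats.
    """
    return now - last_heartbeat <= max_silence


if __name__ == "__main__":
    now = datetime(2024, 1, 1, 12, 0, 0)
    print(is_alive(now - timedelta(seconds=30), now))   # recent heartbeat
    print(is_alive(now - timedelta(minutes=10), now))   # long silence
```

A watcher that finds `is_alive` returning False covers both cases mentioned above: the app never started, or it died unexpectedly.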
Once you’ve buttoned up the real basics, you might start to monitor for somewhat more nuanced situations. Does your code gobble up too many hardware resources, causing poor experience or added expense? Does it interact with services or databases that fail or go offline? In short, does your application wobble into sub-optimal states?
But what if we look beyond those basics? Let’s explore some things you may never have contemplated monitoring about your software.
Facebook has developed some reputation around having deployment nirvana. They constantly roll to production and use a sophisticated series of checks, balances, tests, and monitoring to alert them to problems needing correction. If the number of baby pictures in my feed is any indication, I’d say they’re doing pretty well.
But what happens if Facebook pushes something to production with a mistake not easily caught by automated unit tests? For instance, what if they accidentally deployed some CSS that turned the “post” button and its text the same color as the background? The flow of baby pictures would cease, even as all tests passed with flying colors.
Monitis offers something called “real user monitoring,” which generalizes this specific case and can address the situation. You may want to monitor user behavior in terms of how they engage with the site. If Facebook monitors how many times per second its users click “post,” and they see that drop to 0 after a production roll, they’ll know they have an issue almost immediately. Even if they don’t know what causes it, they can begin to triage and mitigate right away.
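To make the “post” button scenario concrete, here is a rough sketch of a drop detector. It assumes you already count an engagement event (clicks on “post,” say) per fixed time window. The class name `EngagementMonitor` and its thresholds are hypothetical and do not reflect how any real product implements this.

```python
from collections import deque


class EngagementMonitor:
    """Flags a window whose event count falls far below the recent average."""

    def __init__(self, baseline_windows: int = 12, drop_ratio: float = 0.2):
        # Rolling history of "normal" window counts (assumed sizes).
        self.history = deque(maxlen=baseline_windows)
        self.drop_ratio = drop_ratio

    def record(self, count: int) -> bool:
        """Record one window's count; return True if it looks like a sharp drop."""
        alert = False
        if len(self.history) == self.history.maxlen:
            baseline = sum(self.history) / len(self.history)
            alert = baseline > 0 and count < baseline * self.drop_ratio
        if not alert:
            # Keep alerting windows out of the baseline so an outage
            # does not quietly become the new "normal".
            self.history.append(count)
        return alert


if __name__ == "__main__":
    monitor = EngagementMonitor(baseline_windows=3)
    for clicks in (100, 100, 100, 0):
        print(clicks, monitor.record(clicks))
```

The detector stays quiet until it has a full baseline, so a freshly started monitor will not fire on its first few windows.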
If you have responsibility for any sort of e-commerce operation, I strongly encourage you to monitor your revenue. In a sense, you might consider this a specific instance of user engagement. You’ll have some sort of normal drip of people making purchases. Anything affecting that presents you with an obvious red flag.
You might be tempted to think of this as an accounting problem more than a technical one. Let the techies monitor the nuts and bolts while accounting worries about P&L? I don’t advise it. Purchases arguably count as the most important metric. They form the lifeblood of your business.
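One hedged way to watch that “normal drip” of purchases is to compare the time since the last purchase with the typical gap between purchases. The sketch below assumes purchase times arrive as sorted timestamps in seconds. The function name `purchase_drip_stalled` and the factor of 5 are illustrative assumptions.

```python
from statistics import median


def purchase_drip_stalled(purchase_times, now, factor=5.0):
    """Return True if the silence since the last purchase is unusually long.

    purchase_times: sorted timestamps in seconds (assumed input format).
    factor: how many typical gaps of silence to tolerate (assumed value).
    """
    gaps = [later - earlier
            for earlier, later in zip(purchase_times, purchase_times[1:])]
    if not gaps:
        return False  # not enough history to judge
    typical_gap = median(gaps)
    return (now - purchase_times[-1]) > factor * typical_gap


if __name__ == "__main__":
    times = [0, 60, 120, 180]  # one purchase per minute
    print(purchase_drip_stalled(times, now=200))
    print(purchase_drip_stalled(times, now=600))
```

A real shop would also need to account for daily and seasonal rhythms, which this sketch ignores.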
You mainly think of a “bounce” when you think of web applications. Google defines bounce as “a single-page session on your site.” I believe this plays on the opposite of “sticking.” People land, and “bounce off” of your site.
I’m going to re-appropriate the term a bit for our purposes here and generalize it to all application platforms. You might want to monitor the rate at which users exit your application from a particular page/screen.
When they leave from, say, an “exit” screen, then fine. You’d want a high percentage of departures from expected places. But if people begin to leave from a place you’d expect them to remain engaged, that might give you insight into a problem of some kind. This holds doubly true if it suddenly spikes in one particular place.
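A small sketch of that idea follows. It assumes you can collect the last screen each session visited. It compares the current share of exits per screen against a baseline and reports the screens whose share jumped. The names `exit_shares` and `spiking_exits`, the `"exit"` screen label, and the 15-point threshold are all hypothetical.

```python
from collections import Counter


def exit_shares(last_screens):
    """Map each screen to the fraction of sessions that ended there."""
    counts = Counter(last_screens)
    total = sum(counts.values())
    return {screen: n / total for screen, n in counts.items()} if total else {}


def spiking_exits(baseline, current,
                  expected_exits=frozenset({"exit"}), min_increase=0.15):
    """List screens, other than expected exit points, whose exit share rose sharply."""
    return sorted(
        screen for screen, share in current.items()
        if screen not in expected_exits
        and share - baseline.get(screen, 0.0) >= min_increase
    )


if __name__ == "__main__":
    baseline = exit_shares(["exit"] * 8 + ["cart"] + ["home"])
    current = exit_shares(["exit"] * 5 + ["cart"] * 4 + ["home"])
    print(spiking_exits(baseline, current))
```

Departures from the expected exit screen are ignored on purpose, since a high share there is what you want to see.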
User Experience Concerns
This particular concern would require some fairly sophisticated monitoring capabilities, most likely instrumented from within. If you do implement such a thing, take care not to impact performance. But, if you’re up for it, you might learn some interesting things.
Consider monitoring user behavior for user experience concerns. For instance, do users consistently dismiss a dialog far too quickly to have read it? Or perhaps do they all tend to execute the same key sequences to navigate through several screens? If so, you might have located opportunities to improve your user experience. Get rid of superfluous dialog messages and see about adding shortcuts for things they do frequently.
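For the dialog example, a rough sketch might estimate how long the text takes to read and then measure what fraction of users dismissed it faster than that. The reading speed of 240 words per minute is an assumption for illustration, as are the function names.

```python
def min_read_seconds(text, words_per_minute=240):
    """Estimate the time needed to read `text` at an assumed reading speed."""
    return len(text.split()) * 60 / words_per_minute


def skipped_fraction(text, dismiss_seconds, words_per_minute=240):
    """Fraction of dismissals that happened too fast to have read the dialog."""
    if not dismiss_seconds:
        return 0.0
    threshold = min_read_seconds(text, words_per_minute)
    too_fast = sum(1 for seconds in dismiss_seconds if seconds < threshold)
    return too_fast / len(dismiss_seconds)


if __name__ == "__main__":
    message = "Are you sure you want to do this"
    print(skipped_fraction(message, [0.5, 0.8, 3.0, 1.0]))
```

A consistently high fraction suggests the dialog is a candidate for removal or rewording.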
And you certainly aren’t limited by my suggestions here. If you have the capability to monitor interactions like this, study your own users with their particular habits and look to improve their experience.
Time to Load Visual Elements
This is another item that you hear about most frequently in websites. But, as with my looser interpretation of the “bounce” concept, you could really measure this anywhere. After all, sluggishness is sluggishness.
If you find yourself in a position to monitor the visual performance of your software, you stand to benefit from doing so. Few things torpedo the user experience as quickly as maddeningly slow loads. If this is happening, you want to know about it.
This holds doubly true for visual elements superfluous or non-essential to the experience itself. In the world of websites, think of ads or random widgets. And, while you can test a lot of this for yourself, concerns may arise in the wild that you can’t mimic in your own shop.
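As a sketch of how you might flag sluggish elements, suppose you collect load times in milliseconds per visual element from real sessions. The code below computes a nearest-rank 95th percentile for each element and reports those over a budget. The 1000 ms budget, the element names, and the function names are assumptions for illustration.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile of a non-empty list of numbers."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def slow_elements(load_times_ms, budget_ms=1000, pct=95):
    """Map each element whose pct-th percentile load time exceeds the budget."""
    slow = {}
    for name, times in load_times_ms.items():
        if not times:
            continue  # no samples collected for this element
        value = percentile(times, pct)
        if value > budget_ms:
            slow[name] = value
    return slow


if __name__ == "__main__":
    samples = {
        "feed": [200, 300, 250, 400],
        "ad_banner": [900, 1500, 2500, 1200],
    }
    print(slow_elements(samples))
```

Looking at a high percentile rather than the average helps surface the slow loads that only some users in the wild experience.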
Think of Your Own in the Spirit of Innovation
I’ve enjoyed the exercise in exploring what you might want to monitor. As both an entrepreneur and software developer, I like thinking about possible implementations, offerings, and features.
In fact, that captures what I find so appealing about the DevOps movement. As we marry software creation and software delivery, we open up an entirely new category of innovation that requires new and powerful tools. We can then combine those tools with the inventive spirit to deliver ever-higher quality software.