Defining Done for Your Deployment Process
A Tale of Debugging
The other day, someone came to me and told me that a component of one of the web applications that my team maintains seemed to have a problem. I sat with her, and she showed me what was going on. Sure enough, I saw the issue too. It was the kind of integration thing for which we were able to muster up some historical record and see approximately when the problem had started. Apparently, the problem first occurred around the last time we pushed a version of the website into production. Ruh roh.
Given that this was a pretty critical issue, I got right down to debugging. Pretty quickly, I found the place in the code where the integration callout was happening, and I stepped through it in the debugger. (As an aside, I realize I’ve often made the case against much debugger use. But when legacy code has no unit, component or integration tests, you really don’t have a lot of options.) No exceptions were thrown, and no obvious problems were occurring. It just hit a third party API, went through it uneventfully, but quietly failed.
At this point, it was time for some tinkering and reverse engineering. When I looked at the method call that seemed to be the meat of the issue, I noticed that it returned an integer that the code I was debugging just ignored. Hovering over it, XML doc comment engine told me that it was returning an error code. I would have preferred an exception, but whatever — this was progress. Unfortunately, that was all the help I got, and there was no indication what any returned error code meant. I ran the code and saw that I was getting a “2,” so presumably there was an error occurring.
Maddeningly, there was no online documentation of this API. The only way I was able to proceed was to install a trial version of this utility locally, on my desktop, and read the CHM file. They offered one through their website, but it was broken. After digging, I found that this error code meant “call failure or license file missing.” Hmm… license file, eh? I started to get that tiny adrenaline rush that you get when a solution seems like it might be just around the corner. I had just replaced the previous deployment process of “copy over the files that are different to the server” with a slightly less icky “wipe the server’s directory and put our stuff there.” It had taken some time to iron out failures and bring all of the dependencies under source control, but I viewed this as antiseptic on a festering sore. And, apparently, I had missed one. Upon diving further into the documentation, I saw that it required some weirdly-named license file with some kind of key in it to be in the web application root’s “bin” folder on the server, or it would just quietly fail. Awesome.
This was confirmed by going back to a historical archive of the site, finding that weird file, putting it into production and observing that the problem was resolved. So time to call it a day, right?
Fixing the Deeper Issue
Well, if you call it a day now, there’s a good chance this will happen again later. After all, the only thing that will prevent this after the next deployment is someone remembering, “oh, yeah, we have to copy over that weird license file thing into that directory from the previous deploy.” I don’t know about you, but I don’t really want important system functionality hinging on “oh, yeah, that thing!”
What about a big, fat comment in the code? Something like “this method call will fail if license file xyz isn’t in abc directory?” Well, in a year when everyone has forgotten this and there’s a new architect in town, that’ll at least save a headache next time this issue occurs. But this is reactionary. It has the advantage of not being purely tribal knowledge, but it doesn’t preemptively solve the problem. Another idea might be to trap error codes and throw an exception with a descriptive message, but this is just another step in making the gap between failure and resolution a shorter one. I think we should try avoid failing at all, though having comments and better error trapping is certainly a good idea in addition to whatever better solution comes next.
What about checking the license file into source control and designing the build to copy it to that directory? Win, right? Well, right — it solves the problem. With the next deploy, the license file will be on the server, and that means this particular issue won’t occur in the future. So now it must be time to call it day, right?
Still no, I’d argue. There’s work to be done, and it’s not easy or quick work. Because what needs to happen now is a move from a “delete the contents of the server directory and unzip the new deliverable” deployment to an automated build and deployment. What also needs to happen is a series of automated acceptance tests in a staging environment and possibly regression tests in a production environment. In this situation, not only are developers reading the code aware of the dependency on that license file, not only do failures happen quickly and obviously, and not only is the license file deployed to where it needs to go, but if anything ever goes wrong, automated notifications will occur immediately and allow the situation to be corrected.
It may seem pretty intense to set all that up, but it’s the responsible thing to do. Along the spectrum of some “deployment maturity model,” tweaking things on the server and forgetting about it is whatever score is “least mature.” What I’m talking about is “most mature.” Will it take some doing to get there and probably some time? Absolutely. Does that mean that anything less can be good enough? Nope. Not in my opinion, anyway.