Defining Done for Your Deployment Process

A Tale of Debugging

The other day, someone came to me and told me that a component of one of the web applications that my team maintains seemed to have a problem. I sat with her, and she showed me what was going on. Sure enough, I saw the issue too. It was the kind of integration thing for which we were able to muster up some historical record and see approximately when the problem had started. Apparently, the problem first occurred around the last time we pushed a version of the website into production. Ruh roh.

Given that this was a pretty critical issue, I got right down to debugging. Pretty quickly, I found the place in the code where the integration callout was happening, and I stepped through it in the debugger. (As an aside, I realize I’ve often made the case against much debugger use. But when legacy code has no unit, component or integration tests, you really don’t have a lot of options.) No exceptions were thrown, and no obvious problems were occurring. It just hit a third party API, went through it uneventfully, but quietly failed.

At this point, it was time for some tinkering and reverse engineering. When I looked at the method call that seemed to be the meat of the issue, I noticed that it returned an integer that the code I was debugging just ignored. Hovering over it, the XML doc comment told me that it was returning an error code. I would have preferred an exception, but whatever — this was progress. Unfortunately, that was all the help I got; there was no indication of what any returned error code meant. I ran the code and saw that I was getting a “2,” so presumably there was an error occurring.

Maddeningly, there was no online documentation for this API. The vendor offered a CHM file through their website, but it was broken, so the only way I could proceed was to install a trial version of the utility locally, on my desktop, and read the CHM file there. After digging, I found that this error code meant “call failure or license file missing.” Hmm… license file, eh? I started to get that tiny adrenaline rush that you get when a solution seems like it might be just around the corner. I had just replaced the previous deployment process of “copy over the files that are different to the server” with a slightly less icky “wipe the server’s directory and put our stuff there.” It had taken some time to iron out failures and bring all of the dependencies under source control, but I viewed this as antiseptic on a festering sore. And, apparently, I had missed one. Upon diving further into the documentation, I saw that the utility required a weirdly-named license file with some kind of key in it to be in the web application root’s “bin” folder on the server, or it would just quietly fail. Awesome.

This was confirmed by going back to a historical archive of the site, finding that weird file, putting it into production and observing that the problem was resolved. So time to call it a day, right?

Fixing the Deeper Issue

Well, if you call it a day now, there’s a good chance this will happen again later. After all, the only thing that will prevent this after the next deployment is someone remembering, “oh, yeah, we have to copy over that weird license file thing into that directory from the previous deploy.” I don’t know about you, but I don’t really want important system functionality hinging on “oh, yeah, that thing!”

What about a big, fat comment in the code? Something like “this method call will fail if license file xyz isn’t in abc directory”? Well, in a year, when everyone has forgotten this and there’s a new architect in town, that’ll at least save a headache the next time this issue occurs. But this is reactive. It has the advantage of not being purely tribal knowledge, but it doesn’t preemptively solve the problem. Another idea might be to trap error codes and throw an exception with a descriptive message, but that just shortens the gap between failure and resolution. I think we should try to avoid failing at all, though comments and better error trapping are certainly good ideas in addition to whatever better solution comes next.
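A rough sketch of that error-trapping idea might look like the following. The interface, the method name, and the error-code meanings are placeholders I’m assuming for illustration, since the post doesn’t name the actual vendor API:

```csharp
using System;

// Hypothetical stand-in for the third-party API; the real vendor type and
// method aren't named in the post, so these are illustrative only.
public interface IConverterApi
{
    // Returns 0 on success, or a nonzero error code on failure.
    int Convert(string sourcePath, string targetPath);
}

public static class ConverterGuard
{
    public static void ConvertOrThrow(IConverterApi api, string sourcePath, string targetPath)
    {
        int errorCode = api.Convert(sourcePath, targetPath);

        if (errorCode != 0)
        {
            // Fail fast and loudly instead of quietly ignoring the returned int.
            throw new InvalidOperationException(
                $"Third-party conversion failed with error code {errorCode}. " +
                "Error code 2 can mean \"call failure or license file missing\"; " +
                "check for the vendor's license file in the web root's bin folder.");
        }
    }
}
```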

What about checking the license file into source control and designing the build to copy it to that directory? Win, right? Well, right — it solves the problem. With the next deploy, the license file will be on the server, and that means this particular issue won’t occur in the future. So now it must be time to call it a day, right?

Still no, I’d argue. There’s work to be done, and it’s not easy or quick work. Because what needs to happen now is a move from a “delete the contents of the server directory and unzip the new deliverable” deployment to an automated build and deployment. What also needs to happen is a series of automated acceptance tests in a staging environment and possibly regression tests in a production environment. In this situation, not only are developers reading the code aware of the dependency on that license file, not only do failures happen quickly and obviously, and not only is the license file deployed to where it needs to go, but if anything ever goes wrong, automated notifications will occur immediately and allow the situation to be corrected.
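To make that a bit more concrete, here is a minimal sketch of the kind of automated post-deployment check I have in mind, run by the pipeline after each deploy. The paths, the health-check URL, and the exit-code convention are all assumptions for the sake of illustration:

```csharp
using System;
using System.IO;
using System.Net.Http;
using System.Threading.Tasks;

// Illustrative post-deployment verification step, run by the build pipeline
// after pushing to staging or production. Paths and URLs are placeholders.
public static class DeploymentVerification
{
    public static async Task<int> Main()
    {
        int failures = 0;

        // The license file that the integration quietly depends on.
        const string licensePath = @"C:\inetpub\wwwroot\MyApp\bin\vendor.lic";
        if (!File.Exists(licensePath))
        {
            Console.Error.WriteLine($"Missing license file: {licensePath}");
            failures++;
        }

        // A cheap, non-invasive request that exercises the third-party integration.
        using (var client = new HttpClient())
        {
            var response = await client.GetAsync("https://staging.example.com/health/integration");
            if (!response.IsSuccessStatusCode)
            {
                Console.Error.WriteLine(
                    $"Integration health check returned {(int)response.StatusCode}");
                failures++;
            }
        }

        // A nonzero exit code fails the pipeline and triggers a notification.
        return failures == 0 ? 0 : 1;
    }
}
```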

It may seem pretty intense to set all that up, but it’s the responsible thing to do. Along the spectrum of some “deployment maturity model,” tweaking things on the server and forgetting about it is whatever score is “least mature.” What I’m talking about is “most mature.” Will it take some doing to get there and probably some time? Absolutely. Does that mean that anything less can be good enough? Nope. Not in my opinion, anyway.

5 Comments
Gene Hughson
11 years ago

Very well put. IMO, “the deeper issue” is where we make our money. Too much focus on “make the error go away” without addressing the underlying issues leads to a constant cycle of breaks and fixes that leeches time away from productive work and contributes to building a big ball of mud.

Erik Dietrich
11 years ago
Reply to  Gene Hughson

I’m dealing with the aftermath of that pretty heavily right now. Death by a thousand cuts of “it’ll be quicker just to add this in manually for the time being.”

Jordan Whiteley
11 years ago

I’ve taken to creating ‘smoke tests’ on most of my services to address this issue. Basically, as soon as the service is up on its feet, I fire up a new thread that runs a quick query on all the used databases, picks up and plays with all the services, and verifies the existence of all flat files and directories that will be in use.

Better the program crashes 10 seconds after it gets on its feet because the database rejects your credentials than having it crash while I’m at home eating dinner and someone needs their reports.
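A bare-bones version of that kind of startup smoke test might look something like this; the connection string, file paths, and the decision to let the exception kill the process are assumptions, not details from the comment above:

```csharp
using System;
using System.Data.SqlClient;
using System.IO;
using System.Threading;

// Illustrative startup smoke test: fired on a background thread right after
// the service comes up, so misconfiguration surfaces in seconds, not at dinner time.
public static class StartupSmokeTest
{
    public static void RunInBackground()
    {
        var worker = new Thread(() =>
        {
            // Can we actually reach the database with the deployed credentials?
            using (var connection = new SqlConnection(
                "Server=prod-db;Database=Reports;Integrated Security=true"))
            {
                connection.Open();
            }

            // Are the flat files and directories the service depends on present?
            string[] requiredPaths =
            {
                @"C:\inetpub\wwwroot\MyApp\bin\vendor.lic",
                @"\\fileserver\exports"
            };

            foreach (string path in requiredPaths)
            {
                if (!File.Exists(path) && !Directory.Exists(path))
                {
                    throw new FileNotFoundException("Required dependency is missing", path);
                }
            }
        });

        // An unhandled exception on this thread will take the process down loudly.
        worker.IsBackground = true;
        worker.Start();
    }
}
```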

Erik Dietrich
11 years ago

I like this concept a lot, the more I think about it. My brain is starting to wrap around a “production ‘unit’ test concept” where I turn some non-invasive set of sanity checks loose on the production site first thing after deployment.

Jordan Whiteley
11 years ago
Reply to  Erik Dietrich

I cannot express to you the number of times that this has saved me from “User only has read rights to the error log,” “One of the connection strings is pointing at a staging machine,” “Network drive not mapped,” “Decryption requests are not allowed from this machine.” Also, you get many brownie points from your ops guys. It really saves their skin when they switch out hardware and a message pops up right away that says, “X isn’t working, it threw this exception: {0}, maybe go check Y.” They would much rather see that than to have…