Friday, January 14, 2011

An Object Lesson in Unsafe Changes

Any change is risky. Some changes are more risky and others, and some changes have more visible risk than others. But every single change is a risk.

What follows is a true story.

We were working on a change that was a little bit hairy, fixing a bug in the way we handled some data on upgrade. We were able to write a test that showed the problem, and to fix it on the mainline code. Great, this was humming along nicely. Next step: merge it to a branch so we could get the fix into the next release.

The trouble is, we couldn't merge the test because it depended on some things that simply weren't in the branch. The fix itself merged onto the branch just fine, but we needed to find a way to test the fix on the branch. "Easy," I said, "we have a tool that prints out this information that already exists on the branch. We can use this tool and compare the data before and after upgrade. It's a bit tedious, but it'll be fine to get through this release."

As it turns out, it was almost that easy. We weren't printing all the information, but we were printing about 80% of it. So we just had to add a little bit to the tool, and do the merge, and we were off and running. We did the merge, added the extra printouts to the tool, ran a couple of quick tests, and checked it in. Simple!

(Here is where you should be hearing ominous music.)

The next morning, I came in to discover that:
  • the machine that runs the automated nightly tests was spewing out of memory errors and swapping. A lot.
  • one of the testers on my team was staring at his screen saying, "huh. That's weird."
But the only thing that got checked in on this branch was our change. Oh dear.

It didn't take long to figure out that the "simple printing change" had a bug in it. This small bug caused us to print out 13 MB of empty spaces every time we ran the test program in a particular configuration. Repeat that across the number of times we use this test program and all of a sudden it's using huge amounts of memory. Oh, and it's causing the tester's terminal to look like a bad '80s programming movie, with the cursor racing down his terminal.

The fix was trivial. But it was a reminder that nothing is safe. All we did, after all, was print something incorrectly some of the time, and we wreaked a little bit of localized havoc!

Fortunately, there weren't any long-term effects. It was a simple bug that we found almost immediately and were able to fix quickly. Most of the team was never affected and it certainly never hit the field. It did reiterate, though, why every change is a risk.

Mea culpa.

No comments:

Post a Comment