Thursday, March 18, 2010

Patching the Problem Cause

What can we do to make sure this never happens again?

Usually this question gets asked after something bad and embarrassing happens. Maybe it's a problem at a customer site, or a release you have to retract because it just doesn't work in the field. Either way, you're now embarrassed, you have fixed the immediate problem, and you're looking to avoid being embarrassed again.

So you ask the question: "What can we do to make sure this never happens again?"

On the surface it sounds like a good question. Something went wrong, and it went wrong in a way that was more public than you'd like. How do we make sure that this doesn't go wrong in public again?

But... it may not be the right question to ask. It runs the risk of fighting yesterday's war and focusing too narrowly on the specific problem you just had. After all, this particular problem is already fixed. We have a clear underlying goal: to avoid being embarrassed in this way again. The real challenge is not just to fix this problem but to prevent similar problems, too. Let's look at what kinds of things we may need to do to accomplish this goal.

  • Find the hole that let this out and close it. Do we have a gap in our regression tests? Are we squeezing the time for design, implementation, or test, and is this a corner that got cut? This root cause analysis is essential to prevent similar-but-not-quite-the-same problems from occurring.
  • Add a test for this particular problem so we don't regress. This may or may not be worth the cost of the test (in time and in resources). Before we go running off to do this, it behooves us to consider whether we are prone to regressions and whether this is the type of problem a test can actually catch. Memory leak in a component? It probably makes sense to run that component under Valgrind to detect memory leaks before we ship. Design flaw? A test is probably not the right place to fix it; it's way too late by then.
  • Look for other instances of the same problem. Sure, we found this problem, but what related or similar problem might we have missed? Maybe we found the SQL injection flaw in the signup form; that means it might be a good idea to look for the same flaw in the login form. Depending on the root cause of the problem, this may mean looking at a lot of other things or very little extra work.
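
To make the SQL injection example concrete, here is a minimal sketch in Python using the standard sqlite3 module. Everything in it is hypothetical and for illustration only: the users table, the login_unsafe and login_safe functions, and the check_login_rejects_injection helper are assumptions, not code from any real signup or login form. The idea is that one check, written once, can be pointed at every form that builds a query from user input, which covers both the "add a test" and the "look for other instances" steps.

```python
import sqlite3

# Classic injection payload; hypothetical input an attacker might type.
INJECTION = "' OR '1'='1"


def make_db():
    """Create a throwaway in-memory database with one (hypothetical) user."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users (name TEXT, password TEXT)")
    conn.execute("INSERT INTO users VALUES ('alice', 's3cret')")
    return conn


def login_unsafe(conn, name, password):
    # Vulnerable: user input is pasted directly into the SQL text.
    query = (
        "SELECT COUNT(*) FROM users "
        "WHERE name = '%s' AND password = '%s'" % (name, password)
    )
    return conn.execute(query).fetchone()[0] > 0


def login_safe(conn, name, password):
    # Parameterized: the driver passes the values separately from the SQL.
    query = "SELECT COUNT(*) FROM users WHERE name = ? AND password = ?"
    return conn.execute(query, (name, password)).fetchone()[0] > 0


def check_login_rejects_injection(login):
    """Return True if the given login function refuses the injection payload."""
    conn = make_db()
    try:
        return not login(conn, "alice", INJECTION)
    finally:
        conn.close()


if __name__ == "__main__":
    # Run the same check against every form-handling function we know about.
    for login in (login_unsafe, login_safe):
        ok = check_login_rejects_injection(login)
        print(login.__name__, "OK" if ok else "VULNERABLE")
```

A check like this only guards against the pattern it knows about. It does not replace asking why string-built queries got through design and review in the first place.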

It's understandable to want to prevent something bad from ever happening again. However, just patching the cause of the problem is not sufficient, so avoid the knee-jerk reaction and do some root cause analysis on the reason the problem came into existence, not just the problem itself. This will give you a better chance of preventing similar problems, and help make sure the problem you just had doesn't come back, either.

1 comment:

  1. Usually this gets addressed by the "Root Cause Analysis" or by the "Project/Product Post Mortems"...