Tuesday, April 28, 2009

Outside In

When faced with a problem or potential bug, the logical first step is to reproduce it. Simple, right? Just redo the thing you were doing when it happened!

Sometimes that's harder than it seems. When a bug is not simple to reproduce, there are a couple of ways to approach it. Fundamentally, they break down into approaching the problem from the outside in versus approaching the problem from the inside out.

Working from the inside out is my usual approach. I start as simple as possible, and keep adding circumstances until the bug reproduces. For example, if the program crashed when I clicked "Add Node", I would do this on a clean system:
  • click "Add Node" (shoot! didn't happen)
  • make sure I have the same number of nodes already in the system, then click (shoot!)
  • make sure I have the same node names, then repeat all of the above
  • ...
  • ...
  • ...
  • make sure event log collection is going on for that node type, with the same number of nodes of that type, then click (Got it!)
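The inside-out loop above can be sketched in a few lines of Python. This is only an illustration: the condition names and the `fake_reproduces` check are hypothetical stand-ins for real setup steps and a real "click Add Node and see if it crashes" test.

```python
from typing import Callable, List, Optional, Sequence


def find_reproducing_setup(
    conditions: Sequence[str],
    reproduces: Callable[[List[str]], bool],
) -> Optional[List[str]]:
    """Start from a clean system and add one condition at a time.

    Returns the first cumulative set of conditions that triggers the bug,
    or None if the bug never shows up.
    """
    active: List[str] = []
    if reproduces(active):  # the bare click on a clean system
        return list(active)
    for condition in conditions:
        active.append(condition)
        if reproduces(active):
            return list(active)
    return None


# Hypothetical conditions, ordered from simplest to most involved.
CONDITIONS = ["same_node_count", "same_node_names", "event_log_collection"]


def fake_reproduces(active: List[str]) -> bool:
    # Stand-in for the real check: pretend the crash needs log collection.
    return "event_log_collection" in active


print(find_reproducing_setup(CONDITIONS, fake_reproduces))
```

The result is the full set of circumstances in place when the crash first appeared, not a minimal one; trimming it down is a separate exercise.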
This often works, but sometimes there's a mysterious something going on and you just can't pin down the problem this way. When going inside out doesn't work, consider going outside in.

Finding a problem from the outside means that you don't just blindly do what you did before. Start with the analysis, and then reproduce. It will feel backwards at first, but it becomes more natural over time.

The basic steps of an outside-in analysis are nothing you wouldn't do anyway; only the order is different.
  • Identify the proximate failure point. With a crash that's pretty easy. Other times it may be a bit more difficult, but we're not looking for the root cause, only the immediate problem you saw.
  • Look through the logs to find the failure point. Once you know what your proximate failure is, go find that point. In most logs, you're just going to look right at the time it happened. Any errors? What was it doing? If it crashed, are there cores or assertion failures?
  • Figure out why that failure happened. What was missing? What extra something was there? What does the code say? Is there a workflow or sequence here? A dependency? You're only looking to go one step back.
  • Repeat. Keep working your way into the problem, one step at a time. Maybe your crash occurred because something was nil and shouldn't have been. Once you know that, go find what was nil. Once you know that, go find why it was nil.
Eventually you'll start to understand the problem. You may or may not make it all the way to root cause here. Either way, having worked through the failure chain, you can characterize much more cleanly what circumstances are required to get the error to occur. That makes reproducing the problem more reliable (no one's saying it will be easy, though; sometimes the setup is difficult!), and it gives you more confidence in the eventual fix.
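The "look right at the time it happened" step can be sketched as a small log filter. This assumes each log line starts with a `YYYY-MM-DD HH:MM:SS` timestamp, which your logs may not; the sample lines and the 30-second window are invented for illustration.

```python
from datetime import datetime, timedelta
from typing import Iterable, List

TIMESTAMP_FORMAT = "%Y-%m-%d %H:%M:%S"  # assumed log format
TIMESTAMP_LENGTH = 19  # length of "2009-04-28 10:15:02"


def lines_before_failure(
    log_lines: Iterable[str],
    failure_time: datetime,
    window: timedelta = timedelta(seconds=30),
) -> List[str]:
    """Return the log lines stamped within `window` up to the failure time."""
    start = failure_time - window
    selected = []
    for line in log_lines:
        try:
            stamp = datetime.strptime(line[:TIMESTAMP_LENGTH], TIMESTAMP_FORMAT)
        except ValueError:
            continue  # no leading timestamp (e.g. a stack trace line); skipped
        if start <= stamp <= failure_time:
            selected.append(line)
    return selected


# Invented sample log, only to show the shape of the output.
SAMPLE_LOG = [
    "2009-04-28 10:14:01 INFO starting event log collection",
    "2009-04-28 10:14:50 WARN node list is nil",
    "2009-04-28 10:15:02 ERROR assertion failed in add_node",
    "2009-04-28 10:16:00 INFO restarted",
]

for line in lines_before_failure(SAMPLE_LOG, datetime(2009, 4, 28, 10, 15, 2)):
    print(line)
```

Each line that comes back is a candidate for the next step inward: pick the most suspicious one, treat it as the new failure point, and look again.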

It's easy to form habits sometimes: "I found something. I should reproduce it." When that doesn't work, or when it's not giving you the results you need, consider your alternatives. Sometimes you need to turn your plans on their head - and work from the outside in. 
