Friday, May 30, 2008

Focus and Flailing Are Both Okay

Bear with me; context matters for this one:

We had a problem at work with some behavior whose cause we couldn't pin down. The behavior itself was fairly clear - performance for a certain file size and write pattern degraded over time. The problem was getting it to happen in a test environment so we could poke at it and fix it! At this point, understanding and codifying the behavior is important, as is replicating the problem environment as closely as possible. The plan here is quite simple: make your environment more and more like the problem environment, and eventually you will see the problem.

This is a very focused state. What you're doing here is actually straightforward. Find some way in which your test environment differs from the problem environment, and change it. Start with the core of your system and work your way out, getting more and more similar along the way.
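To make that concrete, here's a minimal sketch of that convergence loop - in Python, with entirely made-up settings. The idea is just to describe both environments as key-value settings, then walk the differences from the core of the system outward:

```python
# Hypothetical sketch: each environment summarized as a dict of settings.
# The keys and values here are invented for illustration.

test_env = {
    "kernel": "2.6.18",
    "filesystem": "ext3",
    "disk_layout": "single",
    "write_cache": "on",
}

problem_env = {
    "kernel": "2.6.24",
    "filesystem": "ext3",
    "disk_layout": "raid5",
    "write_cache": "off",
}

# Work from the core of the system outward: list the differences in the
# order you intend to eliminate them.
priority = ["kernel", "filesystem", "disk_layout", "write_cache"]

for key in priority:
    if test_env.get(key) != problem_env.get(key):
        print(f"differs: {key}: {test_env.get(key)!r} -> {problem_env.get(key)!r}")
```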


Now we have a different problem. We can make the behavior happen (hooray!), but we have no idea what's going on. Without knowing what's actually important, we just start stressing the different factors. There's a plethora of possibilities, so we take our best guess at what might be a factor and start changing things.

In this phase, documentation is incredibly important. You need a record of what you've tried and what effect it had. In general, change one thing at a time; only when that fails should you start changing things in groups.

This is the flailing state. Without a lot of guidance from the problem, you'll be trying lots of different ideas in lots of different areas. Which tests you do in what order is a fairly arbitrary choice; what matters here is gathering data. Feeling like you're flailing is okay, as long as you're doing one thing at a time, writing down what you do, and writing down the effects it has. Flail away, just do it with docs.
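For what it's worth, the log doesn't need to be fancy. Here's a hedged sketch of the kind of record that's enough: one row per single-variable experiment, recording the change, the expected effect, the observed effect, and how to undo it. The field names and the example entry are made up.

```python
# Minimal "flail with docs" experiment log: one change per entry.
import csv
import os
from datetime import datetime

LOG_FILE = "experiment_log.csv"
FIELDS = ["timestamp", "change", "expected", "observed", "revert_how"]

def log_experiment(change, expected, observed, revert_how):
    """Append one single-variable experiment to the log."""
    new_file = not os.path.exists(LOG_FILE)
    with open(LOG_FILE, "a", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS)
        if new_file:  # first entry: write a header row
            writer.writeheader()
        writer.writerow({
            "timestamp": datetime.now().isoformat(),
            "change": change,
            "expected": expected,
            "observed": observed,
            "revert_how": revert_how,
        })

# Hypothetical entry, matching the one-thing-at-a-time rule:
log_experiment(
    change="disabled write cache on test array",
    expected="throughput degrades over time, matching problem env",
    observed="no degradation after a 2-hour run",
    revert_how="re-enable write cache on the array",
)
```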


At some point, you catch a break. Through hard work, a lot of tests, research, debug logs, Hail Mary passes, etc.,* we figured out what the stressor was. And it was a change between the version that didn't show the problem and the version that did. Revert this one change, and system performance goes right back to where it was.

But, and this is important: you are not done yet.

All of a sudden the distinction between interesting and useful gets important. There are lots of things to know: when was this change introduced? Does making this change on the last good version of the software cause us to see the break (thus implying this really is the only aspect of the problem)? Do we have to make this change just in the one area we tried, or in more areas? How does it affect performance on other file sizes? Other write patterns? There's a slew of good ideas and interesting things to test. But our goal here is not to characterize every aspect of the system; our goal is to resolve this one specific issue. So if it's not about this file size, this write pattern, and this hardware, it's interesting, but it's not important (not yet, anyway).
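As an aside, the first question on that list - when was this change introduced? - is a textbook binary search over the version history. A rough sketch, where reproduce() stands in for whatever test reliably shows the problem:

```python
def reproduce(version):
    """Build/deploy `version` and run the repro test.
    Returns True if the degradation appears. (Hypothetical stub.)"""
    raise NotImplementedError

def first_bad(versions):
    """versions is ordered oldest -> newest, with versions[0] known good
    and versions[-1] known bad. Returns the first version that shows the
    problem, using O(log n) repro runs instead of one per version."""
    lo, hi = 0, len(versions) - 1   # lo is known good, hi is known bad
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if reproduce(versions[mid]):
            hi = mid                # problem present: first bad is at or before mid
        else:
            lo = mid                # problem absent: first bad is after mid
    return versions[hi]

# e.g. first_bad(["v1.0", "v1.1", "v1.2", "v1.3"]) -> the earliest bad build
```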

This is focused and prioritized. You're still running a lot of tests, but you need to be brutally focused on what you are testing here. All you're doing is standing on your problem definition and mapping its limits. The minute you step away from the problem definition, you've lost focus and you're not helping fix the problem anymore.


Long story short, you can often tell where you are in the path to diagnosing and fixing an issue by the types of tests you're doing. So ask yourself whether you're flailing around or whether you have a focus and a reason to do each thing you're doing. Then ask yourself whether that's the right stage to be in, given what you know about the problem. Remember, flailing is okay, and a strong sense of focus is okay - you just have to use each at the right time.


* In our specific case, it was literally a dream that someone had - very cliché, but effective!
