When you're looking at an easily reproducible problem in-house, solving it is generally fairly straightforward. Simply keep tweaking logs, configurations, gdb sessions, etc. and reproducing the issue until you have the information you need to solve the problem. When the problem is on a system you can only use indirectly - say, a customer system - it gets a lot harder, and experimentation is often out of the question.
One of the techniques I like to use here is what I call Time-Based Event Modeling.
In a nutshell, time-based event modeling means constructing a timeline of events, then building more and more detailed micromodels of those events until you have a plausible handle on what the problem is. This technique assumes you have an "event" - a system crash, performance slowdown, inability to use some feature, etc. Typically, this is what the customer is calling you about, claiming to have found a bug. It further assumes the problem will require some analysis; it isn't straightforward.
Background and Assumptions
Most systems work on patterns. The user does X, which triggers Y. There are also background processes and events - things that the system initiates on a regular basis without end-user intervention. Sometimes problems are caused by the interaction of those patterns and of those events. Doing one thing isn't an issue, and a background process isn't a problem, but when the two occur simultaneously then you get unexpected behavior.
Your goal then, with time-based event modeling, is to find the rhythm of those patterns and see how they interacted. Once you can see the rhythm of the system as a whole, and all its moving pieces, then you can see the break in the pattern and find the event.
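To make "finding the rhythm" concrete, here is a minimal sketch of spotting a break in one recurring pattern. It assumes you have already pulled the timestamps of a single repeating event (say, a cleanup job) out of the logs. The function name find_rhythm_breaks, the 50% tolerance, and the sample times are all made up for illustration.

```python
from datetime import datetime
from statistics import median

def find_rhythm_breaks(timestamps, tolerance=0.5):
    """Return (previous, current, gap_seconds) for every gap between
    consecutive events that differs from the median gap by more than
    `tolerance` (expressed as a fraction of the median gap)."""
    ordered = sorted(timestamps)
    pairs = list(zip(ordered, ordered[1:]))
    gaps = [(later - earlier).total_seconds() for earlier, later in pairs]
    if not gaps:
        return []
    typical = median(gaps)  # the "rhythm" of this pattern
    return [
        (earlier, later, gap)
        for (earlier, later), gap in zip(pairs, gaps)
        if abs(gap - typical) > tolerance * typical
    ]

# Hypothetical data: a cleanup job that normally runs every 10 minutes,
# with the 02:30 run missing.
cleanup_runs = [
    datetime(2024, 1, 15, 2, 0),
    datetime(2024, 1, 15, 2, 10),
    datetime(2024, 1, 15, 2, 20),
    datetime(2024, 1, 15, 2, 40),
    datetime(2024, 1, 15, 2, 50),
]

breaks = find_rhythm_breaks(cleanup_runs)
for earlier, later, gap in breaks:
    print(f"Rhythm broken between {earlier:%H:%M} and {later:%H:%M} ({gap:.0f}s gap)")
```

The median is used rather than the mean so that one very long gap doesn't distort what counts as "normal". A real system will have several patterns with different periods, so you would run something like this once per pattern.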
So What Do We Do?
- Construct a timeline of user-visible events at the highest level. This is what your user knows, and it's a good place to start. Literally write this out with times and dates.
- Construct an underlying user-driven timeline. These are the things your system does that were caused by the user actions. Put these on a timeline with the same granularity as the first one you make so you can see what the system is doing.
- Add other system patterns. These are cleanup processes, log file rolls, etc. Put these on a third timeline with the same granularity as the first two.
- Overlay the three timelines you made. Look at the cycles, and seek out intersections of patterns. Is there some sequence of events that occurs together only occasionally? How does that relate to what the user sees?
- Work out what was special about the system at that time. Did something change? Is there some sequence of events that all intersect at the time of a problem? Did anything take an unusually long time with no obvious explanation?
- Look at what patterns failed. Look at the patterns around the time of the problem. Did those patterns repeat sometime when the problem did NOT occur?
- When you find circumstances unique to the problem area, then start digging - there's where your problem is hiding.
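The steps above can be sketched in a few lines of code. This is only an illustration under some assumptions: events have already been extracted from logs into (time, source, description) records, the three timelines are plain lists, and "intersecting" means "within a fixed window of each other". The names Event, overlay, events_near, and unique_to_problem, and all of the sample data, are hypothetical.

```python
from datetime import datetime, timedelta
from typing import NamedTuple

class Event(NamedTuple):
    when: datetime
    source: str  # "user", "system", or "background"
    what: str

def overlay(*timelines):
    """Merge several timelines into one list ordered by time."""
    return sorted((e for t in timelines for e in t), key=lambda e: e.when)

def events_near(merged, moment, window):
    """All events within `window` of `moment`, on either side."""
    return [e for e in merged if abs(e.when - moment) <= window]

def unique_to_problem(merged, problem_time, good_times, window):
    """Kinds of events seen near the problem but near none of the
    times when the same user action worked fine."""
    def kinds(moment):
        return {(e.source, e.what) for e in events_near(merged, moment, window)}
    seen_when_ok = set()
    for t in good_times:
        seen_when_ok |= kinds(t)
    return kinds(problem_time) - seen_when_ok

day = datetime(2024, 1, 15)
# Timeline 1: what the user did (the 11:00 save is the one that hung).
user = [Event(day.replace(hour=h), "user", "save report") for h in (9, 10, 11)]
# Timeline 2: what the system did in response.
system = [Event(day.replace(hour=h, second=2), "system", "write report file")
          for h in (9, 10, 11)]
# Timeline 3: background activity.
background = [
    Event(day.replace(hour=9, minute=30), "background", "log roll"),
    Event(day.replace(hour=10, minute=30), "background", "log roll"),
    Event(day.replace(hour=10, minute=59, second=50), "background", "index rebuild"),
]

merged = overlay(user, system, background)
suspects = unique_to_problem(
    merged,
    problem_time=day.replace(hour=11),
    good_times=[day.replace(hour=9), day.replace(hour=10)],
    window=timedelta(minutes=1),
)
print(suspects)
```

With this sample data, the only thing that appears near the failed save and not near the successful ones is the background index rebuild, so that is where you would start digging. The window size is a judgment call: too narrow and you miss slow interactions, too wide and everything looks like it intersects.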
When to Use This Technique
This is a good technique when you have a non-reproducible set of conditions, or when you can't find a smoking gun. When you suspect a race condition, gradual deterioration of the system over time, or the interaction of numerous components, consider using time-based event modeling.
When Not to Use This Technique
Don't bother with time-based event modeling if the problem is straightforward. For example, if the problem is "I can't burn this CD", and the user is trying to write to a non-writeable CD, then you don't need any complex techniques to figure out what the problem is.