Tuesday, December 9, 2008

Knowing "Normal"

One of the things I preach on is spending a lot of time looking at the system, particularly through the logs. Depending on your system, substitute something appropriate for "logs" - reports, database, network traces. The point is that you need to be looking under the covers.

Why do we look under the covers?
Immediate problems are the kind where you try something and it fails. For example, maybe you shouldn't have permission to delete a file but you can delete it anyway. These are generally fairly easy. It's really just a matter of identifying the relevant system state and then performing the failing action.

Then there are indirect problems (I've also heard them called second-order issues). These are the things that go wrong only because some other thing or things happened. So you can be in the same state multiple times and depending on what else has happened, your issue may or may not reproduce. It depends on whether you've triggered a time bomb or not. Figuring out what else has happened is the trick here, and that almost always requires that you look under the covers. Your answer, or at least the path to your answer, is in the logs (or network traces, or whatever).

Great! What are we looking for?
Here we're looking for one of two things:
  • A pattern. The pattern might be subtle - some sequence of actions or states - or not, but keep in mind that not all system actions are user actions.
  • Anything that isn't normal. Something out of the ordinary happening - garbage collection earlier than it usually does, or maybe a database index not present when it usually is - can be a a big clue to the problem. These are often subtle, but tracking them down is usually worthwhile.
Yes, normal. If I'm looking for things that are not normal, I have to know what normal is for my system. I'm not going to notice that a log rolled 5 minutes early if I don't know how often logs usually roll. I'm not going to notice that my checkpoint was late if I don't know they happen every hour at 1 minute 28 seconds after the hour. I'm looking for deviations from normal, and I'm only going to know what the deviations are if I know what normal really is.

This is where your notes come in. Remember how we talked about noticing with purpose? Here's where you haul those notes you took out and start looking for a pattern. We're getting under the covers and looking to find out what patterns exist in our system and what patterns matter.

The single best way I know to start to see these patterns is simply to live in the logs (or network traces, or whatever) for a while. And then talk about it:
  •  Explain what's going on in a section of the log to yourself or to someone else. Do it out loud so you can't gloss over something.
  • Follow a thread or process and figure out it's periodicity. Repeat with another thread or process.
  • Do an action and describe everything that happens with that action. Do this one out loud, too.
Eventually you'll start to understand the system as it typically behaves. Once you have a feeling for this "normal", then and only then can you see deviance from pattern. So find your normal, and then go find your issues.

No comments:

Post a Comment