Thursday, October 8, 2009

Magic Words

Just like there are some magic numbers that can point you to the source of a problem, there are some magic words that also have power. The words you use to describe a problem will color how other engineers (developers, other testers, managers, even you) look at the problem and what they do to track it down.

Just like with the numbers, some of these are common to many systems:
  • Crash. Most people will be looking for a core file and an uncontrolled shutdown in this case.
  • Hang. Often you'll get asked how long before you mark something as being hung. Also, this word means that absolutely no progress is being made. If it's progressing at a snail's pace, it's still not a hang; it's just really slow.
  • Failed. This one means completed with error. It doesn't mean "still going".
  • Wedged. This is imprecise but generally seems to make people think deadlock.
Other magic words are more specific to your product. For example, in our product we discuss:
  • Data loss. The system could lose data in this scenario (YIKES!). This one raises red flags all over the place.
  • Failure. This word means system change that resulted in the loss of availability of a system component. Generally its used for hardware failure (of a disk, node, power supply, network connection, etc). Generally "failure" does is modified by some component name (e.g., power supply failure), and is not used for the more general "software didn't work" case.
  • Simultaneous Failure Vs Sequential Failure. In a redundant self-healing system like ours, number of failures (see above) matters, as does whether the second one occurred at the same time as the first or after. Depending on which one happened, a whole different debug path will be invoked.
What other magic words are there? I'm pretty sure I've missed some.

2 comments:

  1. How about "blocked"? We only use that one to indicate that a failure is holding up other work.

    ReplyDelete