Wednesday, April 16, 2008

Working Definitions of Failure

Several failure terms get thrown around rather loosely, which is more than a little imprecise. All of these terms have general definitions, but let's be more precise than that: each one needs a working definition, refined for your environment.

So, let's define some failure types:
  • Abort. Abort means some process caused your program to stop. The run ended in neither a failure state nor a success state; the program (or process) simply stopped. For example: on Windows, IOMeter will abort if the underlying mapped drive goes away; it doesn't hang and it doesn't shut down. It simply stops the test.
  • Fail. Failing means that something was asserted and got an unexpected value. This is trying and failing. Failure is an active state.
  • Hang. Hang means that the program simply stops doing anything; no perceivable action is taken. This one is very dependent on your application - loading a web page taking half an hour is probably a hang; copying a 1 TB file taking half an hour is definitely not a hang. At my current place, we almost never say hang. We say "nonresponsive for time X" or "has been performing action A for time X, which is unexpectedly long". This is mostly because we have a lot of long-running operations. 
  • Takes a Long Time To. "Takes a long time to" means that you expected some operation to take duration X and it took significantly longer. Precision is better here: "took 20% longer than expected", for example. Sometimes you can't have precision, though; in that case, allow at least 25% over the expected duration before you start saying that something is taking a long time. If it's under 25% longer than you expect, say "longer than expected", which sounds less judgemental!
  • Crash. Crash means that the program you're running stops and is no longer running. If you do a process listing (ps aux or whatever), you don't see the program any more after a crash. Don't say crash unless you mean crash; that's a word that is likely to cause a bit of panic.
  • Error. Error means that something didn't happen as expected, or happened and was not expected. An error is not necessarily a failure; it's not specifically tied to an assertion. An example of an error is a test that dies in the setup method.
In the end, all of these are just words. Two things matter when using these words: precision and consideration. Use these words precisely so no one goes and chases a red herring. Use these words carefully because flinging them around only makes you look judgemental.
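
If you want the 25% rule of thumb for "takes a long time to" applied consistently in reports, something like the following sketch could pick the wording for you. The function name, the parameter names, and the "within expected time" phrase are assumptions for illustration; only the 25% threshold and the two "longer" phrasings come from the definitions above.

```python
def describe_duration(expected_s, actual_s, threshold=0.25):
    """Return wording for how an operation's duration compares to expectation.

    threshold is the fractional overrun (0.25 = 25%) at which "longer than
    expected" becomes "takes a long time".
    """
    if expected_s <= 0:
        raise ValueError("expected_s must be positive")
    overrun = (actual_s - expected_s) / expected_s
    if overrun <= 0:
        # Hypothetical wording for the no-overrun case.
        return "within expected time"
    pct = round(overrun * 100)
    if overrun < threshold:
        return f"longer than expected ({pct}% over)"
    return f"takes a long time ({pct}% over)"


print(describe_duration(60, 66))  # longer than expected (10% over)
print(describe_duration(60, 90))  # takes a long time (50% over)
```

Including the percentage keeps the report precise even when the label itself is a judgement call, and the threshold can be adjusted for environments with lots of long-running operations.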

