Wednesday, July 2, 2008

Time Bombs

In some parts of a product, you know immediately when something fails: you click a button and the system throws an error, for example. Other failures are trickier; something breaks and you may not know it for a while. Congratulations, you've hit a time bomb.

There are two kinds of time bombs: those caused by something in the product, and those caused by something outside the product.

Product-Based Time Bombs
Often, time bombs inside the product live in secondary processes that aren't essential to the basic function of the system. Think monitoring processes, logging processes, cleanup processes, etc. These are things that can fail and the system behavior won't be any different... on the surface.

Noticing a product-based time bomb before it causes a problem is a matter of proactively looking at logs and finding the clues. Sometimes there's an error or warning with no visible effect; sometimes a process that should be running isn't. The symptoms vary by feature.
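One way to catch the "process that should be running isn't" case is to have each background process record a heartbeat and to check those heartbeats for staleness. Here is a minimal sketch of that idea. The process names, the timestamps, and the 24-hour threshold are all made up for illustration; where the heartbeats actually come from (a log line, a file's modification time, a database row) depends on your product.

```python
from datetime import datetime, timedelta

def stale_processes(last_seen, now, max_age):
    """Return the names of processes whose last heartbeat is older than max_age."""
    return sorted(name for name, seen in last_seen.items() if now - seen > max_age)

# Hypothetical heartbeat data: process name -> time it last reported in.
now = datetime(2008, 7, 2, 12, 0)
last_seen = {
    "log_rotation": datetime(2008, 7, 2, 11, 55),
    "disk_cleanup": datetime(2008, 6, 20, 3, 0),  # silent for almost two weeks
}

print(stale_processes(last_seen, now, timedelta(hours=24)))
```

Run on a schedule, a check like this turns "the cleanup job quietly stopped" into something you hear about before the disk fills.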

Environment-Based Time Bombs
These are the really fun ones. Something outside your product changes and causes a problem down the line. Culprits here include DNS or other network changes, and changes in any third-party systems with which your system interacts.

These are particularly tricky because there will likely be little to no indication of a problem in the logs. Your best bet here is to cultivate a good relationship with those who control these areas and make sure they let you know when something happens.
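As a backstop for the times nobody tells you, you can record what the environment looked like when things last worked and compare it against what it looks like now. The sketch below only shows the comparison step. The host names and addresses are invented, and how you collect a snapshot is an assumption left open here (for DNS, Python's socket.gethostbyname is one option).

```python
def diff_snapshots(old, new):
    """Return {key: (old_value, new_value)} for every key that changed,
    appeared, or disappeared between two snapshots."""
    changes = {}
    for key in sorted(set(old) | set(new)):
        if old.get(key) != new.get(key):
            changes[key] = (old.get(key), new.get(key))
    return changes

# Hypothetical snapshots: host name -> resolved address (None means missing).
last_week = {"db.example.com": "192.0.2.10", "mail.example.com": "192.0.2.20"}
today = {"db.example.com": "192.0.2.11", "mail.example.com": "192.0.2.20"}

for host, (before, after) in diff_snapshots(last_week, today).items():
    print(f"{host}: {before} -> {after}")
```

This doesn't replace the relationship with the people who own those systems, but it gives you a clue in the one place you control.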

Finding a Time Bomb
The time to think of time bombs is when you're faced with an error or failure that doesn't seem to have a proximate cause. This is particularly true if it's something that depends on a feature or process that doesn't get touched very often. Think of a time bomb if you have:
  • A full disk
  • A failed log parse
  • Difficulties talking to machines that are up
  • A monitored component that fails without producing the expected result (notification, restart, takeover, etc.)

When you suspect a time bomb, there's really only one way to figure it out: find out when the background process last ran successfully, then look at what failed or changed after that point.
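If the background process leaves any record of its runs, that search can be done mechanically. This is a rough sketch, assuming you can reduce the record to a time-ordered list of (timestamp, succeeded) pairs; the run history below is invented for illustration.

```python
from datetime import datetime

def bracket_failure(runs):
    """Given (timestamp, ok) pairs sorted by time, return a tuple of
    (last successful run, first failed run after it). Either may be None."""
    last_ok = None
    first_bad = None
    for ts, ok in runs:
        if ok:
            last_ok = ts
            first_bad = None  # a later success resets the search
        elif first_bad is None:
            first_bad = ts
    return last_ok, first_bad

# Hypothetical history of a nightly cleanup job.
runs = [
    (datetime(2008, 6, 18, 3, 0), True),
    (datetime(2008, 6, 19, 3, 0), True),
    (datetime(2008, 6, 20, 3, 0), False),
    (datetime(2008, 6, 21, 3, 0), False),
]

last_ok, first_bad = bracket_failure(runs)
print("last success:", last_ok)
print("first failure:", first_bad)
```

The window between those two timestamps is where to look for whatever changed, inside the product or outside it.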

And when you have found your time bomb, make sure your fix both prevents the failure and makes any future failure shout more loudly, so you notice it... before the bomb goes off.
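What "shout more loudly" looks like depends on the system, but one common shape is a wrapper that refuses to let a background task fail quietly. The following is only a sketch: run_loudly, the task, and the notify callback are hypothetical names, and in practice notify would be whatever channel someone actually watches (a pager, an email, a dashboard).

```python
import logging

def run_loudly(name, task, notify):
    """Run a background task; on failure, log the traceback and notify someone.
    Returns True if the task succeeded, False otherwise."""
    try:
        task()
        return True
    except Exception as exc:
        logging.getLogger(name).exception("background task failed")
        notify(f"{name} failed: {exc}")
        return False

# Hypothetical task that fails, and a notifier that just collects messages.
def cleanup():
    raise RuntimeError("disk full")

alerts = []
succeeded = run_loudly("cleanup", cleanup, alerts.append)
print(succeeded, alerts)
```

The important design choice is that the failure goes somewhere other than a log nobody reads.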
