Wednesday, October 7, 2009

Magic Numbers

There are some numbers that I call magic numbers. These are the special numbers that have meaning in the context of your system. They are typically diagnostically important: they act as triggers that help you identify problems or potential problems.

Some of these are common on many systems:
  • 86400 seconds: As in, "the test timed out after 86400 seconds". This is a day. As in, the darned thing didn't finish up in a whole day. Oops.
  • 2^32 or 2^32 - 1 (4294967296 or 4294967295): If you're on a 32-bit system, you're starting to wrap address space here, and unsigned 32-bit values roll over. Signed 32-bit values wrap earlier, just past 2^31 - 1 (2147483647), so look for really large negative numbers where you're expecting positives, etc.
  • 45 seconds: the default connection timeout for network mounts in Windows. If something fails after this long, you're probably looking at a timeout.
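
To make the first two concrete, here is a small Python sketch of the arithmetic. The helper name as_signed_32 is mine, made up for illustration; it just reinterprets the low 32 bits of a number as a signed two's-complement value, which is roughly what happens when a 32-bit counter overflows.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86400, the "whole day" timeout


def as_signed_32(value: int) -> int:
    """Reinterpret the low 32 bits of value as a signed two's-complement integer."""
    value &= 0xFFFFFFFF  # keep only the low 32 bits
    return value - 2**32 if value >= 2**31 else value


if __name__ == "__main__":
    print(SECONDS_PER_DAY)          # 86400
    print(as_signed_32(2**31 - 1))  # 2147483647, the largest signed 32-bit value
    print(as_signed_32(2**31))      # -2147483648, one step further and it goes negative
    print(as_signed_32(2**32 - 1))  # -1
```

If a log shows something like -2147483648 or 4294967295 where a small positive count belongs, a 32-bit wrap is a reasonable first guess.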

Some of these are specific to your system. For example, in our system we know:
  • 6 minutes: The RPC timeout for a query to a remote system is 2 minutes, and it does 3 tries. 6 minute waits mean you're hitting this.
  • 5 minutes: The frequency of heartbeat checks for certain operations. Failover after 5 min means these never succeeded.
  • 25: default max number of simultaneous connections. Can be increased indefinitely, but if you start to see slowdowns or connection timeouts and you have 25 clients in use, you're probably going to want to change it.
  • XX: Java heap size. (I don't remember this one off the top of my head, but I know it when I see it)
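
The 6 minute number falls straight out of the retry arithmetic. Here is a rough sketch of how that might look, assuming each try runs to its full timeout with no backoff in between. The function names and the retry loop are hypothetical, not our actual RPC code.

```python
RPC_TIMEOUT_SECONDS = 2 * 60  # per-try timeout, from the list above
RPC_MAX_TRIES = 3             # number of tries, from the list above


def worst_case_wait(timeout_seconds: int, tries: int) -> int:
    """Total wait in seconds if every try runs to its full timeout."""
    return timeout_seconds * tries


def call_with_retries(rpc, timeout_seconds: int, tries: int):
    """Hypothetical retry loop: call rpc(timeout_seconds) up to `tries` times."""
    waited = 0
    for _ in range(tries):
        try:
            return rpc(timeout_seconds)
        except TimeoutError:
            waited += timeout_seconds  # this try used up its whole timeout
    raise TimeoutError(f"gave up after {waited} seconds")


def always_times_out(timeout_seconds: int):
    """Stand-in for a remote system that never answers."""
    raise TimeoutError


if __name__ == "__main__":
    print(worst_case_wait(RPC_TIMEOUT_SECONDS, RPC_MAX_TRIES))  # 360
    try:
        call_with_retries(always_times_out, RPC_TIMEOUT_SECONDS, RPC_MAX_TRIES)
    except TimeoutError as err:
        print(err)  # gave up after 360 seconds
```
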

What is interesting is not simply knowing the numbers. What is interesting is the shorthand debugging that they offer you. For example, if support calls up and says that a customer is complaining that the management functions are "very slow to load", and the load time turns out to be about 6 minutes, then the first thing I'll check is whether it's trying to talk to a remote system, and whether there's some sort of problem in that communication.
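
If you wanted to write that shorthand down, it could be as simple as a lookup table. This is only a sketch: the table entries come from the lists above, but the function name, the hint wording, and the 5% tolerance are my own assumptions.

```python
# Known durations in seconds, mapped to the first thing to check.
MAGIC_NUMBERS = {
    86400: "one day: probably a test or job timeout",
    360: "3 tries x 2 minute RPC timeout: check communication with the remote system",
    300: "heartbeat check interval: heartbeats may never have succeeded",
    45: "Windows network mount connection timeout",
}


def explain_duration(observed_seconds: float, tolerance: float = 0.05) -> list:
    """Return hints for every magic number within `tolerance` (a fraction) of the observed duration."""
    hints = []
    for seconds, hint in MAGIC_NUMBERS.items():
        if abs(observed_seconds - seconds) <= seconds * tolerance:
            hints.append(hint)
    return hints


if __name__ == "__main__":
    # "Very slow to load" turned out to be about 6 minutes.
    for hint in explain_duration(362):
        print(hint)
```

A duration that matches nothing returns an empty list, which is its own kind of information: you're probably not looking at one of the known timeouts.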

It's not perfect, but knowing your system's magic numbers can often be a shortcut to finding its problems.
