Some of these are common on many systems:
- 86400 seconds: As in, "the test timed out after 86400 seconds". This is a day. As in, the darned thing didn't finish up in a whole day. Oops.
- 2^32 or 2^32 - 1: 2^32 - 1 is the largest value an unsigned 32-bit number can hold, so if you're on a 32 bit system, you're starting to wrap address space here. Signed values wrap at half that (2^31), so look for really large negative numbers where you're expecting positives, etc.
- 45 seconds: the default connection timeout for network mounts in Windows. If something fails after this long, you're probably looking at a timeout.
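To make the first two concrete, here is a minimal sketch of where those values come from. The helper names (`as_uint32`, `as_int32`) are made up for illustration; they just mimic what a 32-bit variable would do with a value that doesn't fit.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # the 86400 you see in "timed out after 86400 seconds"


def as_uint32(n: int) -> int:
    """Keep only the low 32 bits, as an unsigned 32-bit variable would."""
    return n & 0xFFFFFFFF


def as_int32(n: int) -> int:
    """Reinterpret the low 32 bits as a signed (two's complement) value."""
    n = as_uint32(n)
    return n - 2**32 if n >= 2**31 else n


print(SECONDS_PER_DAY)       # 86400
print(as_uint32(2**32))      # 0: an unsigned counter wraps back to zero
print(as_int32(2**31))       # -2147483648: a signed counter goes hugely negative
print(as_int32(2**32 - 1))   # -1
```

If a number in a log sits suspiciously close to one of these outputs, a wrapped 32-bit value is worth ruling out first.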
Some of these are specific to your system. For example, in our system we know:
- 6 minutes: The RPC timeout for a query to a remote system is 2 minutes, and it does 3 tries. 6-minute waits mean you're hitting this.
- 5 minutes: The frequency of heartbeat checks for certain operations. Failover after 5 min means these never succeeded.
- 25: default max number of simultaneous connections. Can be increased indefinitely, but if you start to see slowdowns or connection timeouts and you have 25 clients in use, you're probably going to want to change it.
- XX: Java heap size. (I don't remember this one off the top of my head, but I know it when I see it)
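The 6 minute figure is just arithmetic on two settings, which is a handy way to derive magic numbers for your own system. A rough sketch, assuming the tries run back to back with no delay in between (the function and constant names here are hypothetical, not our real configuration keys):

```python
def worst_case_wait_seconds(timeout_seconds: int, tries: int) -> int:
    """Longest a caller waits if every try runs to its full timeout.

    Assumes no pause or backoff between tries.
    """
    return timeout_seconds * tries


RPC_TIMEOUT_SECONDS = 2 * 60  # 2 minute timeout per try
RPC_TRIES = 3

total = worst_case_wait_seconds(RPC_TIMEOUT_SECONDS, RPC_TRIES)
print(total / 60)  # 6.0 minutes
```

If your retry logic does wait between tries, the real magic number will be a bit larger than the plain product.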
Simply knowing the numbers isn't the interesting part. What is interesting is the shorthand debugging that they offer you. For example, if support calls up and says that a customer is complaining that the management functions are "very slow to load", and "very slow" turns out to be about 6 minutes, then the first place I'll look is whether it's trying to talk to a remote system, and whether there's some sort of problem in that communication.
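If you wanted to write that shorthand down instead of keeping it in your head, it could look something like the sketch below. This is only an illustration: the `KNOWN` table, the `suspects` function, and the 10% tolerance are all assumptions of mine, and the entries are just the time-based numbers from the lists above.

```python
from typing import List, NamedTuple


class MagicNumber(NamedTuple):
    seconds: float
    hint: str


# Time-based magic numbers from this post; yours will differ.
KNOWN = [
    MagicNumber(45, "Windows network mount connection timeout"),
    MagicNumber(5 * 60, "heartbeat checks never succeeded"),
    MagicNumber(6 * 60, "RPC to a remote system: 3 tries x 2 minute timeout"),
    MagicNumber(86400, "something ran for a whole day before timing out"),
]


def suspects(observed_seconds: float, tolerance: float = 0.1) -> List[str]:
    """Return hints whose magic number is within `tolerance` (as a fraction) of the observation."""
    return [
        m.hint
        for m in KNOWN
        if abs(observed_seconds - m.seconds) <= tolerance * m.seconds
    ]


# "Very slow to load" turned out to be a little over 6 minutes:
print(suspects(365))
# ['RPC to a remote system: 3 tries x 2 minute timeout']
```

The tolerance matters because nobody reports "360 seconds"; they report "about 6 minutes", and the lookup has to forgive that.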
It's not perfect, but knowing your system's magic numbers can often be a shortcut to finding its problems.