Tuesday, May 27, 2008

Hypotheses, Theories, and Explanations

In the land of tracking down issues, there are some important words that really shouldn't be mixed up.

Hypothesis. A hypothesis is basically a guess. It's generally the first step in coming up with an explanation. 
Example: "We ran out of memory!" 
Hypothesis: "A long-running job was reading a lot of objects into memory and just never flushed them."

Theory. A theory is a hypothesis with some backing. You've taken your hypothesis and compared it to all known facts, and it's holding up well. 
Example: "We ran out of memory!"
Theory: "A long-running job was reading a lot of objects into memory and just never flushed them. We see a Java core that shows an out of memory error at that time from that job. We also have logs showing the job starting and in progress, but never completing. We can make a long-running job exhibit this behavior in a test system."

Explanation. An explanation is a theory that can be reproduced on demand, with the same starting state and the same result as the original problem. Resolving the issue shown by the explanation resolves the original problem (i.e., causes it not to recur). Basically, an explanation is a theory that's been proven in the field.
Example: "We ran out of memory!"
Explanation: "On a system configured like the client, we run out of memory due to the proposed theory (see above). Modifying the behavior so that long-running jobs periodically dump their objects from memory results in no further out-of-memory errors at the client site."
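
To make the fix in that example a bit more concrete, here's a minimal sketch of a long-running job that periodically dumps its objects instead of holding all of them. It's in Python purely for illustration (the example above involves a Java system), and every name in it (run_job, flush, batch_size) is hypothetical rather than taken from any real codebase.

```python
from typing import Callable, Iterable, List


def run_job(
    records: Iterable[int],
    flush: Callable[[List[int]], None],
    batch_size: int = 1000,
) -> int:
    """Process records, handing the buffer to `flush` every `batch_size` items.

    Returns the largest number of records held in memory at any one time.
    """
    buffer: List[int] = []
    peak = 0
    for record in records:
        buffer.append(record)
        peak = max(peak, len(buffer))
        if len(buffer) >= batch_size:
            flush(buffer)
            # Start a fresh list so the job no longer holds the flushed records.
            buffer = []
    if buffer:
        # Flush whatever is left over at the end of the job.
        flush(buffer)
    return peak
```

With 2,500 records and a batch_size of 1,000, the buffer never holds more than 1,000 records at once, where the unflushed version of the job would hold all 2,500. Whether a change like this actually resolves the original problem is, of course, exactly what has to be confirmed at the client site.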

In short, until you've reproduced it, you have a hypothesis. After you've reproduced it on another system, you have a theory. When it's happened again (or been fixed and shown to not happen), you have an explanation.
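
The rule of thumb above can be written down as a tiny decision function. This is only a sketch in Python; the names (Status, classify, and the two flags) are made up for illustration, and the choice to ignore a confirmation at the original site until the problem has also been reproduced is an assumption, not something the definitions spell out.

```python
from enum import Enum


class Status(Enum):
    HYPOTHESIS = "hypothesis"
    THEORY = "theory"
    EXPLANATION = "explanation"


def classify(reproduced_elsewhere: bool, confirmed_at_origin: bool) -> Status:
    """Map the evidence gathered so far to the terms used in this post."""
    # Assumption: a fix that seems to work at the original site does not
    # count on its own; the problem must also have been reproduced.
    if reproduced_elsewhere and confirmed_at_origin:
        return Status.EXPLANATION
    if reproduced_elsewhere:
        return Status.THEORY
    return Status.HYPOTHESIS
```

For example, classify(True, False) gives Status.THEORY: the problem has been reproduced on a test system, but nothing has yet been confirmed where it originally occurred.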

I started writing this post thinking it was going to be a fairly straightforward attempt to clarify some terms that are often used interchangeably, but it occurs to me that this might be a bit controversial. The only really strong (and completely tangential) thought I have here is that you don't know everything. You may know all the relevant things, but you may not. So until you've seen the problem fixed in the place where it originally occurred, you cannot say with 100% certainty that you've reproduced the issue. You may have reproduced a very similar issue. The vast majority of the time, you'll have the same issue; it's that last little bit and those really subtle issues that make life really interesting.
