Let's say we're debugging a problem in which machines in our lab keep being marked as down. In this system, a machine shows up as down for one of two reasons: either it sensed a problem and turned itself off, or it failed to respond for long enough that the other machines marked it down. About once a day, a machine gets marked as down. Rebooting the machine and adding it back has no ill effects.
The desired behavior is for machines to be marked as down only when there is a real problem with them. Currently, the system may be deviating from that behavior (or there may be an undetected but real problem with the machines).
Here is where it gets interesting. How do you find out what's going on and get to your desired state? There are several ways to ask the question:
Option 1: "The system is spuriously marking machines as down when they're fine. The system uses pings to figure this out. Why aren't the pings returning in time?"
This option has a whole lot of assumptions in it, and colors the direction of investigation. Here we're already positing:
- that the machines are fine,
- that the system is attempting to detect the state of the machine,
- that the machine is not sending out erroneous trouble messages,
- that the problem is in pinging the other machines,
- that the system is either not receiving or ignoring the ping responses.
If you already have all that information, then this is a perfectly valid question. Based on the problem description alone, though, it shouldn't be the first way you ask. It is also likely to make the person responsible for pings a bit defensive: their area is being singled out when we haven't even shown that the problem is there.
Option 2: "Machines seem to be getting marked as down more than we would expect. We need to figure out if the machines are really okay, and why they're getting marked as down."
This is roughly how I try to ask the question: give a summary of what's going on and why it's perceived as a problem, then offer a couple of concrete directions that, once understood, will eliminate problem areas. The goal is to focus effort without eliminating paths of enquiry; make the first part of the effort about subdividing the problem.
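As a rough illustration of what subdividing could look like here, the sketch below tallies down events by how each machine came to be marked down. It assumes such a record exists at all; the `DownEvent` type, the `cause` labels, and the sample data are all made up for the example and are not part of any real system.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class DownEvent:
    machine: str
    # Hypothetical field: how the machine ended up marked as down.
    # "self_shutdown" = the machine sensed a problem and turned itself off.
    # "timeout" = the other machines marked it down after it stopped responding.
    cause: str


def subdivide(events):
    """Count down events per cause so each branch can be pursued or ruled out."""
    return Counter(event.cause for event in events)


# Made-up sample data, for illustration only.
events = [
    DownEvent("lab-01", "timeout"),
    DownEvent("lab-07", "timeout"),
    DownEvent("lab-03", "self_shutdown"),
]

counts = subdivide(events)
for cause, count in counts.most_common():
    print(f"{cause}: {count}")
```

If nearly every event falls into one bucket, that is concrete information for narrowing the question. If the split is even, neither path has been eliminated yet.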
Option 3: "Why is this behavior occurring?"
This fails to state the problem. It merely asks for an understanding of behavior, and lacks focus. It's really only good for giving to people who know the system very well. Left as is, it opens a whole lot of paths to pursue; someone who knows the system and the desired behavior can narrow them down, or the question can be modified to do that itself.
There are several parts to asking a debugging question like this successfully:
- State (or reference) the desired behavior.
- State (or reference) existing knowledge.
- Do not eliminate avenues of investigation or areas of a system unless that is backed by concrete information.
- State the success criteria; that is, under what conditions will the question be completely answered?
Just by understanding the system and the problem well enough to articulate the question, you assume some authority. Your listener will believe you have insider knowledge. So phrase accordingly... and carefully.