Friday, August 8, 2008

Evidence and Proof

One of the responsibilities of QA is to look at things that support can't figure out, and today one of them came up. It looked fairly simple on the surface. We have a multi-machine system, and there is a log collection feature. Basically, it goes out, grabs all the logs from all the machines, tars them up, and presents them for download - very convenient. In this case, some of the logs were missing.

So I dug in.

After tracing it through code inspection and log file inspection, I found where things went wrong. The feature bundles up the logs on each of the machines and then transfers them for inclusion in the full log bundle - this is normal. On three of the machines that process failed, and on the other five it succeeded.

Now what?

I traced it a bit further on those machines specifically and tracked it down to a problem area. Here's what the code does (note that this is pseudocode, of course):

for each machine m in machines {
    nfs mount a drive on the machine
    grab the log tar file
    unmount the drive
}

On the five machines that worked, this ran fine. On the three machines that failed, this step didn't complete successfully, but it didn't raise any error at the log collection script level. Furthermore, I found evidence in the auth.log that the mount and unmount succeeded. That narrowed it down to the log tar file. My theory was that the tar file simply didn't exist or wasn't complete (it varied by machine), because the earlier step that creates the tar file had failed or hadn't finished.
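For what it's worth, here is a minimal Python sketch of the kind of check that would have tested the theory directly: look at the tar file before copying it and say whether it is missing, unreadable, or empty. This is not our collection script. The function name check_log_bundle and its return shape are made up for illustration, and the check will catch many damaged archives but not every possible truncation.

```python
import tarfile
from pathlib import Path


def check_log_bundle(path):
    """Return (ok, reason) for a per-machine log tar file.

    Hypothetical helper: the name and return shape are illustrative only.
    """
    bundle = Path(path)
    if not bundle.is_file():
        return False, "missing"
    try:
        # Listing the members makes tarfile walk every header, which
        # surfaces many (not all) truncated or corrupt archives as errors.
        with tarfile.open(bundle) as tar:
            members = tar.getmembers()
    except (tarfile.TarError, EOFError, OSError) as exc:
        return False, f"unreadable: {exc}"
    if not members:
        return False, "empty"
    return True, f"{len(members)} members"
```

Calling this right after the mount and recording the reason would turn "the copy didn't complete" into something more specific, such as "missing" on one machine and "unreadable" on another.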

Here's the problem.

All the evidence points to my theory being correct. I have no way to prove this, though. All I see in the logs is a mount, a copy timeout, and an unmount. Our script doesn't log whether it created the tar file successfully. There's no evidence in Linux syslogs, auth logs, kernel logs, message logs, anywhere, that a tar file got created (after all, creating a tar file is simply creating a file, and that's not really interesting enough to log at the OS level!).
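The missing piece of evidence is cheap to produce if the script that builds the tar file records the outcome either way. A rough Python sketch, assuming nothing about how our real script is written: the logger name "log_collection" and the function create_log_bundle are hypothetical, and gzip compression is just a choice made for the example.

```python
import logging
import tarfile
from pathlib import Path

log = logging.getLogger("log_collection")  # hypothetical logger name


def create_log_bundle(log_dir, bundle_path):
    """Tar up log_dir and log whether that worked (hypothetical helper)."""
    log.info("creating %s from %s", bundle_path, log_dir)
    try:
        with tarfile.open(bundle_path, "w:gz") as tar:
            tar.add(log_dir, arcname=Path(log_dir).name)
    except (OSError, tarfile.TarError):
        # log.exception records the traceback, so the failure leaves proof.
        log.exception("failed to create %s", bundle_path)
        return False
    size = Path(bundle_path).stat().st_size
    log.info("created %s (%d bytes)", bundle_path, size)
    return True
```

With a start line and a matching success or failure line, a start with no finish is itself evidence that the tar step never completed.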

So what do you do next?

I eventually found my proof - an error message buried in one of our other logs about the duration of the thread that was doing the tar. But it made me think:

When you can't prove something, how strong does your evidence have to be to convince you?

We don't live in a world of black and white. It's simply not possible to prove every assertion we make, particularly when we're outside of our (semi-controlled) test environment. I don't really have an answer for this, so I'll throw it to the world out there.

How do you handle these situations?
