Monday, October 20, 2008

High-Volume Bug Lists

There's a bit of back story to this:

Last Thursday, I wrote about parent and child tickets, which we use internally to help with root cause analysis and with the period between analyzing an issue and fixing it. Glenn Halstead commented that it "seems to assume that all test failures result in a defect being logged."

So I thought I'd step back and talk about how we handle our automated tests and repeat failures, etc.

Here are the basics:
  • Every night at 7:05 pm the automated tests start. (We call these tests "nightly".)
  • The test runner machine triggers a build.
  • The test runner machine then starts running all the automated tests. There are details about how this is done, but we can talk about that later.
  • (Hours pass)
  • In the morning, someone in QA checks on the progress of the automated tests. We use a script that gives us a quick overview of progress (see the sketch just after this list).
  • When the automated tests are done, the test runner machine is quiet.
  • QA then performs what we call triage.
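
Since that morning check comes up a lot, here's a rough sketch of the idea behind the progress-overview script. It is not our actual script: purely for illustration, it assumes each test writes a one-line status file (PASS, FAIL, RUNNING, and so on) under a per-night results directory.

    #!/usr/bin/env python
    # Rough sketch of a "how far along are the nightlies?" overview.
    # Assumes (hypothetically) one status file per test under results_dir.
    import os
    import sys
    from collections import Counter

    def summarize(results_dir):
        counts = Counter()
        for name in sorted(os.listdir(results_dir)):
            status_file = os.path.join(results_dir, name, "status")
            if os.path.exists(status_file):
                with open(status_file) as f:
                    counts[f.read().strip() or "UNKNOWN"] += 1
            else:
                counts["NOT STARTED"] += 1
        return counts

    if __name__ == "__main__":
        results_dir = sys.argv[1] if len(sys.argv) > 1 else "/results/nightly/latest"
        for status, count in summarize(results_dir).most_common():
            print("%-12s %d" % (status, count))
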
So what can happen in triage?
There are several possibilities:
  • The test finished successfully. (Hooray!)
  • The test finished but there was a failure (by this we mean assertion failure).
  • The test never finished (by this we usually mean it hung or timed out).
  • The test finished but threw an error not in an assertion.
  • The test finished but left its infrastructure (machines, network configuration) in an unclean state.
That's a lot of options, I know. Let's address each in turn:

The test finished successfully.
What It Means: This is the standard "test passes" case. This case covers the vast majority of tests.
What QA Does: Nothing. The passing test result is already in the test logs, and QA takes no further action.

The test finished but there was a failure.
What It Means: This is what I think of as the standard "test fails" case. There was an assertion, and the actual result did not match the expected result.
What QA Does: Logs or updates a bug. This may be a new bug or a note on the bug for an existing issue, but either way, the failure gets recorded every single night right in the defect tracking system. This makes it a lot easier when we're later looking at a bug and trying to identify how frequently it fails, when it started failing, whether it's still happening, etc.
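
To make that "log or update" decision concrete, here's a toy sketch. The Tracker and Bug classes below are in-memory stand-ins, not our real defect tracking system; the point is simply that every night's failure ends up recorded on some bug, whether new or existing.

    # Toy sketch of the "log or update a bug" decision (hypothetical tracker).
    class Bug(object):
        def __init__(self, title, description):
            self.title = title
            self.description = description
            self.status = "open"
            self.comments = []

    class Tracker(object):
        def __init__(self):
            self.bugs = []

        def search(self, title_contains):
            return [b for b in self.bugs
                    if b.status == "open" and title_contains in b.title]

        def create(self, title, description):
            bug = Bug(title, description)
            self.bugs.append(bug)
            return bug

    def record_failure(tracker, test_name, build_id, failure_summary):
        existing = tracker.search(title_contains=test_name)
        if existing:
            # Already being tracked: note tonight's occurrence so frequency
            # and history live right in the tracker.
            bug = existing[0]
            bug.comments.append("Failed again in nightly %s: %s"
                                % (build_id, failure_summary))
        else:
            bug = tracker.create("%s fails in nightly" % test_name, failure_summary)
        return bug

    if __name__ == "__main__":
        t = Tracker()
        record_failure(t, "nfs_copy_test", "2008-10-16", "expected 5 files, found 4")
        record_failure(t, "nfs_copy_test", "2008-10-17", "expected 5 files, found 4")
        print(t.bugs[0].title)      # one bug for the test
        print(t.bugs[0].comments)   # second night recorded as a comment

Either way, the frequency questions ("how often?", "since when?") can be answered later without digging through old test logs.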

The test never finished.
What It Means: One of two things happened here. The test may have timed out and been cut off. Or the test may have simply sat there "forever" (where forever actually means until it had taken far longer than the test should have and a human went in and killed it). This happens for us when a test is waiting for a copy to finish over a mount that has died, for example.
What QA Does: Logs or updates a bug. Again, the bug might be new or a note on an existing bug. What we're looking for here are bugs that aren't caught by assertions. Sometimes these are test infrastructure bugs (the test killed the mount underneath the file copy, for example), and sometimes it's a symptom of an issue that the test simply doesn't directly assert for (say, a kernel problem where the connection isn't properly closed when an NFS mount times out).
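
For the timed-out half of this, here's a minimal sketch of running a single test under a time limit. It uses present-day Python's subprocess timeout support rather than what our harness actually does; the point is that a hang/timeout gets reported as its own outcome, distinct from a pass or an assertion failure.

    # Minimal sketch: run one test command under a time limit.
    import subprocess

    def run_with_timeout(cmd, timeout_seconds):
        try:
            result = subprocess.run(cmd, capture_output=True, timeout=timeout_seconds)
        except subprocess.TimeoutExpired:
            # subprocess.run has already killed the child; report the
            # hang/timeout rather than a pass or an assertion failure.
            return "TIMED_OUT"
        return "PASSED" if result.returncode == 0 else "FAILED"

    if __name__ == "__main__":
        print(run_with_timeout(["sleep", "5"], timeout_seconds=2))  # TIMED_OUT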

The test finished but threw an error not in an assertion.
What It Means: These are what I think of as "crufty" bugs. They're not going to kill you, probably, but they sure make later debugging of client problems harder because there's a lot more noise in the logs. And often they're indicators of an inefficiency. For example, maybe we're trying to use an interface before the network is configured. As long as there's a retry it'll work eventually, but it means you have a logic flaw in your code.
What QA Does: Logs or updates a bug. As usual, the bug might be new or a note on an existing bug. These tend to be lower-priority bugs and may not get fixed as quickly, but they are still in the defect tracking system. Where this comes in really handy is when you're analyzing logs from a client site and you see a bunch of errors. A quick search of the defect tracking system can reassure you that, sure, it's an error, but it's not fatal, and you need to keep looking for the real problem.
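
A simplified sketch of that log-scanning habit: anything matching an already-tracked noise pattern gets set aside, and whatever is left is a better candidate for the real problem. The patterns below are made up, and in practice the lookup is a search of the defect tracking system, not a hardcoded list.

    # Sketch: separate known "crufty" errors from unknown ones in a client log.
    import re

    KNOWN_NOISE = [
        re.compile(r"interface eth\d+ not ready, retrying"),      # hypothetical
        re.compile(r"transient DNS lookup failure, will retry"),  # hypothetical
    ]

    def classify_errors(log_lines):
        known, unknown = [], []
        for line in log_lines:
            if "ERROR" not in line:
                continue
            if any(p.search(line) for p in KNOWN_NOISE):
                known.append(line)      # already tracked: noise, keep looking
            else:
                unknown.append(line)    # candidate for the real problem
        return known, unknown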

The test finished but left its infrastructure in an unclean state.
What It Means: Our tests are expected to clean up after themselves: remove mounts, tear down special network configurations, etc. This way the machine that the test is running on can be used for another test and we know it's clean. When this doesn't happen, a machine "leaks" (i.e., cannot be used until it's manually fixed).
What QA Does: Handles an automatically logged bug. This is the only instance in which our test infrastructure logs a bug automatically; QA cleans up the machine and figures out whether there's a real bug or something like failed hardware.
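
Here's a rough sketch of what a post-test cleanliness check could look like, reusing the toy tracker idea from the earlier sketch. The only check shown is leftover mounts (read from /proc/mounts, so Linux-specific); the real infrastructure checks much more, but the shape is the same: if the machine isn't clean, pull it from the pool and auto-file a bug for QA to triage.

    # Sketch of the post-test cleanliness check (hypothetical details).
    def machine_is_clean(expected_mounts):
        current = set()
        with open("/proc/mounts") as f:
            for line in f:
                current.add(line.split()[1])    # mount point is the 2nd field
        return current <= set(expected_mounts)

    def check_after_test(test_name, machine, expected_mounts, tracker):
        if machine_is_clean(expected_mounts):
            return True
        # The machine has "leaked": this is the one place a bug gets filed
        # automatically; QA later sorts out real bug vs. failed hardware.
        tracker.create("%s left %s unclean" % (test_name, machine),
                       "Machine pulled from the pool pending manual cleanup.")
        return False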

There's a definite pattern there: if the test doesn't succeed completely - including setup, running without error, passing all assertions, and teardown - then it's almost certain to result in a bug. That bug might be new, or it might be a notation in an existing bug. This means we have a pretty high volume of bugs, but it also means our collective test history is pretty much contained within the defect tracking system. For my part, I think it makes pattern analysis easier, and it makes sure we don't miss a bug - any issues we're not directly asserting on still get caught and handled.
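
Put another way, the whole triage step boils down to a small outcome-to-action table, something like this (names illustrative, not our actual tooling):

    TRIAGE_ACTIONS = {
        "passed":        "no action; result is already in the test logs",
        "assert_failed": "QA logs a new bug or updates an existing one",
        "hung":          "QA logs/updates a bug (infrastructure or non-asserted issue)",
        "errored":       "QA logs/updates a bug (usually lower priority)",
        "left_unclean":  "infrastructure auto-files a bug; QA reclaims the machine",
    }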
