Thursday, April 30, 2009

The Bad Days

We run a large suite of tests every night - it numbers in the thousands. As the suite has gotten rather large, we've built some infrastructure around it, and we watch it pretty closely. The suite actually catches things, too - race conditions, regressions, deadlocks - that are hard to find manually.

When things are running well, this isn't a huge deal. It's an hour or so every morning checking to see if there are new failures, etc.

And then there are the bad days....
The day someone checks in a bug in the infrastructure and 500+ tests fail.
The day someone adds a new package, does it incorrectly, and in the morning there are 150 machines stuck in tests.
The day a hard drive eats itself and takes out an entire set of 200+ tests.

Those days are enough to make you think you should scale back your regression suite. They're enough to make you think you should just stop.

Don't stop. Stopping is overreacting.

Instead, ask yourself how you can make this powerful tool (remember, this thing finds some deep, nasty, hard-to-find bugs) a bit friendlier. Maybe you can make cleanup easier. Maybe you can make it easier to test changes to the infrastructure. Maybe you can improve error reporting so you can handle the 200 identical failures in a single stroke. Whichever you pick, you're addressing the problem while keeping the good.
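To give a flavor of that last one, here's a rough sketch of how a report could collapse identical failures by normalizing each error message into a signature and grouping on it. This isn't the reporting we actually run - the failure format and the noise-stripping patterns below are just placeholders:

    import re
    from collections import defaultdict

    def signature(error_text):
        """Collapse run-specific noise (timestamps, machine names, addresses)
        so that identical failures produce identical keys."""
        sig = re.sub(r"\d{4}-\d{2}-\d{2}[ T]\d{2}:\d{2}:\d{2}", "<time>", error_text)
        sig = re.sub(r"0x[0-9a-fA-F]+", "<addr>", sig)
        sig = re.sub(r"lab-\w+", "<machine>", sig)
        return sig.strip()

    def group_failures(failures):
        """Group (test_name, error_text) pairs by signature so that 200 copies
        of one infrastructure failure show up as a single bucket to triage."""
        buckets = defaultdict(list)
        for test_name, error_text in failures:
            buckets[signature(error_text)].append(test_name)
        return buckets

    if __name__ == "__main__":
        failures = [
            ("test_login_01", "2009-04-30 02:13:05 connection refused by lab-12"),
            ("test_login_02", "2009-04-30 02:14:41 connection refused by lab-07"),
            ("test_search_17", "assertion failed: expected 3 results, got 2"),
        ]
        for sig, tests in group_failures(failures).items():
            print("%4d x %s" % (len(tests), sig))

A report like that turns a wall of 200 red lines into one line you can act on.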

Your first reaction isn't always the right one, particularly on the bad days. In our case it's a large test suite. In your case it may be something else. Either way, don't throw the baby out with the bath water, and don't make a big decision rashly, no matter how bad the day is.

For every bad day, you'll find a good day. Be patient and you will get through it.


  1. As someone who does not live with big automation every day, I'm curious - is the automated suite tested like other code projects are tested? For instance, do you test the suite by running it against a static data set after every update?

  2. The test infrastructure is code just like any other code, and it sometimes has bugs, so yes, we test it. (We'd go nuts otherwise!)

    I'll write about this in more detail someday, but basically we have a couple of sets of tests (called Lab tests and Infrastructure tests) that run alongside the product tests. These verify the behavior of the test utilities themselves - for example, the AsyncRead utility we wrote, or the MockServer utility that most of our tests use.
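    To make that concrete, here's a minimal sketch of what an infrastructure test can look like. The real AsyncRead and MockServer APIs aren't described here, so the little EchoServer utility below is purely illustrative - the point is simply that the test exercises the utility itself rather than the product:

        import socket
        import threading
        import unittest

        class EchoServer:
            """Toy stand-in for a test utility (not the real MockServer):
            accepts one connection and echoes back whatever it receives."""
            def __init__(self):
                self._sock = socket.socket()
                self._sock.bind(("127.0.0.1", 0))  # grab any free port
                self._sock.listen(1)
                self.port = self._sock.getsockname()[1]
                threading.Thread(target=self._serve, daemon=True).start()

            def _serve(self):
                conn, _ = self._sock.accept()
                with conn:
                    conn.sendall(conn.recv(1024))

            def close(self):
                self._sock.close()

        class InfrastructureTest(unittest.TestCase):
            """An infrastructure test: it verifies the test utility itself,
            so a broken utility gets caught before it fails hundreds of
            product tests that depend on it."""
            def test_echo_round_trip(self):
                server = EchoServer()
                try:
                    with socket.create_connection(("127.0.0.1", server.port)) as client:
                        client.sendall(b"ping")
                        self.assertEqual(client.recv(1024), b"ping")
                finally:
                    server.close()

        if __name__ == "__main__":
            unittest.main()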

    Without the infrastructure tests, I suspect we'd have more bad days!