Monday, May 12, 2008

Night Is 20 Hours Long

We have a problem. We have too many tests. Or, to be more precise, our nightly test runs take 20 hours on average. That doesn't leave us much time to react to what nightly has found!

The Current Situation
The nightly tests start at 7:05pm, and run until approximately 2 or 3pm the following day. At that point, the QA person of the week generates a report of all the failures, goes through each failure, logs (or updates) bugs, and then publishes this report.

Some of the Reasons
There are a lot of reasons nightly takes this long:
  • Many Long Tests. Some of the tests simply take 8 or more hours to run. As we find these we move them to a weekly run, but I don't think we've found them all.
  • Lack of Machines. Because developers and QA (you know, actual humans!) use the lab, too, we can't allow the nightly run to take over every single machine. So we can only run the nightly tests so much in parallel; some of it has to be serial.
  • Inefficiencies in Usage. This is actually more rare than you would think, but sometimes tests hang on to machines when they shouldn't, and so other tests can't run.
  • Hangs. Sometimes tests hang. They simply fail to return for hours until some human notices.
So, what do we do to make nightly runs complete faster? We have a lot of options, so let's break things down. First and foremost, a complete break is not an option. Whatever we do, we have to keep running the nightly tests in the meantime; our development model is predicated on that level of regression testing.

Small Changes
Small changes can have big effects, and they're relatively safe and cheap. We'll do them first:
  • Move Long-Running Tests to Weekly. Seek out more of the tests that take hours to run and get them out of the nightly run. Put them in the weekly run instead.
  • Test Hangs Are High Priority Bugs. If it's a hang, then that's a bug. Sure, the system behavior may be legitimate, but the test is broken. We make these high priority bugs and we get them fixed quickly.
  • Buy More Machines. Sometimes throwing more resources at the problem really does work! Running tests in parallel can shorten your total run time, and with more machines, the humans and the nightly are happy.
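
To get a feel for how much the first and third items might buy us, here is a rough sketch in Python. It assumes we already have a per-test duration in hours (the test names and numbers below are made up), splits off anything over a chosen limit into the weekly run, and then estimates wall-clock time for a given number of machines with a simple longest-test-first assignment. A real lab scheduler has constraints this ignores, so treat the result as a ballpark, not a prediction.

```python
import heapq


def split_by_duration(durations, limit_hours):
    """Split tests into (nightly, weekly) by a duration limit in hours."""
    nightly = {name: h for name, h in durations.items() if h <= limit_hours}
    weekly = {name: h for name, h in durations.items() if h > limit_hours}
    return nightly, weekly


def estimated_wall_clock(durations, machines):
    """Estimate hours until the last machine finishes.

    Greedy heuristic: hand the longest remaining test to whichever
    machine currently has the least work queued.
    """
    loads = [0.0] * machines
    heapq.heapify(loads)
    for hours in sorted(durations.values(), reverse=True):
        lightest = heapq.heappop(loads)
        heapq.heappush(loads, lightest + hours)
    return max(loads)


# Hypothetical durations, in hours.
durations = {"soak": 8, "upgrade": 3, "failover": 2, "smoke": 1}
nightly, weekly = split_by_duration(durations, limit_hours=4)
print(sorted(weekly))                       # tests moved to the weekly run
print(estimated_wall_clock(nightly, 1))     # serial
print(estimated_wall_clock(nightly, 2))     # two machines
```

Note that with the 8-hour test left in, two machines still can't finish in under 8 hours; no amount of hardware beats the longest single test.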

Medium Changes
Some changes are a bit bigger, but still not overly drastic.
  • Autogenerate Triage. Why wait until the end to generate triage? Have test failures available to the triage person as the tests fail. That way the bug logging/updating, etc. can take place throughout the day, and the report gets much easier to create.
  • Use an Existing Build. The first thing the nightly test run does is trigger a build, which takes about 90 minutes. We have a continuous integration system; we should just use the last known good build.
  • Merge Tests. If there are multiple tests with the same setup and tear down, but that exercise slightly different teams, maybe they can be more efficiently run in a single test or single suite.

Large Changes
If all of the above changes don't get us to where we need to be, then we have to consider large changes. These are fairly high-cost (in time at least) and are higher risk.
  • Run Constantly. There's an entire blog entry in this one, after I've thought it through a bit more. The basic idea is that you run a "nightly" run all the time, and it works more like a continuous integration system. It just picks up a test and runs it against the current build. You have to put logic in there to make sure all the tests get run as often as possible, and it changes the concept of "nightly run reports", but it's a way of rolling with the test growth instead of fighting it.
  • Aggressive Test Killing. Set an upper limit on test runs. If they don't run in that time, then they get cut off and that's a bug. This puts a huge onus on the test developer, and of course the duration is something that will vary by test type, but it would certainly be an effective way to prevent a few long-running tests from destroying an entire night's run.
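
The test-killing idea could look something like the sketch below, assuming each test can be launched as its own command. run_with_limit is a hypothetical helper name, and the limit would come from wherever we decide to record per-test-type durations. It leans on the timeout argument of Python's subprocess.run, which kills the child process when the limit expires.

```python
import subprocess
import sys


def run_with_limit(command, limit_seconds):
    """Run one test command and return 'pass', 'fail', or 'timeout'."""
    try:
        completed = subprocess.run(command, timeout=limit_seconds)
    except subprocess.TimeoutExpired:
        # subprocess.run has already killed the child at this point.
        return "timeout"
    return "pass" if completed.returncode == 0 else "fail"


if __name__ == "__main__":
    # A deliberately slow "test" that sleeps past its one-second limit.
    slow_test = [sys.executable, "-c", "import time; time.sleep(30)"]
    print(run_with_limit(slow_test, limit_seconds=1))
```

One caveat: this only kills the process it started. A test that spawns its own children may leave them behind, so a real harness would need some extra cleanup, and a "timeout" result would still have to be logged as a bug by whoever is doing triage.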

I don't know how far we'll have to go to get out of our current problem, but we're definitely working toward a solution.

How do you handle test duration creep, particularly in automated tests? What do you do to balance the desire for tests with the desire for fast results?
