We're making a big switch in the lab: we're upgrading the underlying operating system on our machines. This is something we more or less have to do - staying on an ancient OS only makes things like security patching harder. I can't say it's my idea of a good time, though - it's a lot of work!
Anyway, once we've done all the basics - setting up FAI, setting up build and test systems, getting the lab migrator to run (so we can move machines from old to new and back again), etc. - then we can start the tests.
We start slowly - one night on the new OS, and then not again for a while. This way teams get a chance to go through their problem areas and fix them before they get hit with the same ticket again. At this point, too, our number of failures is generally quite high, and one or two problems will take out entire swaths of tests ("can't talk to the NTP server: 42 machines can't be cleaned up!" or "stunnel configuration is different: killed two entire suites!"). It's a fairly quick and easy way to find problems that affect each of the teams. It's a learning experience, it makes a bit of a mess, and we all join in cleaning up after it.
At some later point we decide that it's time to switch for good, and that's when we get strict. From here on we worry about things in the following order:
- First, resolve compilation errors. If it doesn't compile, not much is going to run.
- Second, resolve bugs that cause machines to leak. If a test doesn't clean up its machines when it's done, those machines aren't available for later tests, and eventually the entire lab grinds to a halt. Generally this is accompanied by cries of "we're out of machines!" and tests just not finishing because there's no machine for them to run on.
- Third, resolve bugs that are likely to hide other bugs. If I have a bug in my test setup, who knows what will happen when I get to actually exercising the thing I thought I was testing!
- Fourth, handle everything else. Once you've gotten through the first three items, then just start fixing bugs according to your preference.
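If you wanted to bake that ordering into a triage script, a minimal sketch might look like the following. To be clear, this isn't our actual lab tooling: the `Failure` record, its fields, and the `classify` and `triage` functions are all hypothetical names made up for illustration, and I'm assuming someone has already tagged each failure by hand.

```python
from dataclasses import dataclass
from enum import IntEnum


class Priority(IntEnum):
    """Lower value means fix it sooner (mirrors the four-step order above)."""
    COMPILE_ERROR = 1
    MACHINE_LEAK = 2
    HIDES_OTHER_BUGS = 3
    EVERYTHING_ELSE = 4


@dataclass
class Failure:
    # Hypothetical record; the fields are assumptions, not a real ticket schema.
    ticket: str
    compiles: bool = True
    leaks_machines: bool = False
    in_test_setup: bool = False  # setup bugs tend to hide whatever comes after


def classify(failure: Failure) -> Priority:
    """Map a failure to the first category in the list that applies."""
    if not failure.compiles:
        return Priority.COMPILE_ERROR
    if failure.leaks_machines:
        return Priority.MACHINE_LEAK
    if failure.in_test_setup:
        return Priority.HIDES_OTHER_BUGS
    return Priority.EVERYTHING_ELSE


def triage(failures: list[Failure]) -> list[Failure]:
    """Return failures in fix order; sorted() is stable, so ties keep input order."""
    return sorted(failures, key=classify)


if __name__ == "__main__":
    backlog = [
        Failure("flaky-assert"),
        Failure("setup-bad-config", in_test_setup=True),
        Failure("no-cleanup", leaks_machines=True),
        Failure("missing-header", compiles=False),
    ]
    for f in triage(backlog):
        print(classify(f).name, f.ticket)
```

The checks run in the same order as the list, so a failure that both leaks machines and lives in test setup gets filed under the leak, which is the one that stops the lab.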
The overall goal is to expose the bugs. Get things running, then follow up with getting them running right. Hopefully this whole process doesn't take you too long, but sometimes when you're mired in the land of "this underlying thing broke a lot!" it helps to step back and think about what to prioritize. You can make the whole process go a bit more smoothly if you think for a minute, then leap in and start fixing.
Good luck, and happy resolution!