Friday, April 3, 2009

Scaling Up

We sell large systems. These "storage in the sky" things are really, really easy to put data into, and deleting is not a common operation for most of our customers. After all, an online archive system is pretty much made for companies that have to keep data around for 5, 10, or 20 years just in case they get audited. (SOX was great for storage companies!)

That's great, but it poses a rather large testing dilemma. Can we really test a system as large as sales can sell?

The short answer is not really.

Let's say our system is 250 TB. Even if we were to do nothing but pump data in 24 hours a day at 20 MB/s per machine, it would take about 151 machine-days to fill. Multiply that by the number of different releases you have to support and the myriad interesting tests (fill with small files! fill with huge files! different directory structures!), and you've got a really big job on your hands. The hardware costs alone are enormous. I'm also pretty sure sales can dream bigger than that.
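For the curious, here's the back-of-the-envelope math behind that number, assuming binary units (1 TB = 1024 × 1024 MB); the figures are just the ones from the paragraph above:

    // Fill-time estimate: 250 TB at 20 MB/s per machine, binary units assumed.
    public class FillTime {
        public static void main(String[] args) {
            double capacityMB = 250.0 * 1024 * 1024;   // 250 TB expressed in MB
            double ratePerMachineMBps = 20.0;          // sustained ingest per machine
            double seconds = capacityMB / ratePerMachineMBps;
            double machineDays = seconds / (60 * 60 * 24);
            System.out.printf("%.1f machine-days to fill%n", machineDays); // ~151.7
        }
    }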

So, we can't actually expect to do everything our customers do on a system level. What now?

After all, "big system" and "full system" are both boundary conditions, and boundaries are exactly where we expect to find bugs.

There are a number of things we can do to help identify and prevent defects related to large and/or full systems:
  • Code inspection. Stop treating the system like a black box, and start looking for where it might break. Think queues, memory allocation, references to data, etc. You're already looking at the code (I hope); as you do, ask yourself what happens if item X gets really large or really numerous.
  • Unit tests. Let's say my code inspection turns up that we create a pointer to every piece of data written. I don't have to actually write all that data; I can create a test that simply creates the pointers and then exercises them (retrieves one, deletes some, etc.). It's much faster and much cheaper to run, and it'll show me how that particular data structure scales. (There's a sketch of this kind of test just after this list.)
  • Added constraints. If you can't scale the system up, you can sometimes scale the environment down. Using Java? Set the heap to half the size you normally run (drop -Xmx accordingly). Ship with 4 GB of memory? Try running on a system with 1 GB. That way you hit the constraints a lot earlier. Your false-positive rate will probably be higher, but it can expose some edges. (The second sketch below shows the idea.)
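
To make the unit-test idea concrete, here's a minimal sketch. Everything in it is hypothetical: DataIndex stands in for whatever structure holds a pointer per piece of data written, and the counts are purely illustrative. The point is that we exercise the structure's scaling directly, without writing a byte of real data.

    import java.util.ArrayList;
    import java.util.List;
    import org.junit.Test;
    import static org.junit.Assert.assertEquals;

    // Hypothetical stand-in for the structure that keeps a pointer
    // per piece of data written.
    class DataIndex {
        private final List<Long> pointers = new ArrayList<Long>();
        void add(long pointer) { pointers.add(pointer); }
        long get(int i)        { return pointers.get(i); }
        void removeLast()      { pointers.remove(pointers.size() - 1); }
        int size()             { return pointers.size(); }
    }

    public class DataIndexScaleTest {
        @Test
        public void holdsMillionsOfPointersWithoutWritingData() {
            DataIndex index = new DataIndex();
            int count = 1000000; // dial this up as far as your test box allows
            for (long i = 0; i < count; i++) {
                index.add(i);
            }
            assertEquals(count, index.size());
            assertEquals(42L, index.get(42));  // retrieve one
            for (int i = 0; i < 1000; i++) {   // delete some
                index.removeLast();
            }
            assertEquals(count - 1000, index.size());
        }
    }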

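And here's a sketch of the scaled-down-environment trick. Nothing below comes from our real product; it just shows how starting the same kind of workload under a deliberately small heap (java -Xmx64m HeapConstraintDemo, flag value illustrative) makes the memory boundary show up in seconds instead of days.

    // Run under a deliberately tiny heap, e.g.: java -Xmx64m HeapConstraintDemo
    import java.util.ArrayList;
    import java.util.List;

    public class HeapConstraintDemo {
        public static void main(String[] args) {
            long maxHeapMB = Runtime.getRuntime().maxMemory() / (1024 * 1024);
            System.out.println("Running with max heap: " + maxHeapMB + " MB");

            // Hypothetical stand-in for the real ingest path: accumulate a bit
            // of per-item bookkeeping until the constrained heap pushes back.
            List<byte[]> perItemState = new ArrayList<byte[]>();
            int items = 0;
            try {
                while (true) {
                    perItemState.add(new byte[1024]); // ~1 KB of bookkeeping per "item"
                    items++;
                }
            } catch (OutOfMemoryError e) {
                perItemState = null; // release the ballast so we can still report
                System.out.println("Hit the memory wall after " + items + " items");
            }
        }
    }

The smaller you set the heap, the sooner you find out what the failure actually looks like; just remember that some of what you see will be an artifact of the squeeze rather than a real production bug.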
The common theme here is faking it. Treat the system as its component parts instead of as one large mass. System-level tests are great, but sometimes the more effective test is the one that attacks the potential underlying problem directly. You have to think a bit harder to anticipate the problem, but in the end you'll test for it more effectively, faster, and cheaper.
