Thursday, August 5, 2010

Statistics As Oracles

We use oracles when we test - things to which we can compare the behavior of our system. Usually an oracle is something we consider correct (just like the one at Delphi). Sometimes, the challenge of a test is identifying an oracle.

And sometimes there is no oracle that is right all the time. Sometimes the best oracle we have can only tell us if something is probabilistically correct (i.e., whether it's "plausible").

For example:
Let's say we're testing a system and that system distributes data among machines according to the first letter in that data. Let's further say that the data is all letters and there's a lot of it - billions of entries.

| 1: A->F | 2: G->N | 3: O->S | 4: T->Z |

We ran the data through and want to know if it got distributed correctly.

How do we do it?

There are a lot of ideas we can try:
  • grep in each of the buckets! Here's a hint: at any volume, you're just going to hang up your computer - this one doesn't really work.
  • run a sample data set through. This one works for small scale tests but won't tell you if you've got a problem in the field or a subtle bug at volume. This is a great first test, but not an only test.
  • statistics. This one is powerful, although not deterministic.
Let's talk about creating an oracle from statistics. If, for example, each of our entries is an English word. We can look up the frequency of the first letters of English words. That can be an oracle. Your letter counts should be about equal to the overall frequency.

Because this is a statistics-based, or probabilistic, oracle, it won't be perfectly correct. You may be off by a half percent or a percent. The point is that over time and over enough entries, you should expect your statistics to about match your oracle. And thus you have a probabilistic oracle.

No comments:

Post a Comment