Thursday, July 31, 2008

Overspecifying

I'm working on a Rails project at the moment, and we're using Selenium. That's great, but now we want to move it all onto Selenium Grid. Why? Well, the few tests we have already take something like 15 minutes to run, and it would be nice to be able to spread out the work a little bit. Fifteen minutes is just too long to wait for feedback on a change while doing development, and it's only going to get worse.

I've been working on actually moving the tests to the grid. There are some variables that you set when you run the test: SELENIUM_SERVER_ADDRESS, SELENIUM_PORT, SELENIUM_TEST_ADDRESS, Browser, etc. In getting this working, I've found these variables specified in no fewer than 10 different places in each project!

That's what I call overspecifying.

Sure, you can set these variables in multiple places, and in theory they override one another in some predictable order, but setting them in this many places only leads to problems. How do you know which one(s) to change?

So, unless there's some really good explanation I haven't thought of, stick to setting your test variables in only one location. Just because you can set something and override it doesn't mean you should.
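
As an illustration of "only one location": a single module owns the settings, and the environment is the only thing that can override its defaults. The project above is Rails, but the idea carries over; here's a minimal sketch in Python using the variable names mentioned above (the module name, defaults, and helper are hypothetical).

```python
# selenium_settings.py -- a hypothetical single home for the grid settings.
# The environment variable names are the ones above; the defaults are made up.
import os

SELENIUM_SERVER_ADDRESS = os.environ.get("SELENIUM_SERVER_ADDRESS", "localhost")
SELENIUM_PORT = int(os.environ.get("SELENIUM_PORT", "4444"))
SELENIUM_TEST_ADDRESS = os.environ.get("SELENIUM_TEST_ADDRESS", "http://localhost:3000")
BROWSER = os.environ.get("BROWSER", "*firefox")

def selenium_settings():
    """The one authoritative set of grid settings for this test run."""
    return {
        "host": SELENIUM_SERVER_ADDRESS,
        "port": SELENIUM_PORT,
        "app_url": SELENIUM_TEST_ADDRESS,
        "browser": BROWSER,
    }
```

Every test asks this module for its settings; nothing else reads the environment, so there's exactly one place to change.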

Wednesday, July 30, 2008

Softer Side of Software

Most of the developers I work with are geeks. Vocabularies are large, and jokes about NFS are common. And yet when we talk about the product, the actual language is interesting:
  • "I feel like the product...."
  • "It just doesn't look right..."
  • "It feels more stable..."
  • "Sure, it feels faster...."
  • "I feel like the writes are faster than the reads."
For a bunch of math-oriented geeks, we're sure using a lot of fuzzy words!

The only real takeaway I have here is noticing again that it doesn't matter so much what empirical evidence you have. How people feel about the product still trumps everything else!

Tuesday, July 29, 2008

English-Prime

I've written before about loaded words that QA tends to use - hang, crash, etc. They're words that are often less precise than they should be. We may say "hang" when we mean "didn't respond to commands for over 2 hours". I've just discovered the idea of "English-Prime", and it plays right into this.

Basically, e-prime, or English-prime, is a form of English that lacks the verb "to be" in all its forms. "The system is hung." is not valid e-prime. "The system does not respond to commands." is valid e-prime.

So what's the point?

The idea is that forcing yourself to avoid "to be" and all its derivatives keeps you from making statements about what the system is. Instead, you make statements about your interactions with the system and your experiences of the system. Doing this helps you avoid assumptions about the system and its intended behavior, and your statements become more accurate.

Give it a shot!

Monday, July 28, 2008

Dig As Far As You Should

We've been running a lot of different tests with a lot of different third-party programs lately. One of these is fairly straightforward: you set up the client program, set up your system, run the right Windows task, and off it goes. It comes back about 14 hours later with a report.

We've done this test a couple of times, and everything's going swimmingly except for one thing: performance is lower than we'd expect. Oh dear. So we start digging in, and we find something. The first time, the test was running at a time when the network it ran on was absolutely slammed. Other performance tests run under these conditions are 3x slower than they are when the network is less busy.

Hooray! We've found the problem!

We reran the test at a different time, when the network utilization was much lower. And.... it was still too slow. Oh dear. So we start digging in, and we find something. This time, about two hours into the test, one of the machines in our system had a hardware error and failed.

Hooray! We've found the problem!

Wait a minute. This isn't a particularly cheap test. We can really only run it once every day or two. So let's keep digging and just check that there's nothing else going on. Lo and behold, we found that even discounting that piece of hardware, it was still too slow. Once we started digging again, we found some more things that needed tweaking (more RAM on the client box to keep it going, in this case).


There are two morals to this story: 

First, if a test is not cheap, look beyond the first thing you find before you go trying it again. You'll still come out ahead, timewise. The longer the test, the more time you can (and should!) spend really understanding what went on between test runs.

Second, don't ever assume that the first thing you find is the right thing to find, or the only thing you will find. It's just the first place you looked.




Friday, July 25, 2008

"That's Cool"

I was in a cross-team meeting today, and the sales person there said, "If someone says 'That's cool!', you know you don't have a sale."

And I got to thinking...

I believe the sales guy is perfectly correct. People - especially those with big budgets - don't buy stuff because it's really neat. They buy it because it fills a need. So having a prospect say, "That's cool!" means the prospect is NOT saying, "Wow I really need that" or "That could save me so much time!".

But...

When you're hiring, having the candidate say, "That's cool!" is a very good thing. The kind of engineer I want is the kind of engineer who's going to listen to a problem we have, or a special kind of trick we've got, and say, "that's cool!". Here, that means that the candidate is thinking he'll be challenged and that the work will be interesting.

So I empathize with the sales guy, but I still get a good thrill when someone I'm pitching to says, "That's cool!".

Thursday, July 24, 2008

Exploring Your System

In my (not so copious) spare time, I play Civilization. It's an interesting game for a lot of reasons, but recently I was playing with a friend and noticed that he explores his world totally differently than I do - and yet neither of us had ever considered the other's approach.

First, some background for those who don't play the game much. Basically, you're dropped into a world, and you have to explore that world, create towns, harvest resources, go to war with (or trade with) the other players, etc. The relevant notion here is exploring the world, which is done by walking people around - basically, you send a guy out and he walks around looking at stuff. The more he sees, the more you know about the world - mountains to the north of you, coastline to the east, desert to the south, etc.

Here's what you see when you first start: not much.

Now I generally explore by walking in widening circles around my starting point, looking at everything near me.

This other player explores by making a beeline as far as he can go in one direction. When he hits something impassable - like, say, an ocean - he returns to home base, picks a new direction, and walks as far as he can.


It occurred to me that this is much like how we explore a system. Some QA engineers will try all the "happy path" items first. Other QA engineers will do everything they can think of to one specific feature before they touch another feature.

The good news about the happy-path types is that they quickly get the coverage of a smoke test. If a feature is completely broken, they'll find it. The good news about the beeline types is that they get deep into a feature more quickly. If there is some nefarious but deep bug hiding in a feature, they may find it first... if they picked that feature to look at.
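
To make the contrast concrete, here's a small sketch (in Python, with invented feature and test-idea names) of the two orderings over the same pool of test ideas: the happy-path explorer takes one basic case from every feature before going deeper, while the beeline explorer exhausts one feature before touching the next.

```python
from itertools import zip_longest

# Hypothetical test ideas grouped by feature; the names are invented for illustration.
test_ideas = {
    "login":  ["basic login", "bad password", "locked account", "unicode username"],
    "search": ["simple query", "empty query", "10,000 results", "special characters"],
    "export": ["small export", "huge export", "export while searching"],
}

def happy_path_order(ideas):
    """Breadth first: the first (simplest) case of every feature, then the second, and so on."""
    rounds = zip_longest(*ideas.values())
    return [case for round_ in rounds for case in round_ if case is not None]

def beeline_order(ideas):
    """Depth first: exhaust one feature completely before touching the next."""
    return [case for cases in ideas.values() for case in cases]

print(happy_path_order(test_ideas)[:3])  # ['basic login', 'simple query', 'small export']
print(beeline_order(test_ideas)[:3])     # ['basic login', 'bad password', 'locked account']
```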

I'd like to make sure my team balances both - coverage and depth.

Wednesday, July 23, 2008

Selling QA

Due to various quirks of my job, QA in general, the seating arrangements where we are, etc., I find myself having to talk to non-tech types about QA a fair amount. These are usually not engineers of any sort; think sales guys, consultants, marketing people, and the like. And then the question comes up:

"What exactly is QA, anyway?"

My task, whatever it was before, just changed. Now I have to sell QA. And this one is a bit tricky. There are a lot of definitions of QA and testing out there. These examples are some of my favorites:
  • Exploring a system in order to provide information.
  • Gaining an understanding of a system and procedures surrounding the system, with an eye toward aiding effective business decisions.
  • Being an advocate for doing things well throughout an organization.

The problem with a simple definition is that it doesn't tell you WHY. Ultimately, that's what the questioner wants to know: why do I care about this function? What does it give me?

That's a little trickier. I tend to answer as follows:

QA's job is to minimize customer surprise.

Now, given the audience - usually sales, marketing, and support - this hits home. Customers value consistency; even problems are okay, as long as they're expected. What customers don't like is being surprised. So what you're telling your audience is not to worry about the specifics of testing or process or all of the other things that QA does. Just worry about what it gives you. And what it gives you is a customer who isn't going to get a shock and call you up unexpectedly.

So, how do you sell QA?

Tuesday, July 22, 2008

Metrics Vs Goals

There seems to be some confusion about metrics versus goals. What exactly is the difference? And when exactly should I be using each?

Let's start with a definition.*
A metric is a guideline that is used over time to encourage good or effective behaviors.
A goal is a measurable, achievable action or state.

Examples of metrics include:
  • The defect leakage rate (number of bugs found after the software was released) should be no more than 10 bugs per release. This measures how good we are at finding things that would actually show up in the field.
  • Actual implementation time should be 5% closer to estimated time in each release. This measures how accurate we are at estimating and provides a guideline for reasonable improvements. Once we get close, then we can change this to be "no more than 10% off" or something else appropriate.
  • A given point release will have 10% fewer new bugs identified than the previous point release. This is a measurement of the notion that we should be more stable as we move through minor releases.
Examples of goals include:
  • Support a maximum file size of 1 TB. We need to work to support this, and once the work is done, it's done. Any later change to that would be what a loss of features normally is - a regression.
  • Implement the new registration page design. Again, this is a new (or changed) feature in the product. Do it once, and you're done.
  • Solve all the validation problems with the login form.

The difference between metrics and goals can be made to sound subtle, but it boils down to whether you are handling a one-time enhancement/change (a goal), or whether you're trying to handle something that will apply over time, even as product features change (a metric). 

Often, goals are for the software you're working on. Metrics are for how you produce that software.
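
A crude way to see the difference in practice, sketched in Python with invented numbers: a metric gets re-evaluated every release, forever, while a goal is achieved once and only matters afterward as a regression check.

```python
# Invented release data for illustration; "leaked_bugs" = bugs found after release.
releases = {
    "2.0": {"leaked_bugs": 14, "max_file_size_tb": 0.5},
    "2.1": {"leaked_bugs": 9,  "max_file_size_tb": 1.0},
    "2.2": {"leaked_bugs": 7,  "max_file_size_tb": 1.0},
}

LEAKAGE_METRIC = 10  # the metric from above: no more than 10 leaked bugs per release

# A metric is checked for every release, over time.
for name, data in releases.items():
    status = "ok" if data["leaked_bugs"] <= LEAKAGE_METRIC else "over the line"
    print(f"release {name}: {data['leaked_bugs']} leaked bugs -> {status}")

# A goal is achieved once; afterward it only matters as a regression check.
GOAL_MAX_FILE_SIZE_TB = 1.0
achieved_in = next(name for name, data in releases.items()
                   if data["max_file_size_tb"] >= GOAL_MAX_FILE_SIZE_TB)
print(f"1 TB file size goal achieved in release {achieved_in}; any later drop is a regression")
```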





* You don't have to agree with my definition, but at least for this post we're all talking about the same thing.

Monday, July 21, 2008

Benefits Traceability

One of the goals of creating user stories is that each story produces a customer benefit. After all, if the customer doesn't benefit, why do it? This covers everything from stories for features customers want to stories that make support more efficient (the benefit to the customer being that support will respond more quickly). When you get into it, though, sometimes it's hard to remember how a given change really ties back to that benefit.

Enter benefits traceability. Much like a standard requirements to design to test cases traceability matrix, a benefit trace is intended to help direct you from customer benefit to actual code change.

For example, let's take a story that will result in us changing the scheduler in use in the OS (this is a particularly low-level change). And let's trace that back to the customer benefit that is, we hope, why we're doing the story in the first place:
  • The change: Switch to a faster scheduler
  • A faster scheduler will be able to schedule rebuild operations more quickly
  • Faster rebuild operations give us the ability to remove data from degraded hardware more quickly and get it on to safer (i.e., non-degraded) parts of the storage system.
  • Less time spent on degraded hardware lowers our risk that the hardware will die completely or that another piece of hardware will die before the rebuild is complete.
  • Less time performing rebuilds makes our exposure to (inevitable) hardware loss much lower
  • Lower exposure to hardware loss makes our system safer
  • The Benefit: The mean time to data loss increases, and our customers' data is safer.
Without the benefit trace, we might have pushed the story off. After all, who really cares about the scheduler? And now we know: this change will make our customers safer, and that's a very important thing in an enterprise archive product.
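
If you want to keep traces like this around - on a story card, in a wiki, wherever - the structure is just an ordered chain from change to benefit. A minimal sketch in Python using the scheduler example above; this is illustrative, not a real tool.

```python
# A benefit trace is an ordered chain of "which leads to" steps, from change to benefit.
scheduler_trace = {
    "change":  "Switch to a faster scheduler",
    "because": [
        "A faster scheduler schedules rebuild operations more quickly",
        "Faster rebuilds move data off degraded hardware sooner",
        "Less time on degraded hardware lowers the risk of a second failure mid-rebuild",
        "Lower exposure to hardware loss makes the system safer",
    ],
    "benefit": "Mean time to data loss increases; customers' data is safer",
}

def print_trace(trace):
    """Walk the chain so anyone can follow change -> reasoning -> customer benefit."""
    print("Change:  " + trace["change"])
    for step in trace["because"]:
        print("  which means: " + step)
    print("Benefit: " + trace["benefit"])

print_trace(scheduler_trace)
```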

So sure, Extreme Programming doesn't really have the concept of a formal traceability matrix, but it does have the idea of understanding why you're doing things. So be sure you understand not only the action, but the benefit, and why the action gives you the benefit.

Friday, July 18, 2008

Constraints Have Value

The systems we test are full of constraints.
  • The system can't handle more than 250 volumes.
  • We have to ship by August 31, no matter what.
  • We've chosen to use the singleton pattern here.
  • No double-byte characters allowed in login names.
Now, this seems really obnoxious. We're telling users "no" and "can't", either directly or indirectly.  Although we complain about these constraints (and our users do, too!), constraints are good for users, in general. Without constraints, users have no guidance. Telling users what they can't do is tantamount to telling them what they can do.
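
Knowing your constraints also means you can enforce them explicitly and state them plainly, instead of letting users trip over them. A minimal sketch in Python using two of the constraints above; the limits come from the list, while the function names and messages are made up.

```python
MAX_VOLUMES = 250  # from the constraint list above

def validate_new_volume(current_volume_count):
    """Refuse to exceed the documented volume limit, and say so plainly."""
    if current_volume_count >= MAX_VOLUMES:
        raise ValueError(
            f"This system supports at most {MAX_VOLUMES} volumes; "
            f"you already have {current_volume_count}."
        )

def validate_login_name(name):
    """Enforce the 'no double-byte characters' rule, approximated here as ASCII-only."""
    if any(ord(ch) > 127 for ch in name):
        raise ValueError("Login names may contain only single-byte (ASCII) characters.")

validate_login_name("jsmith")      # fine
validate_new_volume(120)           # fine
# validate_login_name("José")      # would raise: tells the user the rule, not just "no"
```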

So remember, constraints are okay. What's not okay is not knowing your constraints. So find the boundaries of your system and tell users what they are. Feel free to live by the rules!

Thursday, July 17, 2008

Don't Fear the Checklist

I was at CAST 2008 earlier this week and was listening to Cem Kaner's presentation about checklists and their adaptation to software testing. He made several points, and I found myself nodding my head and agreeing (I'm paraphrasing, for the record!):
  • Checklists describe the goal, not the steps.
  • Checklists provide room for people to think; scripts deny testers the flexibility to see the whole picture.
  • Scripts only check for what we thought to explicitly verify; checklists allow you to adapt your verification to what you're seeing in the system.
Amen, brother! I've hired thinking testers. Let's let 'em think!

But....

But....

But...

Checklists are scary!

Wait a minute. Checklists are tools that give testers hints and increase coverage, while leaving them the freedom to think about and dive into a system. That's good!

The problem with checklists is twofold: (1) how do you know when you've hit a specific test case; and (2) how do you document what you actually tested? I'm not going to talk about the second of these; that's a whole separate blog post. But how do you make sure you hit a specific test case?

Let's step back for a minute and talk about why we might want to be certain we hit a specific test case. I can think of two things:
  1. We had a bug here and we need to be sure we haven't reintroduced it.
  2. It's a particularly sensitive action/configuration/setting and we don't want to go to our biggest (or most important, or loudest) customer without being very sure that this will work.
Our checklist doesn't cover that. Our checklist is goal-oriented, not configuration- or step-specific. And that's why checklists are scary.

Now, what can we do about it?

There are a lot of options, ranging from "hope you hit it every time" to "forget checklists! scripts are safer!". Instead of either extreme, we can simply acknowledge the realities of the situation. These specific things are important, so we need to write them down in the test plan. They don't have to be in our checklists; they can be elsewhere. They just need to be there. And then we need to run these specific things. They can be manual or automated, as long as they get run.
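
One way to "write them down and make sure they get run", sketched in Python: keep the checklist as free-form prompts for a thinking tester, and keep the must-hit cases in a small, explicit list that gets checked off (by a script or by a person) every release. All the names and cases here are invented.

```python
# Free-form prompts for a thinking tester; goal-oriented, not step-by-step.
checklist = [
    "Exercise upgrade paths from the oldest supported release",
    "Poke at login under bad network conditions",
    "Try the report export with ugly data",
]

# Specific cases we are NOT willing to leave to chance: past bugs, big-customer configs.
must_run = [
    {"id": "BUG-4821", "case": "re-import after a cancelled import", "automated": True},
    {"id": "CUST-A",   "case": "nightly backup with 200+ volumes",   "automated": False},
]

def release_status():
    """Refuse to call the checklist 'done' until every must-run case is accounted for."""
    missing = [c["id"] for c in must_run if not c.get("run", False)]
    return "ready to ship" if not missing else f"still need to run: {', '.join(missing)}"

print(release_status())  # 'still need to run: BUG-4821, CUST-A'
```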

So go ahead, use checklists where you want testers to actually dig in and think, and use scripts where specificity really matters. Take the benefits of your checklists and eliminate the areas that are scary by scripting them away. After all...

Checklists aren't scary!

Wednesday, July 16, 2008

Eureka Moment

I think the most famous moment in all history is the eureka moment. And in our own ways, we have small eureka moments.

...the moment when you finally understand the problem you've been working on for weeks.

...the moment when a graph clicks and you can see what your data has been trying to tell you.

...the moment you put pen to paper and can actually draw the system diagram.

Some eureka moments are enough to make you want to dance around the office (note for those reading: please don't!). Others, though, are anticlimactic. When you get there, it simply looks like the next logical step. If you haven't done it before, though, that next logical step is still a eureka moment, however small.

I think as leads we have a responsibility to recognize growth in ourselves and in the people we work with. So take the time to notice the small eureka moments - yours and others - and celebrate them, even the ones that are just the next logical place to go.

Tuesday, July 15, 2008

Talking to Non-Testers

Tests and test results are wonderful and interesting things... for testers. Eventually, though, you'll have to report your data to other parts of the organization. And herein lies the trouble: how do you effectively communicate with a non-tester?

Communicating test results within a test organization is easy for one reason: shared context. You all know what you're talking about. Sure, other groups and other companies may use different jargon, but within your group there is a shared definition of terms that facilitates discussion.

Non-testers don't have context. Non-testers don't know the things you know, and don't use the same terms you do with the same meanings. So how do you communicate effectively when you're talking with someone who doesn't quite speak your language?

When you're creating results or putting together communication to non-testers, there are a few things to remember:

Don't Talk Down
You're in a professional organization, likely a professional software organization. The people around you are almost certainly educated and spend their days thinking, just like you do. Don't talk down to them; being condescending is a really good way to make people stop listening. Sure, the marketing director doesn't understand your test report at anything other than a shallow red/green level. I'd wager you don't understand the marketer's SEO report to any deeper level. Neither of you is dumb for this, just uninformed, so don't talk like any moron understands a test report and it takes a special marketing moron to really not get it.

Avoid Jargon
Jargon is a fancy way of doing two things: (1) making someone else feel excluded; and (2) advertising your brilliance. When you're trying to communicate with someone, both these things are bad. Trying to make someone else feel excluded by throwing around fancy phrases is just another way of talking down to that person. And making yourself look smart is all well and good, but you'll look smarter if you can do AND teach. So instead of saying, "we did equivalence class partitioning", say, "we classified all the data into groups and made sure we tested something from every group." For the level of test results reporting, these are saying the same thing, and the latter uses words that your marketer or sales guy or product manager can understand. Now the 17 columns of green marks start to make sense: "oh! Those are your groups!"

Provide Background
Don't just say that the tests failed and now we really need to push back that marketing campaign. Provide context. It doesn't need to be much more than a couple of sentences, but it does need to describe why the test was done and what on earth it is. And keep in mind, no jargon. Don't even call it a functional test or a performance test. Just say, "we were checking out the banner feature for the next release and...." or "Our reference customer does X a lot, and before the beta goes to that customer we want to make sure that their experience doesn't deteriorate because we really want to keep this customer as a reference."

Be Relevant
Your message needs to matter to your audience. Telling a sales person about queue lengths will get you a glazed look. Telling a developer about queue lengths, on the other hand, will both reiterate your technical credentials and will provide something tangible the developer can use to look into the system. So tell people the aspects of the test that they care about. Tell marketers how their banner features and big ad campaigns will be affected. Tell sales people how safe their software is and how you're thinking of their current customers and their prospects. Tell developers what the problem really is and where to start looking for a solution.

Cede Power
Ideally, as you communicate, you hand power over to the other party. When someone feels powerful, they're more likely to communicate effectively. You're trying to explain a foreign concept to someone, and they're already going to be uncomfortable. Giving the person (or group) the power to control the communication alleviates that discomfort somewhat. This will also help you understand what the person you're talking to really needs to know.

Conclude First
Once you have a basic context, lead with the good stuff - your conclusions. The person or group you're communicating with can ask questions for as long as they like, or they can walk away (back to the power thing above), and either way, they got what you really needed them to get. The ultimate thing you're communicating is the outcome of your tests. If they care about how you got there or the next set of tests or anything else, that's great, but if there's only one thing that you can communicate, it should be your results.

So, long story short, when you're communicating with non-testers you can't just dash off a quick email and expect everyone to both understand and care. Recognize that they don't want to enter your world (probably no more than you want to enter their world!) but they do want to know what you know, at least as far as it matters to their day-to-day activities. Take the time to tailor your message for your audience and watch how much farther it goes.

Monday, July 14, 2008

Simplification

I've been on a "simplify, simplify" trend recently. Complex systems are a fact of life, and sometimes we have to address that complexity head on. However, for many tests and for many types of tests, addressing a complex system isn't really necessary. We can often break a system into its constituent parts and test those.

Breaking a system into its constituent pieces and identifying those pieces gives us direct access to each part of the system for testing. We can better exercise things we can interact with directly. Inserting abstractions - other methods or entire layers - makes it harder to actually touch the thing we're testing. Take this notion far enough and we've got a unit test. Take it a bit less far and you've got a manageable test that exercises what you want it to, and nothing extra.

How simple is simple enough? The short answer is "as simple as is relevant and no simpler". So our challenge is to figure out how small the piece of the system can be and still produce a test with all the relevant variables. There is no magic tool for this; it requires pretty good knowledge of your system. Start with the entire system and just keep removing cruft (aka things in your system that are great and relevant, just not to this particular test) until removing something changes your test. I find it easier to go through this paring-down process with just a few test cases; that makes it much easier to see when your results change and you've pared down too much.
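
As a toy illustration (in Python, with invented names): compare testing a formatting rule through a whole import pipeline with testing the one function that actually implements the rule. The second test touches only the relevant variable, so when it fails, it points straight at the cause.

```python
def normalize_name(raw):
    """The one piece we actually care about for this test: trim and title-case a name."""
    return raw.strip().title()

def import_record(raw_line):
    """A larger pipeline that happens to use normalize_name (a stand-in for 'the system')."""
    name, email = raw_line.split(",")
    return {"name": normalize_name(name), "email": email.strip().lower()}

# Indirect test: exercises parsing, email handling, AND normalization all at once.
assert import_record("  ada lovelace , ADA@EXAMPLE.COM ")["name"] == "Ada Lovelace"

# Direct test: as simple as is relevant, and no simpler.
assert normalize_name("  ada lovelace ") == "Ada Lovelace"
```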

So, when complexity is needed - integration tests, full systems tests, etc - tackle the complexity of your system head on. When complexity is not needed, though, test the system simply and directly. You'll get more precise results... without pulling your hair out!

Friday, July 11, 2008

How Much Is Enough?

Certain kinds of tests are non-determinative. That is, you cannot state a single expected result; you can only say whether the result falls within acceptable levels. Performance tests are a good example of this: 10.5 MB/sec and 10.6 MB/sec may both be valid results if anything over 10 MB/sec is valid. Race conditions are another good example. If you don't see the problem, how do you know if you've fixed it or if you just haven't triggered it yet?

In both of these cases, there is no simple assert true on something. Instead, you have to assert within a range, or assert outside of a range. There are two interesting things to consider about this kind of test:
  • What is the acceptable variance in results?
  • How do you assert that the probability of something happening is vanishingly small?
And here's where statistics meets QA.  Let's take these two example problems one at a time:

Proving That Performance Is Better Than X
Assume we've already done our background work and we know that on a given system in given conditions with known data patterns, performance of 10 MB/sec is the minimum acceptable level. Anything over that is good.

So we write a performance test. It sets up the system, writes in that data pattern, and checks throughput. If throughput is greater than 10 MB/sec, we pass! Except.... Passing once isn't good enough. There are a lot of things that could change: network utilization, pre-existing disk usage, etc. Our performance over many runs is going to resemble a bell curve, and we need essentially the entire bell curve to sit above 10 MB/sec. So we have to run our test enough times to show that the bell curve clears our minimum performance requirement.

Hold that thought for a minute while we talk about our other example.

Proving That a Race Condition Is Fixed
We had a race condition (oops!). Now we've fixed it! We can't just run a test once; if we didn't see it, then we don't know whether it was fixed or whether we simply didn't hit the condition. So, how do we prove to our satisfaction that it's fixed?

Again, we're looking at a situation where we need to write a test that exercises the race. Then we run it in a loop, figure out how often it fails on the old code, and then run it on the new code. Basically, we can't say we've proved the fix, but we can run the test enough times such that the race has a statistically tiny chance of still being present. And now we've looped back to our bell curve.
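
To put a number on "statistically tiny": suppose looping the test on the old code showed the race about once in every 20 runs. Each clean run of the new code is weak evidence on its own, but enough of them together pin down how unlikely it is that the race is still there. A sketch of the arithmetic in Python; the failure rate and the 1% target are made up for illustration.

```python
import math

# Suppose the old code hit the race in about 1 of every 20 runs (measured by looping the test).
p_old = 1 / 20

# If the race were still present at that rate, the chance of N clean runs in a row is (1 - p)**N.
def chance_of_missing(p, clean_runs):
    return (1 - p) ** clean_runs

# How many clean runs until that chance drops below, say, 1%?
target = 0.01
runs_needed = math.ceil(math.log(target) / math.log(1 - p_old))
print(runs_needed)                      # 90 clean runs
print(chance_of_missing(p_old, 90))     # ~0.0099 -- under 1%
```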

The Underlying Method
The underlying method is the same in both examples. You know that your responses in these cases will have some variance, and that variance is likely to be a fairly standard bell curve.


So, you create a test that you can run enough times to produce your bell curve. Then you plot your results and show that the value you care about (the odds of hitting the race condition, or the minimum acceptable performance) sits far enough out on the tail of the bell curve that you're satisfied.
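
Here's a minimal sketch of that method in Python, with invented throughput numbers: instead of literally plotting, it computes the bell curve's parameters (mean and standard deviation) from repeated runs and checks that the minimum requirement sits several standard deviations below the typical result.

```python
import statistics

MIN_MB_PER_SEC = 10.0

# Throughput from repeated runs of the same performance test (invented numbers).
results = [10.9, 11.2, 10.7, 11.0, 11.4, 10.8, 11.1, 10.9, 11.3, 11.0]

mean = statistics.mean(results)
stdev = statistics.stdev(results)

# How many standard deviations of margin sit between the requirement and our typical result?
# Roughly 3 sigma covers about 99.9% of a normal distribution, which may be "satisfied enough".
sigmas_of_margin = (mean - MIN_MB_PER_SEC) / stdev
print(f"mean={mean:.2f} MB/sec, stdev={stdev:.2f}, margin={sigmas_of_margin:.1f} sigma")
print("pass" if sigmas_of_margin >= 3 else "run more tests or dig into the variance")
```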

The good news about this method is that it's a fairly straightforward thing to tweak. If your tolerance for risk is higher, you can run the test fewer times - you'll just end up at a lower confidence level. If your tolerance for risk is very low, you can run the test more times, and that will increase your confidence level.

So, while assert true is a simple and comforting thing, when you're faced with a non-determinative test, embrace the bell curve!

Thursday, July 10, 2008

CAST 2008

Just a quick note that I will be at CAST 2008 next week. Looking forward to seeing people there!

Wednesday, July 9, 2008

Problems in the Small

Up front disclaimer: this is just an idea. I haven't actually tried this yet.

We've been talking a lot at work about extended, customer-simulation tests. The idea is that you get close to a customer's environment (or a potential customer's environment), and then the wacky timing issues, edge cases, and other nefarious bugs will start to show up. The problem with this is twofold: (1) it takes a long time and a lot of resources; and (2) it can't tell you that a thing works. It can only tell you that a given situation either was handled properly or hasn't occurred yet.

What if we looked at the problem the other way around? Basically, we're trying to simulate a large environment with the idea that we'll eventually hit the rough edges of the code. What if, instead, we looked at the problem in the small? Take one tiny area of code, and spend a lot of time putting random things around it.

An example might be useful.

Let's say we have a system that logs data. The client writes a data stream to it, and internally, the system stores files in a given format. At some point in the lifecycle of the system, we have an upgrade that changes the format of the files. This is a fairly standard thing, and many of the unit tests are pretty obvious:
  • add file after upgrade
  • read file added before upgrade
  • modify file added before upgrade
  • modify file modified before upgrade
  • delete file added before upgrade
  • delete file added after upgrade
  • modify file added after upgrade
  • etc...
But there are edge cases, and we're not going to think of them all. And there are race conditions and other things that simply aren't going to be caught by unit tests.

When we do our functional testing through the external client simulation, we're not even working on files; we're working on the log stream. We're just hoping that the system, through unit tests and random luck, happens to cover all the cases. By moving out a layer, we've abstracted ourselves from the thing we're trying to test - it's like trying to type with chopsticks!

So what if we do a really tiny randomized test? It might look something like this:
  1. do a bunch of operations - adding, deleting, modifying, reading, etc. - at the lowest level the system handles files.
  2. upgrade
  3. do a bunch of operations - adding, deleting, modifying, reading, etc.  - at that same lowest level the system handles files.
It looks a lot like our big test that "simulates a client environment", but it takes the idea down deeper into the code. This would be done at the internal file-handling level, instead of the more abstracted logging level. We still get the timing issues that shake out of a semi-random simulation test, but we don't get the layer of abstraction that hides them.
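
A rough sketch of what such a test loop could look like, in Python. The dictionary standing in for the internal file-handling layer, the no-op upgrade, and the operation mix are all hypothetical; in the real test, the store and the upgrade step would be the actual code under test.

```python
import random
from itertools import count

def random_operations(store, files, rng, ids, operations=1000):
    """Hammer the lowest file-handling layer with a random mix of operations."""
    for _ in range(operations):
        op = rng.choice(["add", "modify", "read", "delete"])
        if op == "add" or not files:
            name = f"file-{next(ids)}"
            store[name] = rng.random()        # contents don't matter for this sketch
            files.append(name)
        elif op == "modify":
            store[rng.choice(files)] = rng.random()
        elif op == "read":
            _ = store[rng.choice(files)]      # a KeyError here means we lost a file
        else:  # delete
            name = files.pop(rng.randrange(len(files)))
            del store[name]

def upgrade(store):
    """Stand-in for the real format upgrade; in real life this is the code under test."""
    return dict(store)

rng, ids = random.Random(1234), count()       # fixed seed so any failure is reproducible
store, files = {}, []
random_operations(store, files, rng, ids)     # 1. operations before the upgrade
store = upgrade(store)                        # 2. upgrade
random_operations(store, files, rng, ids)     # 3. the same mix of operations afterward
print(f"survived with {len(files)} files still in the store")
```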

I'm interested to try this. We'll see if it works!

Tuesday, July 8, 2008

What and Why

I find myself asking for things a lot.

"Hey, can you please take a look at this bug?"
"I was wondering about this. Can we look?"
"Okay, we think we need a build with gprof enabled."
"We're happy to take a peek at the upgrader, but we're going to need a build."
"Bug 12345 is killing us. Any way we can put it near the top of the queue?"

Asking for things is a fact of life in a highly collaborative environment. I get asked for things quite often, too, and that's okay. Simply asking, though, isn't enough to get it done. As the asker, the onus is on me to make it easy for the person to do whatever I'm asking.

There are two pieces of information that give you a much higher chance of actually getting a response to your request: (1) what you want, and (2) why you want it.

Describe not only the problem but also what you want done about it. Simply saying, "This is a problem." leaves ambiguity. The person you're asking for something may not do anything (after all, you haven't actually asked for some action), or may do something you don't expect.

Describe why you want this thing. This helps the other person invest in the issue. The request matters more if the answer to "why am I doing this?" is apparent to the person who will actually do the work. It's really about respect, in this case; you're telling the other person that you understand the value of their time and that you need to justify (i.e., tell them why) spending it on this.

So feel free to ask for something. Feel free to be asked for something. Just make sure you let people know what you want and why, and it won't get pushed to the bottom of the person's queue.

Monday, July 7, 2008

Bowl of Marbles

I was thinking over the weekend about the classic "Project Triangle" of software.
This triangle is a bit of a cliche, but there is definitely a nugget of truth in there. You really can't have it all, no matter how much you want it. The problem I have is that it's a bit hard to explain - drawing triangles out is a little bit abstract when you're explaining this principle - and more than once I've gotten the "well, let's just add some resources and make the triangle bigger" response.

The fundamental problem that we're trying to explain here is that releases are dependent on several factors:
  • How many resources you have (people, machines, etc)
  • How much time you have
  • How many features you want
  • How many hurdles (read: tests that prove something about the feature) it has to clear (aka "How good it has to be")
  • How well-designed you want it
Changing any one of these changes the release itself. Now the other interesting thing is that the point in the project at which each of these becomes fixed varies. Design, for example, often gets fixed early; you're not likely to change your fundamental design at the end of a project. The same holds true for resources; at some point, throwing more resources at the problem no longer helps. Time and quality tend to be things that can change right up to the end of the project.

In trying to explain this to people who don't work in software (or who don't work deeply enough in software to understand its limitations), I tend to wind up using a different metaphor. Here's the story I tell:

When we start on a release, we make a few choices that really govern everything else that goes on. We choose a basic scope for the release, and we choose a design.


In essence, we're pulling a bowl off a shelf. We can go back to the shelf and change bowls later, but it's going to cost us a lot, and as our bowl gets full (i.e., as we get later in the release), we are restricted in the bowls we can choose and still have everything fit.

Second, we choose the features we want in the release. These can be new features, enhancements to existing features, bugs, whatever. We wind up with a big set of marbles that we're going to work on.


As development continues, we pick the marbles up, work with them for a while, and then put them in the bowl. This is where things start to get interesting. If we've chosen marbles that are too big, or too many of them, they won't all fit in the bowl. If we've chosen too few marbles, we'll have a half-empty bowl.

We still have options at this point:
  1. We can take marbles off the table. This is the metaphor for removing features (or bugs or whatever) from the release. The catch is, once we've started work, it's harder to remove the marble from the table; that is, it's harder to pull the code out once we've started it. 
  2. We can take marbles out of the bowl. If the marble (aka the feature) is too big and we want more of the smaller marbles, that's possible. It's a lot harder than if we hadn't put it in to start with, though.
  3. We can go get a different bowl. If we get a bigger bowl, it'll fit, but we lose some time changing the bowl. If we get a smaller bowl, then not everything fits and we've made a mess! (This is what we call a disastrous release!) This is the metaphor for changing the release size or design. It's doable, but there's increased risk with it.
So you have lots of choices and lots of ways to make a release happen - different bowl sizes, different marbles - but in the end there are some things that can't be changed. All you can do is pick a bowl, pick some marbles, and make sure the marbles fit inside the bowl.

Thursday, July 3, 2008

Decision Makers and Decision Responders

When there are two or three people in your (software) company, decision making is really easy: either one person makes the decision or everyone does. In general, this makes sense because when you're that small, pretty much everyone is doing everything, and pretty much everyone knows all the stressors and influences that might affect a decision.
  • "What should I call this variable?": one person
  • "Is this algorithm appropriate?": probably one person
  • "Are we ready to release this version?": probably everyone
  • "Should we take on this really big contract?": probably everyone
When the company is a little bit bigger and there is more specialization of roles, it gets rather trickier. Now you've got a sales guy and a developer and a QA type and a marketing guy and a support guy. You're too big to put everyone in a room for every single decision.
  • "What should I call this variable?": still only one person
  • "What color should our logo be?": one person (but now it's a different person!)
  • "Are we ready to release this version?": the developer, the QA type, the support guy, and maybe the marketing guy.
  • "Do we want to take on this really big contract?": still everybody. This is what they call a game-changing event!
Eventually, someone's going to feel left out of a decision they wanted to be part of, or someone's going to feel involved in a decision that's way over their head. This is when you need to figure out the difference between people who need to be part of decisions, people who need to be informed of decisions, and people who don't care. The last category is usually the easiest: in general, asking them which option they prefer will get a response of "I don't care" or "I have no idea. Y'all pick." The first two categories are harder to distinguish. In the end, I have one rule of thumb for who needs to be involved in a decision:

The people who are involved in making the decisions should be the people who provide value to the decision. 

If someone has information or could say something that would alter the outcome of the decision, then that person belongs in the decision making process. Everyone else can simply be informed later.*





* Yes, this sounds harsh, and it is. But decision making is about getting the right decision made, not about making people feel good. Look for other areas to make employees feel valued, areas where making them feel good doesn't slow the company down. And remember, the employee has a responsibility to the company as well; if you work hard and add value, then you'll be involved in the decision merely by having the knowledge and skills necessary to make a good decision.

Wednesday, July 2, 2008

Time Bombs

In some areas of products, when something fails you know it immediately. You click a button and the system throws an error, for example. Some failures are trickier, though; they may fail and you may not know it for a while. Congratulations, you've hit a time bomb.

There are two kinds of time bombs: those caused by something in the product, and those caused by something outside the product.

Product-Based Time Bombs
Often, time bombs inside the product are in areas where there is a secondary process not essential to the basic function of the system. Think monitoring processes, logging processes, cleanup processes, etc. These are things that can fail and the system behavior won't be any different... on the surface.

Noticing a product-based time bomb before it causes a problem is a matter of proactively looking at logs and finding the clues. Sometimes there's an error or warning with no effect, sometimes there's a process not running that should be. The symptoms vary based on feature.
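
That proactive look can be partly automated. Here's a minimal sketch in Python; the process names and log path are made up, and on a real system you'd check whatever your product actually depends on.

```python
import subprocess

# Hypothetical background processes and log file, for illustration only.
EXPECTED_PROCESSES = ["monitoring_agent", "log_rotator", "cleanup_daemon"]
LOG_FILE = "/var/log/myproduct/system.log"

def missing_processes():
    """Return expected background processes that are not currently running."""
    running = subprocess.run(["ps", "-eo", "comm"], capture_output=True, text=True).stdout
    return [name for name in EXPECTED_PROCESSES if name not in running]

def quiet_warnings():
    """Return warning/error lines -- the 'no visible effect yet' clues worth reading early."""
    with open(LOG_FILE) as log:
        return [line.rstrip() for line in log if "WARN" in line or "ERROR" in line]

for name in missing_processes():
    print(f"time bomb candidate: {name} is not running")
for line in quiet_warnings()[-20:]:
    print(f"worth a look: {line}")
```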

Environment-Based Time Bombs
These are the really fun ones. Something not in your product changes and causes a problem down the line. Culprits here include DNS or other network changes, and changes in any third-party systems with which your system interacts.

These are particularly tricky because there will likely be little to no indication of a problem in the logs. Your best bet here is to cultivate a good relationship with those who control these areas and make sure they let you know when something happens.

Finding a Time Bomb
The time to think of time bombs is when you're faced with an error or failure that doesn't seem to have a proximate cause. This is particularly true if it's something that depends on a feature or process that doesn't get touched very often. Think of a time bomb if you have:
  • A full disk
  • A failed log parse
  • Difficulties talking to machines that are up
  • A monitored thing failing with none of the expected results (notification, restart, takeover, etc.)

When you suspect a time bomb, there's really only one way to figure it out: find out when the background process or external dependency last worked successfully, and start looking at what has failed since then.

And when you have found your time bomb, make sure your fix includes preventing the failure and making any failure shout more loudly so you notice it.... before the bomb goes off.

Tuesday, July 1, 2008

Implicit Assumptions

I've written before about why you ask questions. One other thing that is important is HOW you ask a question. Asking the question different ways will get you very different answers.

The Setup
Let's say we're debugging a problem in which machines in your lab keep being seen as down. The system is such that when a machine is seen as down, it could have sensed a problem and turned itself off, or it could have failed to respond for long enough that the other machines marked it down. About once a day, a machine will be marked as down. Rebooting the machine and adding it back has no ill effects.

The Desire
The desire is for machines to only be marked as down when there is a real problem with them. Currently, the system may be deviating from that desirable behavior (or there may be an undetected but real problem with the machines).

The Question
Here is where it gets interesting. How do you find out what's going on, and get to your desired state? There are several ways to ask the questions:

Option 1: "The system is spuriously marking machines as down when they're fine. The system uses pings to figure this out. Why aren't the pings returning in time?"
This option has a whole lot of assumptions in it, and colors the direction of investigation. Here we're already positing:
  • that the machines are fine, 
  • that the system is attempting to detect the state of the machine, 
  • that the machine is not sending out erroneous trouble messages,
  • that the problem is in pinging the other machines,
  • that the system is either not receiving or ignoring the ping responses
If you already have all that information, then this is a perfectly valid question, but it shouldn't be the first way you ask the question based only on the problem description above. This phrasing is also likely to make the person responsible for pings a bit defensive, since his area is being singled out before we've even proven that's where the problem lies.

Option 2: "Machines seem to be getting marked as down more than we would expect. We need to figure out if the machines are really okay, and why they're getting marked as down."

This is roughly how I try to ask the question. Give a summary of what's going on and why it's perceived as a problem. Then offer a couple of concrete directions that, when understood, will eliminate problem areas. The goal is to focus effort without eliminating paths of inquiry; make the first part of the effort about subdividing the problem.

Option 3: "Why is this behavior occurring?"

This fails to state the problem. It merely asks for an understanding of behavior, and lacks focus. It's really only good for giving to people who know the system very well. Left alone there are a whole lot of paths to pursue; having someone who knows the system and the desired behavior will help focus the issue, or the question can be modified to do that.

Question Parts
There are several parts to asking a debugging question like this successfully:
  1. State (or reference) the desired behavior.
  2. State (or reference) existing knowledge.
  3. Do not eliminate avenues of investigation or areas of a system unless that is backed by concrete information.
  4. State the success criteria; that is, under what conditions will the question be completely answered?
Just by understanding the system and the problem well enough to articulate the question, you assume some authority. Your listener will believe you have insider knowledge. So phrase accordingly.... and carefully.