Monday, March 31, 2008

Clueful Hands

On the good news, bad news front....

The good news: where I work we have a culture in which everyone pitches in when asked. Not your job? Who cares!

The bad news: how do you properly ask for help when no or few special skills are needed? The traditional project management phrasing "I need some resources" is rather demeaning when you're asking for volunteers.*

So around here we ask nicely. We ask for "clueful hands" to help.

Asking for help in a tech shop is tricky. You need people with a specific set of skills (don't ask the Windows desktop support guy for help reconfiguring the Cisco router). However, you're also asking for help, and beggars can't be choosers. So ask for hands (or bodies, but I find that a bit more likely to offend), but be sure to ask for clueful hands. It gives the people who would like to help but don't have the skills a graceful way to back out.

* On a totally unrelated side note, I was at a company once where it slipped out that someone needed "a dev resource" in a meeting with development. We called each other resource for months after that: "Hello, QA resource!" "Hi, dev resource! How was your weekend?"

Friday, March 28, 2008

Meetings vs Conversation

One of the things I enjoy about my current job is how few meetings I go to.  Right now I have four (count 'em, 4) meetings a week. I can't say I'm a fan of meetings in general, but I haven't figured out a way to not need the few I do go to.

Despite how few meetings I have, somehow I look back at the end of the day and realize I spent a chunk of time away from my desk, with other people. If it wasn't a meeting, what on earth have I been doing all day? Well, really, I've been having conversations. And that's perfectly fine with me.

So to summarize:
Meetings bad (usually). Conversation good (usually).

So why do I like meetings much less than conversation? Let's compare and contrast:
  • Obligation versus Purpose. Meetings are an obligation; they're something you do because it pops up on your calendar and says, "time to go!".  Most meetings have an agenda and some weeks there is more to talk about than other weeks, but you always go anyway. A conversation has a purpose; you start a conversation (or someone starts a conversation with you) because you want to accomplish something.
  • Rigid versus Agile. A meeting has an agenda. It's what you talk about pretty much every time the meeting comes around. A conversation is more agile. You generally have an idea of what you want to talk about when you start the conversation, but it can meander and change topics, and you don't get the dreaded "let's take this offline" postponement. I find I solve more types of problems in one conversation than I do in several meetings, and that's because we can follow the thread from problem to problem.
  • Waiting versus Doing. Sometimes I'll have something I want to talk to a person about, and I find myself saying, "well, I'll just cover it in the meeting". All well and good (I've given the meeting more purpose), but really why should I have waited those few hours or days? Why not have the conversation? Having a set time to talk about things (a meeting) ensures that these items get talked about no later than the meeting, but they also encourage those things to get talked about no earlier than the meeting, and that's counterproductive.
All this being said, there are some purposes for meetings, but I keep to as few meetings as possible. Give me a conversation over a meeting any day!

Thursday, March 27, 2008

Neat Trick

Often when I'm writing tests, I'll wind up with a series of test cases that are all nearly the same and that try to exercise a single thing in several ways. In this case I find myself with two problems:
  1. I don't really care about the other data in the form as long as it doesn't interfere.
  2. I really hate copying code or data over and over.
So how do we solve this? Well, this may be test code, but it's still just code. So what do we do? Refactor and eliminate duplication - of code AND data.

This is probably easiest to explain by example.

Let's say we have a registration form with several fields (first name, last name, username, password, and confirm password). At the moment we're interested in testing the first name field. This means we don't care what "last name" is as long as it's valid. Same goes for username, password, and confirm password.

First Attempt
So we write a simple test:
def test_first_name_one_word
  num_reg = Registration.count
  get :new
  put :update, :id => assigns(:registration).id,
      :registration => {:first_name => "John",
                        :last_name => "Smith",
                        :username => "jsmith",
                        :password => "12345",
                        :confirm_password => "12345"}
  assert_equal num_reg + 1, Registration.count
end

(I'm leaving out some assertions just to make this easier.)
Awesome, this is great. My test passes, and life is good.

Problem 1: Many Registrations
Now wait a second. I'm going to have a LOT of registrations by the time I'm done writing the tests for all of these fields. What if my product manager comes back and adds a field then? That's a whole lot of code to change.... ick!

Okay, let's refactor a bit. We'll put the actual registration into a method!

Second Attempt
Create a registerUser method that actually does the registration, then I'll just call it from each of my tests.

def registerUser()
  get :new
  put :update, :id => assigns(:registration).id,
      :registration => {:first_name => "John",
                        :last_name => "Smith",
                        :username => "jsmith",
                        :password => "12345",
                        :confirm_password => "12345"}
end

And my tests will call it:
def test_first_name_one_word
  num_reg = Registration.count
  registerUser
  assert_equal num_reg + 1, Registration.count
end

Cool. That works as far as my what-if-registration-changes scenario.

Problem 2: Changing Data
Well now that I have a separate method I need to handle data that might not always be easy to get. I'm not always going to want the same first name, etc.

Third Attempt
So the first thing we're going to have to do is create a registration object with all the data I need in it. I'll put that in a method so it's really easy to call, override or extend later.

So first I create an object:
def reg_attributes()
  {:first_name => "John",
   :last_name => "Smith",
   :username => "jsmith",
   :password => "12345",
   :confirm_password => "12345"}
end

My registerUser method will then use the object I've created.
def registerUser()
  get :new
  put :update, :id => assigns(:registration).id,
      :registration => reg_attributes
end

And my tests will call the register method.
def test_first_name_one_word
  num_reg = Registration.count
  registerUser
  assert_equal num_reg + 1, Registration.count
end

Problem 3: Still Not Changing Data
Okay, my data is all nice and isolated, and I can use it for all sorts of things - registration, login, etc. But I still can't change the data.

Fourth Attempt: The Neat Trick
This is where things get really neat. What I want to do is use the basic object I'd provided myself (the reg_attributes method), but override the pieces I need to. So what I do is take advantage of two Ruby features: default argument values and Hash#merge, which takes the values I pass in and falls back to the defaults for everything else (other languages have equivalents, of course, so check what yours does).

So what I'm going to do is provide a way to pass in a hash of overrides. Then I'll just plumb that all the way through and use it in my test.

Here's my override-able object (defaulted to an empty hash):
def reg_attributes(overrides = {})
  {:first_name => "John",
   :last_name => "Smith",
   :username => "jsmith",
   :password => "12345",
   :confirm_password => "12345"}.merge(overrides)
end

My registerUser method will then take in any overrides I pass it and hand them along to the object I created.
def registerUser(overrides = {})
  get :new
  put :update, :id => assigns(:registration).id,
      :registration => reg_attributes(overrides)
end

And my tests will call the register method with any overrides that it wants.
def test_first_name_one_word
  num_reg = Registration.count
  registerUser(:first_name => "Mary Jane")
  assert_equal num_reg + 1, Registration.count
end

The advantage to doing it as a hash is that it can have many members, or just one, or none at all. So I can pass in a first name, a last name, or both equally easily.
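Stripped of the Rails-specific pieces, the whole trick is just a default argument plus Hash#merge. This standalone sketch (same hypothetical fields as above) shows the override behavior on its own:

```ruby
# Defaults live in one place; merge lets any passed-in keys win.
def reg_attributes(overrides = {})
  {:first_name => "John",
   :last_name => "Smith",
   :username => "jsmith",
   :password => "12345",
   :confirm_password => "12345"}.merge(overrides)
end

reg_attributes[:first_name]                              # "John"
reg_attributes(:first_name => "Mary Jane")[:first_name]  # "Mary Jane"
reg_attributes(:first_name => "Mary Jane")[:username]    # "jsmith"
```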

Pretty neat, huh?

*Note: All examples in this entry are in Ruby. Pick your poison; the principles apply to other languages.

Wednesday, March 26, 2008

The Weirdest Things Can Send You Sideways

There's an old cliche - "Expect the unexpected" - that should really define a good chunk of software engineering.

Everything's going great and then... your hard drive goes kaput.
Everything's going great and then... big snowstorm happens and the power goes out.
Everything's going great and then... you notice that another part of the system you happen to be testing just started behaving oddly.

For QA I'd like to tweak that a little and say "Embrace the unexpected". 

Finding the unexpected can short-circuit your schedule. It can change all your well-laid plans for the day. But it's also where you'll find your best bugs. One of the best ways to find good deep systemic bugs is to not look right at them.

So let's look at part of our arsenal of testing techniques*:
  • Basic Test Cases (Scripted). Useful for things like input validation. Generally these are things you're going to want to automate.
  • Exploratory Testing. Useful for discovering a system or a new feature of a system. Great for a complex system with a lot of different uses. Very highly dependent on having a good tester at the keyboard and takes a great deal of testing creativity. This isn't something for novices to try without additional guidance.
  • Session-Based Testing. Useful when you have a team of testers and need to manage it to a greater degree than exploratory testing. Test charters can provide some guidance as to what to test and (to a minimal extent) how. I've had good luck using this as an acceptance test with users (or other non-testers) involved. Difficult to set up correctly, but can be highly effective.
The beauty of embracing the unexpected is that it fits with each of these techniques, but it isn't a technique in itself. You're not following a script. You're not following a charter. You're not posing a question and getting an answer from the system. You're just... noticing.

So awesome. Use whatever technique(s) you like and are appropriate for the system. Just keep an eye out for the unexpected and learn to love the "that's weird" moments.

*Disclaimer: There are many many more testing techniques out there. In order to make this a blog entry and not a book I've limited this to the techniques I use most often.

Tuesday, March 25, 2008

Overheard on the Street

I know it's been a few days since I updated; I was in Chicago for a long weekend of fun, sun.... err... food and museums, anyway! We ate at Alinea and Charlie Trotter's, and generally had a grand time. Anyway, much as I love food, this is a testing blog, not a food blog.

So put this down to overheard on the street. We were walking down Wacker Dr headed vaguely toward Sears Tower, surrounded by 30+ story buildings plastered with bank names. And this was the conversation at lunch:
"AJAX callbacks will increase the load on your database"
"No way. We'll handle it all in cache."

I hadn't thought of Chicago as a tech center. But amusing nonetheless...

Oh, and just because we can't have a completely light post, check out Mike Bowler's Grouping of Agile Practices. It's an interesting read.

Thursday, March 20, 2008

Drown the New Guy

We just got a new lead for one of our teams. This team hadn't had a lead or even a dedicated staff in a while, and work on that component got done basically by begging/borrowing/stealing time from other teams. What had to get done got done, but it's not been ideal.

Now we have a new lead, and a couple of engineers going to join that team full time. First of all, this is great news. I have a lot of respect for the engineer who's taking on the lead position. Having the core of the team come from internal sources means the ramp-up time should be a lot shorter and less painful. So...

What's the best thing I can do to help this team be successful? I already know the worst thing I can do: pile on and drown the new guy (oops!). But how do I (and my team) do the most helpful thing?

Short answer: the best thing I can do right now is back off.

Longer answer: This team lead, and really his whole team, is in a precarious position. They have new responsibilities, a code base that could use some maintenance, a team to gel, hiring to do, etc. Oh, and everyone wants a piece of them.

Product management wants to talk about the backlog of work that hasn't been important enough before but now that we have a dedicated team we should really look at.

HR needs to help get hiring up to speed and start the flow of resumes.

The other dev leads need to take some time to transition off the old work.

QA? QA can wait. Sure, there's a lot we need - like to talk about the structural improvements we'd like to see to close out some of the testability gaps or the holes that tend to cause a lot of different bugs. But this team isn't going to make structural changes in the next week or so. Therefore, QA can wait. The bugs are still there, the testing is still there, but waiting another week won't kill anyone.

The moral of the story is that no matter how tempting, resist the urge to join the pile jumping on top of the new team. Sit back, be glad they're working on it, and let them find their feet before you get too involved.

Wednesday, March 19, 2008


I get a lot of information from the Internet. Blogs, Google, articles, etc. are all great sources. Sometimes I need more theoretical or in depth coverage of the topic I'm interested in, though. For example, I might want to refresh my ideas on how to divide functional testing, or pattern recognition and consequent test categorization. That's when I find myself reaching for a book.

After an exhaustive search, I have concluded that there are approximately eleventy-million books about software testing.* Many of them are quite good, and some of them really don't offer anything new. But there are two testing books that I couldn't live without. These are two books that I really consider classics:

The Art of Software Testing was first published in 1979, so it's something of an antique in the computer world. It was the first seminal work on software testing and it has aged well. I think in part because it was first, it covers testing from a lot of different angles and really provides a good grounding on what testing is and isn't. This is a book that swings from test management, to graph analysis for coverage, to test bias and how to recognize and overcome it. The biggest lack in the book is that some of the practical applications are done differently now, and there are newer types of testing that the book doesn't cover (TDD being a prominent example). This is a book I turn to for help with testing strategies. Note that I haven't spent much time with the second edition, so I wax rhapsodic about the first edition only!

Testing Extreme Programming is at the opposite end of the spectrum. This is a very tactical book. It spends a lot of time on the things that The Art of Software Testing doesn't cover, like test automation, test-driven development, and the specifics of object-oriented test development. However, there's not a lot of higher-order testing, debugging techniques, or explanation about why or in what situations tests should be developed in certain ways. This is the book I turn to for developer-accessible explanations and for specifics of test automation techniques.

*Disclaimer: I didn't actually count, but we'll just all agree that there are a lot.

Tuesday, March 18, 2008

Smaller Chunks

Estimating a task or project is difficult. It's easy to give a wild guess - "that'll take 3 months" and get within 30% or so. To get any more accurate, though, you really have to think about what you're doing, and that can be difficult for even well-known projects.

One of the QA engineers at work is learning estimation in general and this is the task he was working on:

Automatically put error messages in the test reports.

Some Background:
  • The test report already exists and is generated by an existing script.
  • There are differences in error message indicators (e.g., FAIL or ERROR or fatal) in each test suite.
  • All test suites and logs are located in a single folder (a subfolder per test suite)
The First Estimate:
"Gosh, I don't know. A couple of days?"
That officially falls into the category of guess.

Working Through It:
We don't know how long this task will take. So the simplest strategy I know applies here:

Make the task smaller until you understand it.

We'll break the task down into component pieces, and then break those pieces until we do understand it. Let's give it a shot. Here's what we have to do:
  1. Figure out how to grab the failure line
  2. Figure out how many lines around it we have to write in order to pick up the full failure
  3. Repeat for each test in the suite (note here that it should be the same for every test in the suite, so this is just a confirmation)
  4. Repeat for each suite (there are 30 of these)
Is that small enough? Nope. We still can't estimate step 1. So let's break it down further. To complete step 1 we have to:
  1. Build the path to the test suite log directory
  2. Build the path to the individual failing test log
  3. Open the log file and read a line (any line) out of it
  4. Print the line into the test report
  5. Change the logic to find the actual failure line and print that instead
  6. Start printing some number of lines before and after the failure line
  7. Make before or after or both a configurable parameter
  8. Make the number of lines to print a configurable parameter
Now that we can estimate. Granted, many of these tasks are only 5 minutes each, but going through this means that now we have an estimate we can be confident in. Lather, rinse, repeat for steps 2 through n. Now we've got a good estimate.
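As a sanity check that the pieces in that second list really are small, here's a rough sketch of what steps 1 through 8 might look like in Ruby. The directory layout, file names, and failure markers are all assumptions for illustration, not the real script:

```ruby
LOG_ROOT = "test_logs" # assumed layout: one subfolder per suite

# Steps 1-3: build the paths and read the failing test's log.
# Steps 5-8: find the failure line and return configurable context.
def failure_excerpt(suite, test, before: 2, after: 2,
                    markers: /FAIL|ERROR|fatal/)
  path  = File.join(LOG_ROOT, suite, "#{test}.log")
  lines = File.readlines(path)
  idx   = lines.index { |line| line =~ markers } or return nil
  first = [idx - before, 0].max
  lines[first..(idx + after)].join
end
```

Each sub-step (build a path, read a line, add a parameter) really is a five-minute change, which is exactly why the estimate becomes believable.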

So how do you get better at estimations? Keep breaking it down. Over time you'll be able to estimate larger pieces and it will go faster. But keep at it in the meantime and remember... this is hard stuff, but we can do it.

Monday, March 17, 2008

Welcome Home Computer!

It's amazing how much relief there is in getting a computer back from the shop.

The computer formerly known as "dramatic-exit" is back from the shop and ready to go. It did need a reinstall, but I can handle it. Thanks, Apple guys, for deciding that one for me, and thanks backup software for making this not a big deal.

I did change its name, though. I'm now calling it "daring-enterprise". In my head, its full name is "daring-enterprise-that-does-not-end-in-flames".

Now, back to setting this thing up!

Friday, March 14, 2008

Absolutist Elegance

We hear all the time about testers who break things...

"I broke login!"
"What'd you do, break it again?"
"I thought I fixed that! How'd you break it?"

But testers don't really break things. They merely expose breakages that are already there.

That sounds like a pretty good notion (I should note that I was inspired by a recent DevelopSense post, from which I lifted the theory). It really is shorthand for "don't blame the tester". It's an elegant way to say that this situation is not my fault.

I have a problem with that argument. The abdication of tester responsibility bothers me. Saying that the breakage is never the tester's fault seems too broad.

We've officially entered a grey area. And this is where context is everything. Sometimes it's truly a bug and in that case, the tester didn't cause the bug; he merely identified the bug. But sometimes the tester is doing things in a truly unsupported way, and in that case, she may have broken it.

When the tester didn't break it:
  • When it's really a bug.
  • When the tester performed an action that users could be expected to do
When the tester actually did break it:
  • When he incorrectly used an internal function unavailable to customers
  • When he checked in test code that broke the test (and the underlying product is fine)
  • When the tester performed an action that users are specifically warned not to do
So let's avoid the absolutist statements, no matter how good they sound. Even better, let's forget about fault and concentrate on fixing it.

* Disclaimer: I'm not really sure if "absolutist" is a word, but it gets the point across and, hey, English is a living language!

Thursday, March 13, 2008

Consistency Matters... Sometimes

Consistency is a hallmark of QA. Consistent behaviors are comforting, consistent results are good, and consistent steps to reproduce the problem so we can understand it are a goal of most bug reports.

But how much consistency is too much?

Consistent production environments are generally common. Consistent test environments are also common. Consistent development environments are often a goal. Consistent requirements generation environments are.... well, funny, that doesn't usually come up!

So let's walk through the development cycle (backwards). Assuming an enterprise deployment (not a consumer application):
  • Consistency in Production Environments: GOOD. Assuming you control this, more consistency is useful; the only changes you want are those you control, and usually fall into the configuration category.
  • Consistency in Test Environments: GOOD. These should match production with certain known exceptions (usually test tools like Wireshark).
  • Consistency in Development Environments: LIMITED. Good developers have a setup that works for them. Within limits it shouldn't matter if one person likes Eclipse and another strongly prefers TextMate. Consistency enforced here is generally for common tools (source control, unit test frameworks), consistent product (matching code styles) and for ease of working together. Beyond that, the product is more important than the environment in which it's created.
  • Consistency in Requirements Creation Environments: NO SUCH LUCK. This is one of those things that doesn't get asked about. As long as the product (spec, story, PRD, whatever) is consistent and correct, who cares how it gets there?
So don't enforce standards for their own sake. Figure out what you really absolutely need to be the same, and enforce that consistency. Beyond that, consistency is really rules for the sake of rules.

Be consistent enough to make your life easy, but not so much you make your life hard.

Wednesday, March 12, 2008

Email Is Not a Record

Every day at work, we do something called triage.* Basically, we go through the automated test run from the previous night and look at what failed and why. The output of this is a set of bugs (or comments on existing bugs) and a report. The report is an HTML page that we email out to a group of people and that gets archived into a specific shared folder. All this is well and good.

Someone asked me today whether a certain type of test had been failing in a certain way. And I went immediately to my email to look at the past several triage reports. In retrospect, that might have been the least efficient place to look! I should have looked at the defect tracking system, but I didn't.

Email as research tool:
  • Pros: Pretty good search (at least in Mail, which is my preferred email program); shows test suite status in addition to individual test errors.
  • Cons: Really, really easy to delete or lose stuff; only shows one day at a time
Archived HTML as research tool:
  • Pros: Simple to grep; shows test suite status in addition to individual test errors
  • Cons: Shows one day at a time
Defect Tracking System:
  • Pros: Shows multiple days of failures in each ticket
  • Cons: doesn't show test suite failures, only individual test failures

None of these is perfect, but using email as an archive is probably the worst of it. Reaching for my email as information repository is one bad habit I'm looking to break!

* Yes, I know that triage usually means deciding which bugs are important enough to fix in a given thing (release, iteration, time frame). It seemed easier to change my expectations than to try to rename something so entrenched in the culture, particularly since the name is not a big deal.

Tuesday, March 11, 2008

Find a Thread and Pull It

I've written several times about finding a problem when you're not looking right at it. Sometimes looking around the problem means looking at a different time. Of course we'd all like to find all the problems immediately, but it isn't going to happen. So let it rest, and when you get through what you're working on, well that's when you go looking for the subtle problems.

You find a thread, and you pull it.

In the midst of a release, those niggling little things that aren't quite problems show up. These are the things that are weird but not wrong. So it took a little longer for the upgrader to run, but hey, it completed. Okay, that's plausible. And it's not wrong, per se. But be sure to write it down.

Between releases you'd better come back to those things that were weird. A lot of the time they're just normal things - hey, your upgrader ran slow because you happened to run it on a machine with a slow disk drive. No problem. Sometimes, though, you pull that thread and you find a real deep subtle problem.

So take a deep breath.... and pull.

Monday, March 10, 2008

Prophetic Names

I have historically had no trouble with hardware. Over the years I have had many desktops, laptops, etc. I've never had a true hardware failure... until now.

I named my latest laptop "dramatic-exit".

This weekend dramatic-exit made a dramatic exit - the motherboard, or maybe the hard drive went kerflooey (that's a technical term!).

I'm naming my next computer "rock-solid".

Friday, March 7, 2008

Two Rules for High Pressure Situations

It's going to happen. That dreaded moment when the world falls apart. There's  a big bug. And now is NOT a good time to have a big bug. Now what?

The first rule is this:
Never let 'em see you sweat.

You're the QA engineer. You're supposed to know this system inside and out - what it does, what it ought to do, and how to work with it. In the end, QA knows the system as a whole better than almost anyone. So when there's a problem, in the end it falls to you to figure it out, or at least to figure out how to figure it out. Marshall development, support, whatever you need, but you're the go-to guy right now.

So in a moment where everyone around you is panicking because something has gone wrong and needs to be corrected yesterday, that is the moment you must not panic. Be calm, be level-headed, and think your way through it. Being calm and in control at this point greatly increases your chances of success.

You can do this.

The second rule is this:
Let 'em know you take it seriously.

It's possible to take calm too far. Sure, panicking is bad, but it's just as bad to be perceived as not taking the situation seriously. Making a joke of the problem or putting it below your other priorities - even only in other people's minds - will make them think you're not engaged. This only increases their panic. Be engaged, be on top of the situation at all times. Work the problem and work it publicly. Just don't lose your head.

You can do this, too.

Thursday, March 6, 2008

What Are You Trying to Measure?

I had a conversation yesterday to help plan our next release. Part of the conversation went something like this:

Other Guy: "So what are your performance measurements for this?"
Me: "Like speed? We'll take our usual performance suite."
Other Guy: "No, I mean for the release. Like bugs found. What else?"
Me: "Oh! Well, what are you trying to measure? What are you learning from this?"
Other Guy: "Umm...the metrics for the release? How many bugs got found, how many bugs got fixed."

The problem here wasn't that the other guy in the conversation didn't understand how to measure the success of a release. The problem was that we were talking about how to get the information he needed, not what the information he really needed was.

In this case, he wanted to know what the expectation of quality was for the software, something that would help his group understand the ultimate question: "Is this software ready for release?". Now THAT we can work with. And we came up with a way to measure that information (based around test plan completion percentage, a risk-based weighting of the open blocker bugs referenced to fix rate, and the trend of the find rate).
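As a very rough illustration of boiling those three signals down into something a release decision can use (the weights and numbers here are invented, not our real ones):

```ruby
# Hypothetical release-readiness snapshot: percent of the test plan
# complete, open blockers expressed as weeks of work at the current
# fix rate, and the direction of the weekly bug find rate.
def readiness(plan_done_pct, open_blockers, fix_rate_per_week, find_rates)
  { :plan_complete     => plan_done_pct,
    :weeks_of_blockers => open_blockers / fix_rate_per_week,
    :find_rate_trend   => find_rates.last - find_rates.first }
end

readiness(92, 6, 3.0, [14, 9, 5])
# A falling find rate (negative trend) and a small blocker backlog
# are what "ready" looks like here.
```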

Metrics is a particularly fraught area of testing (really, of everything). It's fairly easy to generate data. It's a lot harder to generate useful data. This is one case where you need to work from the result back. Figure out what you want to know, then figure out how to measure it.

Identify the what, then identify the how. It works for metrics, too.

Disclaimer: Yes, I know WHAT versus HOW is often a theme of this blog, but it's important, and in practice it's not always clear where the line between what and how really falls.

Wednesday, March 5, 2008

Development Metronome

Every development team has a heartbeat, a metronome that governs the rhythms of development.

Defying the invisible meter that governs development causes consternation and increases your chance of slippage. For example, we develop in two-week iterations - that's our basic time frame.

An Example of Conforming to the Meter
We count releases in number of iterations. Since we tag every iteration, a release is just a special tag (it gets a name!).

An Example of Defying the Meter
We have XP customer team meetings weekly. This is, among other things, a chance to rejigger the priorities. Usually priorities don't change much, but occasionally they do. Sometimes, this means we're changing priorities mid-iteration.

You mean that we're halfway through doing some of these and you want to change it? Enter inefficient processes. Now we have to:
  • Figure out how far through the current (now older) top priorities we are.
  • Put the new priority in place
  • Make everyone context switch.
Sometimes you have to do this, but be aware of the cost. It costs time, it costs code overhead, it costs code instability (a half implemented feature is incredibly unstable).

So when you set your development metronome, try to set everything else to it - your product management input to development, your test cycles, your releases, etc. Make exceptions just that.... exceptions.

Tuesday, March 4, 2008

Time-Based Event Modeling

When you're looking at an easily reproducible problem in-house, solving it is generally fairly straightforward. Simply keep tweaking logs, configurations, gdb sessions, etc. and reproducing the issue until you have the information you need to solve the problem. When the problem is on a system you can only use indirectly - say, a customer system - it gets a lot harder, and experimentation is often out of the question.
One of the techniques I like to use here is what I call Time-Based Event Modeling.
A Summary
In a nutshell, time-based event modeling is constructing a timeline of events, and continuing to construct more and more micromodels of events until you have a plausible handle on what the problem is. This technique assumes we have an "event" - a system crash, performance slowdown, inability to use some feature, etc. Typically, this is what the customer is calling you about, claiming to have found a bug. It further assumes this is something that will require some analysis; the problem isn't straightforward.
Background and Assumptions
Most systems work on patterns. The user does X, which triggers Y. There are also background processes and events - things that the system initiates on a regular basis without end-user intervention. Sometimes problems are caused by the interaction of those patterns and of those events. Doing one thing isn't an issue, and a background process isn't a problem, but when the two occur simultaneously then you get unexpected behavior.
Your goal then, with time-based event modeling, is to find the rhythm of those patterns and see how they interacted. Once you can see the rhythm of the system as a whole, and all its moving pieces, then you can see the break in the pattern and find the event.
So What Do We Do?
  1. Construct a timeline of user-visible events at the highest level. This is what your user knows, and it's a good place to start. Literally write this out with times and dates.
  2. Construct an underlying user-driven timeline. These are the things your system does that were caused by the user's actions. Put these on a timeline with the same granularity as the first one you made so you can see what the system is doing.
  3. Add other system patterns. These are cleanup processes, log file rolls, etc. Put these on a third timeline with the same granularity as the first two.
  4. Overlay the three timelines you made. Look at the cycles, and seek out intersections of patterns. How do those intersections relate to what the user sees?
  5. Work out what was special about the system at that time. Did something change? Is there some sequence of events that all intersect at the time of a problem? Did anything take an unusually long time with no obvious explanation?
  6. Look at what patterns failed. Look at the patterns around the time of the problem. Did those patterns repeat sometime when the problem did NOT occur?
  7. When you find circumstances unique to the problem area, start digging - that's where your problem is hiding.
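The overlay step can be sketched in a few lines of code: merge the three timelines into one chronological list, then look for events that cluster around the failure. All of the timestamps, components, and event names below are hypothetical, and a five-minute window is just an arbitrary starting point:

```python
from datetime import datetime, timedelta

def build_timeline(*event_lists):
    """Merge several (timestamp, source, description) lists into one
    chronologically sorted timeline."""
    merged = [event for events in event_lists for event in events]
    return sorted(merged, key=lambda event: event[0])

def events_near(timeline, failure_time, window=timedelta(minutes=5)):
    """Return events within `window` of the failure - candidates for
    the intersection of patterns that triggered it."""
    return [e for e in timeline if abs(e[0] - failure_time) <= window]

# Hypothetical events from the three timelines: user-visible,
# user-driven, and background system patterns.
user_visible = [(datetime(2008, 3, 4, 2, 17), "user", "upload fails")]
user_driven = [(datetime(2008, 3, 4, 2, 16), "system", "index rebuild starts")]
background = [(datetime(2008, 3, 4, 1, 0), "cron", "nightly cleanup"),
              (datetime(2008, 3, 4, 2, 15), "cron", "log rotation")]

timeline = build_timeline(user_visible, user_driven, background)
suspects = events_near(timeline, datetime(2008, 3, 4, 2, 17))
for when, source, what in suspects:
    print(when, source, what)
```

In this invented example the nightly cleanup drops out (it's more than five minutes away), leaving the log rotation and the index rebuild as the intersection to dig into.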

When to Use This Technique
This is a good technique when you have a non-reproducible set of conditions, or when you can't find a smoking gun. When you suspect a race condition, some over-time deterioration of the system, or the interaction of numerous components, consider using time-based event modeling.
When Not to Use This Technique
Don't bother with time-based event modeling if the problem is straightforward. For example, if the problem is "I can't burn this CD", and the user is trying to write to a non-writeable CD, then you don't need any complex techniques to figure out what the problem is.

Monday, March 3, 2008

Visual Indirection

There's a classic film technique called "visual indirection" in which the viewer sees not the action itself, but the reaction of other things to that action. Sounds a lot like debugging a customer problem, doesn't it?

You can't actually see the thing itself. You can only see the effects it causes on other things.

So what do we do?

The first and most direct effect is on your logs: have the program write out what it thinks it's doing. Sometimes you'll want this only at debug levels, but it can be a real help.
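A minimal sketch of that kind of narrating log, using Python's standard logging module. The component and function are invented for illustration; the point is that the code reports its inputs, its decisions, and its outputs, so the log shows what it thinks it's doing:

```python
import logging

logging.basicConfig(level=logging.DEBUG,
                    format="%(asctime)s %(levelname)s %(message)s")
log = logging.getLogger("orders")  # hypothetical component name

def apply_discount(order_total, discount_pct):
    """Hypothetical function that narrates its own behavior at debug level."""
    log.debug("apply_discount called: total=%.2f pct=%d",
              order_total, discount_pct)
    if discount_pct <= 0:
        log.debug("no discount applied")
        return order_total
    discounted = order_total * (1 - discount_pct / 100)
    log.debug("discount applied: new total=%.2f", discounted)
    return discounted

apply_discount(100.0, 10)
```

In production you'd leave the level at INFO or above and only turn DEBUG on when chasing a problem.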

The second effect is on downstream components: because A did something odd, then B, which follows, may also behave abnormally. This is where knowing your system flow is crucial - if you can catch B's odd behavior then you can trace it back to A. This kind of effect lends itself to the Five Whys debugging method.

The third effect is on the component itself: because A did something odd, the next time A occurs it will do something odd (either the same odd thing or something else). To help see these, find a pattern, then find a violation of the pattern. Once you have your violation, look not only at the violation itself but at the previous instance of that pattern.
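Finding a pattern violation can be as mechanical as scanning the intervals between recurring events. A sketch, assuming a heartbeat-style pattern with a known period (the timestamps and period are made up), that returns both the violation and the previous instance to examine:

```python
def find_violation(timestamps, expected_interval, tolerance=0.5):
    """Scan a periodic pattern (e.g. a heartbeat logged every N seconds)
    and return (previous_instance, violating_instance) for the first gap
    that deviates from the expected interval by more than `tolerance`."""
    for prev, curr in zip(timestamps, timestamps[1:]):
        if abs((curr - prev) - expected_interval) > tolerance:
            return prev, curr
    return None  # no violation found

# A heartbeat expected every 60 seconds, with one late beat.
beats = [0, 60, 120, 300, 360]
print(find_violation(beats, 60))  # gap of 180s between 120 and 300
```

Per the advice above, you'd then look not just at what happened around second 300, but at what happened around second 120, the last time the pattern held.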

The fourth effect is on concomitant components: because A did something odd, unrelated component C did something unexpected. Race conditions and deadlocks are classic examples of this. These are particularly difficult to pin down, and require you to model your system through time (a decidedly nontrivial task in an uncontrolled environment).
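As a toy illustration of why these are so hard to pin down, here is a deterministic simulation of the classic lost-update race. Real races depend on scheduling you don't control; this sketch makes the interleaving explicit so you can see how two components that each look correct in isolation misbehave together:

```python
def run_schedule(schedule):
    """Run two read-modify-write workers under an explicit interleaving.
    `schedule` names which worker acts at each step; each worker reads
    the shared counter on its first turn and writes back read+1 on its
    second turn."""
    counter = 0
    local = {}
    for worker in schedule:
        if worker not in local:
            local[worker] = counter      # read shared state
        else:
            counter = local[worker] + 1  # write back a possibly stale value
    return counter

print(run_schedule(["A", "A", "B", "B"]))  # serial: both updates land -> 2
print(run_schedule(["A", "B", "A", "B"]))  # interleaved: one update lost -> 1
```

Neither worker did anything wrong by its own lights; only the timing of the two together produced the bad result, which is exactly why modeling the system through time matters.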

So when you're debugging, don't forget that you aren't looking at the problem, you're looking at the effects of the problem. Look around and you'll find your smoking gun.