Wednesday, December 31, 2008

Perl SOAP Interface to Jira

For reasons that are lost in the mists of time (read: decisions made before I worked here), my company uses a lot of Object-Oriented Perl, particularly for our test infrastructure. One of the little corners of this infrastructure is a utility we call "fetchlogs". It takes some input, optionally creates or updates a bug in the defect tracking system, copies all the test logs from various machines to the appropriate directory, and puts a link to those logs into the bug.

Long story short, I wanted to update fetchlogs to create bugs in Jira instead of our current defect tracking system. I had a hard time with the documentation, so I thought I'd share. Here's the perl module that actually does the work:


##
# Perl interface to Jira defect tracking system
#
# @synopsis
#
# use Jira;
# $jira = Jira->new();
# $jira->addComment(issue   => "QA-1",
#                   comment => "this is a comment");
#
# @description
#
# C<Jira> provides an object-oriented interface to the
# defect tracking system (Jira). It can be used to create issues,
# add comments, etc.
#
# $Id: $
##
package Jira;

use FindBin;

use strict;
use warnings;
use Carp;
use Data::Dumper;
use SOAP::Lite;
use Storable qw(dclone);

##
# @paramList{new}
my %properties = (
  # @ple The Jira host.
  dhost      => $ENV{PRSVP_HOST} || "jira",
  # @ple Port of the Jira Daemon
  dport      => 8080,
  # @ple Should this print the result of requests to STDOUT?
  verbose    => 0,
  # @ple Jira user
  jiraUser   => "user",
  # @ple Jira password
  jiraPasswd => "password",
);
##


######################################################################
# Creates a C<Jira> object.
#
# @params{new}
##
sub new {
  my $invocant = shift;
  my $class = ref($invocant) || $invocant;

  my $self = {
    # Clone %properties so original isn't modified
    %{ dclone(\%properties) },
    @_,
  };
  return bless $self, $class;
}

######################################################################
# Issue a request to the Jira server and get a response.
#
# @param cmd The name of the command being sent
# @param checkResult Whether or not to check the result of the command
# @param params The parameters of that command
#
# @return result of the command
##
sub _request {
  my ($self, $cmd, $checkResult, $params) = @_;
  my $soap = SOAP::Lite->proxy("http://$self->{dhost}:$self->{dport}/rpc/soap/jirasoapservice-v2?wsdl");
  my $auth = $soap->login($self->{jiraUser}, $self->{jiraPasswd});
  my $doThis;
  if ($cmd eq "addComment") {
    $doThis = $soap->$cmd($auth->result(), $params->{'issue'}, $params->{'comment_obj'});
  } elsif ($cmd eq "getComponents") {
    $doThis = $soap->$cmd($auth->result(), $params->{'project'});
  } elsif ($cmd eq "createIssue") {
    $doThis = $soap->$cmd($auth->result(), $params->{'issueDef'});
  }
  if ($doThis->faultcode) { # whoops, something went wrong
    croak("Error running command: $cmd\nGot: " . $doThis->faultstring);
  }
  return $doThis;
}

######################################################################
# Add a comment to an existing issue.
#
# @param params{issue} The id of the issue to add (e.g., QA-1)
# @param params{comment} The comment to add (text only)
##
sub addComment {
  my ($self, %params) = @_;
  $params{issue}   ||= '';
  $params{comment} ||= '';
  my %issue;
  $issue{issue} = $params{issue};
  $issue{comment_obj} ||= SOAP::Data->type('RemoteComment' => {'body' => $params{comment}});
  my $result = $self->_request('addComment', 1, \%issue);
  return $result;
}

######################################################################
# Create an issue
#
# @param params{project} The Jira project name (e.g., QA)
# @param params{component} The component of the project (e.g., "Tools")
# @param params{summary} The title or summary of the issue in text
# @param params{description} A more verbose description of the issue.
# @param params{reporter} The Jira username who is reporting the issue.
# @param params{assignee} The Jira username who is assigned to the issue.
# @param params{priority} The priority of the issue (1-5, 1 is highest).
##
sub createIssue {
  my ($self, %params) = @_;
  $params{project}     ||= '';
  $params{component}   ||= '';
  $params{summary}     ||= '';
  $params{description} ||= '';
  $params{reporter}    ||= 'user';
  $params{assignee}    ||= 'user';
  $params{priority}    ||= '4'; # Default to "minor"

  my %issue;
  my $components = $self->getComponentList(project => $params{project});
  my @compList;
  foreach my $component (@{$components->result()}) {
    if ($component->{'name'} eq $params{component}) {
      $component->{'id'} = SOAP::Data->type(string => $component->{'id'});
      push(@compList, $component);
      last;
    }
  }
  my $issueDef = {
    assignee    => SOAP::Data->type(string => $params{assignee}),
    reporter    => SOAP::Data->type(string => $params{reporter}),
    summary     => SOAP::Data->type(string => $params{summary}),
    description => SOAP::Data->type(string => $params{description}),
    priority    => SOAP::Data->type(string => $params{priority}),
    type        => SOAP::Data->type(string => 1),
    project     => SOAP::Data->type(string => $params{project}),
    components  => SOAP::Data->type('impl:ArrayOf_tns1_RemoteComponent' => \@compList),
  };
  $issue{issueDef} = $issueDef;
  my $issue = $self->_request('createIssue', 1, \%issue);
  my $issueNum = $issue->result()->{'key'};
  return $issueNum;
}

######################################################################
# Given a project, get the components in that project
#
# @param params{project} The Jira project name (e.g., QA)
##
sub getComponentList {
  my ($self, %params) = @_;
  $params{project} ||= '';
  my %issue;
  $issue{project} ||= SOAP::Data->type(string => $params{project});
  my $componentList = $self->_request('getComponents', 1, \%issue);
  return $componentList;
}

1;
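
To give a sense of how it gets called, here's a rough usage sketch; the host, credentials, project, component, and text values below are placeholders, not our real ones:

use Jira;

# Hypothetical caller; every value here is a placeholder.
my $jira = Jira->new(dhost      => "jira.example.com",
                     jiraUser   => "qa-robot",
                     jiraPasswd => "secret");

my $issueNum = $jira->createIssue(project     => "QA",
                                  component   => "Tools",
                                  summary     => "nightly foo test failed",
                                  description => "Logs are linked below.",
                                  priority    => "3");
print "Created $issueNum\n";   # e.g. "QA-123"

$jira->addComment(issue   => $issueNum,
                  comment => "Logs copied to the archive server.");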

There are a number of gotchas here:
  • I don't have an elegant way to handle the different parameters expected for each command, so my _request routine has a big case statement in it. Not pretty, but functional.
  • It took me a shockingly long time to figure out that the issue number was returned as $issue->result()->{'key'}, but it is. You can get at other parameters this way, too (for example $issue->result()->{'summary'}).
  • In createIssue I tried just passing in the elements rather than using issueDef, but I wound up with all sorts of null pointer exceptions.
  • From here it should (emphasis on "should") just be a matter of lather-rinse-repeat to add other calls for updateIssue, closeIssue, etc.
Many thanks to Google and about 20 posters in various forums for providing clues on how to get this to work. Here's hoping it helps someone else stuck in a similar boat.



EDIT 01/13/2009:
We've added a way to get version numbers.

Here's the method (note that _request also needs a "getVersions" branch in its case statement, shaped just like the getComponents one):

######################################################################
# Given a project, get all the versions in that project
#
# @param params{project} The Jira project name (e.g., QA)
##
sub _getVersionList {
  my ($self, %params) = assertMinArgs(1, @_);  # assertMinArgs comes from our test infrastructure, not this module
  $params{project} ||= '';
  my %issue;
  $issue{project} ||= SOAP::Data->type(string => $params{project});
  my $versionList = $self->_request('getVersions', 1, \%issue);
  foreach my $version (@{$versionList->result()}) {
    print "$version->{'id'} $version->{'name'}\n";
  }
  return $versionList;
}

######################################################################
# Given a project and a version, get the ID of that version
#
# @param params{project} The Jira project name (e.g., "QA")
# @param params{version} The version of the project (e.g., "4.2.1")
##
sub _getVersionId {
  my ($self, %params) = assertMinArgs(2, @_);
  $params{project} ||= '';
  $params{version} ||= '';
  my $versions = $self->_getVersionList(project => $params{project});
  my @versionList;
  foreach my $version (@{$versions->result()}) {
    if ($version->{'name'} eq $params{version}) {
      $version->{'id'} = SOAP::Data->type(string => $version->{'id'});
      push(@versionList, {'id' => $version->{'id'}});
      last;
    }
  }
  print "Version? -> $versionList[0]\n";
  print Dumper @versionList;
  my $numVersions = scalar @versionList;
  print "We found $numVersions elements\n";
  @versionList = SOAP::Data->type('Array' => @versionList)
                           ->attr({'soapenc:arrayType' => "ns1.RemoteVersion[$numVersions]"});
  return @versionList;
}


And here's the call:

my @versionList = $self->_getVersionId(project => $params{project},
                                       version => $params{version});

Then just do "affectsVersions=>\@versionList" wherever you need it.

Tuesday, December 30, 2008

Learning How?

I wrote yesterday about lists - how I keep 'em, where I keep 'em, where I wish I could keep 'em. And immediately I got a couple of comments suggesting mind maps. Glenn in particular mentioned that he used mind maps both for a "to do" list of sorts and also to help him assimilate new information.

Which got me to thinking.... there's a lot of new information out there. So how do I learn? And how do I help my team members learn?

Diagram
This can be a flow diagram, a state diagram, a deployment diagram, whatever. It's a picture, and it's a picture of how the system works. I learn best if a diagram is constructed in front of me - that way I can follow along as it evolves. In my head, I can actually see the bits move through the system like I'm watching a movie (literally, in my head an HL7 message comes in a cute little envelope in kind of an off-white stationery).

One guy on my team uses diagrams, too. He goes about it differently, though. He likes the whole diagram at once, and then he just... assimilates it. I have no idea what's going through his head.

Analogy
Okay, this one my team laughs at me for. I love using analogies, particularly when I'm dealing with people who aren't deep in the system. For example, if I have to explain a technical concept (say, why a system turns a node off when it detects multiple correctable memory errors) to a sales guy, I'll use an analogy. I could explain what a correctable memory error is, and then cite the research showing that correctable memory errors are followed by uncorrectable memory errors X% of the time, and then mention that an uncorrectable memory error may indicate data loss.... (watch the eyes glaze over). Or I could simply say that the uncorrectable memory error is like the check engine light in your car. Your engine hasn't failed, but that check engine light is an indicator that something's failing, and it's better to get it checked out than to simply keep going and wind up broken down at the side of the road. The correctable memory error is our check engine light, and turning the node off is our system's way of getting it to our mechanic - the support team.

I don't know if analogies actually help other people as much as they do me, but I find it a useful way to relate what's going on in a system to something that people already understand. It misses some details but can give a good idea of the basics of what's going on.

Verbal Explanation
Okay, I'll say straight up that this just doesn't do it for me. But some people on my team really get stuff when we talk through it. For this to work, though, feedback and repetition throughout are important. Don't go into a 10 minute spiel. Go through 2 minutes of explanation, then get confirmation from whoever you're talking to that it sunk in. Get that person to repeat the information back in his own words.

Written Explanation
This style works better for me than talking through something. I like it list-like, though. Here's idea A. We get here from B or from E, and from here we could go to C or D or F, or we could repeat A. This written explanation also has a major advantage - it can be passed along indefinitely without having to have the one-on-one contact of teacher and learner.

Do It With Tutor
This is what pairing really promises to provide. Let's not talk about it, let's do it. The gotcha with this one is that I've only ever seen it work if the person doing the learning is the person driving the keyboard (or doing the action). Otherwise it's way too easy to gloss over a step.

Hands down, this is the best tool for learning to do things that I know. It obviously works less well for abstract concepts, but it's great for the "how do I replace a hard drive?" or "how do I create an SSH tunnel?" problems.

In the end, how you learn and how you teach are two things that are going to be driven by whoever's doing the learning. Having lots of different ways to approach a problem, though, and being able to switch between methods only increases your chance of success.

Monday, December 29, 2008

Making Lists

I'm a list maker. Particularly when I'm working on a project that has a lot of dependencies, I find they come in at all different times. And I need a list to tell me what I can and can't do at any given point. Otherwise I spend a lot of time looking to see if I can do something instead of just doing the things I can do.

It's really simple: 
- task (need ___ from ___)
- task (need ___ from ___)
- task

Lists are my hero.

The problem I have is that I haven't found a good way to handle lists. I've tried sooo many list programs:
- a text file
- Dashboard stickies
- OmniFocus
- OneNote (on Windows, of course)
- Mac Mail tasks
- Campfire and variants
- Who knows... many more.

So what do you other list keepers use?


ETA: I should point out that I consider the defect tracking system one of my lists! I just need the status information...

Friday, December 26, 2008

Hashes In Shell

I've been working on a shell script recently. Its purpose isn't important, really. What is important at this point is that I found myself with a need for a hash map. In shell.

The best way to do a hash in shell is to not do a hash in shell.

I moved to Ruby, but you can move to whatever you like. Either way, you have two choices in this situation: (1) simplify your implementation, or (2) change your language.
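
For what it's worth, all I needed was a plain lookup table, which is a one-liner in most scripting languages. Here's a made-up example in Perl (I went to Ruby, but the shape is the same in any language with real hashes):

# A made-up lookup table: host name -> role. Keys and values are invented.
my %role_for = (
  "portal-02" => "portal",
  "qa-lab-07" => "load generator",
);

my $host = "portal-02";
print "$host is a $role_for{$host}\n" if exists $role_for{$host};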

This is one of those times when working with polyglots helps.

Wednesday, December 24, 2008

Pause, Then Change

There's a lot to be said for dreaming big. There's a lot to be said for daring to imagine a complete system. However, all the dreaming and daring in the world won't help if your vision keeps changing beyond where you've gotten.

Vision is good, and changing vision is reality. But make sure you bring your work to a logical stopping spot along the way.  Here's the thing - if you change your vision and immediately start on it, you'll never get anywhere. Instead, change your vision, get to a usable stopping spot, and then start working on the new idea.

Please note that I'm not talking about technical failure. Technical failure implies that whatever you're doing will never work. This is vision change, which means that doing some still has value. So provide yourself value, even as your dream grows.

Tuesday, December 23, 2008

Lessons From the Snow

We got some snow here this weekend, and then it turned cold. Inevitably, people wound up parking on an ice... errr... snow bank. I passed a guy yesterday trying to get out of his spot...


... and he just kept gunning it (the cursing was pretty salty, too!).

He dug himself deeper and deeper and just wouldn't stop gunning the car. He'd stopped making any progress and was just sitting in place expending a lot of effort - all because he was in a hurry and he was getting frustrated.

And it occurred to me that when we test, we have to be very careful not to turn into that guy. Once you've got your system into a bad state and found a problem, continuing isn't going to help. Get yourself in trouble - that's part of our job. But then don't keep doing the same thing over and over. If you want to accomplish something, change your approach.

Otherwise you're just digging a hole in the snow.

Monday, December 22, 2008

Un-Special

One of the issues I hear about over and over in engineering is specialization. Generally, this falls into the category of "specialization is bad". The idea is that a team is rather like a set of Lego blocks - when you're building a wall you can put just about any Lego in and make some progress. Eventually we'll get our wall.


Now, the theory goes that we can do this because we're generalists. After all, we're all pretty good engineers, right? We can pick up some code, understand it, and modify it to suit our changing needs. We can make new code that fits into the system. We can all code, we can all test, we can all manipulate a database! We are that good!


Except we're really not. Despite the best efforts of Gantt-chart wizards and velocity measurers, software development doesn't have a defined set of therbligs. Instead, we are all self-selecting specialists. We each have our own way of understanding a component of our system and working with it. There are some commonalities, but we're not interchangeable yet.

However convenient it may be for managers to have interchangeable resource units (that's Gantt-speak for "humans"), let's remember that's not the goal. The goal is to produce good software. To do that I don't need individual human generalists. I need a generalist team - one full of people that are good at most things and extremely good at some things. From there it's up to the team to make sure that tasks are balanced to finish on time - whether people accomplish that by doing things they're really good at, doing things they're pretty good at, or learning new skills.

So don't ask for generalization. Ask for what you really want - software. Good software. Good software when you say you're going to have it.



* Disclaimer: I don't want to point fingers at any methodology. This generalism theory is so remarkably common that it wouldn't be fair to single out any one methodology.

Friday, December 19, 2008

I Believe

I believe....
  • In public praise and private blame.
  • That we succeed or fail as a team, and that's all the world needs to know, not who did or didn't do what.
  • In freedom of action coupled with documentation.
  • That nothing is truly done until you've shown someone else how to do it.
  • That how I ask a question is at least as important as what question I ask.
  • That everyone on the team - including me - has things to teach and things to learn, and that only in both teaching and learning can lasting satisfaction be achieved.

Thursday, December 18, 2008

Whoops!

Everybody screws up. I screw up. You screw up. Even the Queen of England screws up. Shoot.

When someone who works for you screws up, there are two things that can happen. Either the employee realizes he screwed up, or he doesn't. Either way, your job as a manager is to notice the screw up and correct it, either by doing something or - and this is sometimes harder - by doing nothing.

Depending on the person making the mistake, there are a range of possible behaviors and things you have to do about the situation.

Oblivious
Sometimes the person has no idea he's screwed up. Seriously no clue that a mistake was made at all or that he caused it.
Reaction: If you don't point out the mistake, he'll repeat it eventually. However, what you really want is for the person to start noticing his own mistakes. So don't point it out directly; instead lead him down a path so that he comes to his own realization that he made a mistake.
Sample Phrase: "I noticed that you didn't clean up the machines when you were done with the test. Were you still using them?"

Hide and Seek
Sometimes the person recognizes that there was a mistake and tries to cover it up.
Reaction: If this person hides the little mistakes, the big ones will only blow up on you later. You have to point out the mistake, yes, but the more important aspect is making sure the person understands and - and this is important - feels safe enough to acknowledge mistakes. Be extremely gentle with the mistake itself, at least the first couple of times.
Sample Phrase: "I noticed that the memory tester was broken in half under the keyboard in the QA lab. We can get it fixed, but you need to come tell me that it's broken so we don't have to go hunting for it."

Penitent
Sometimes nothing you can say is worse than what the person who made the mistake is already thinking. Left alone, this person will beat himself up for every screw up for longer than you might.
Reaction: Punishment is already happening; your reaction needs to be about acknowledgement, correction, and speedy moving on.
Sample Phrase: "Yup, you that bug is a duplicate. Let's figure out how we can make searching the defect tracking system easier in this case. [do so] Okay, now we've improved our process. What's next?"

Self Corrector
Sometimes the person acknowledges the mistake, corrects it, and moves on to the next thing.
Reaction: This person screwed up and dealt with it. Anything you do will prolong the incident unnecessarily.
Sample Phrase: None.

When you're working hard, and you're tackling challenges, mistakes will happen along with triumphs. Making mistakes okay - both to make them and to deal with them - is just one of the things you need to do to keep your team taking risks and getting better at what they do.

Wednesday, December 17, 2008

Found In Fixed In

One of the standard items in bugs is the "found in" field. This indicates all releases that you know of that have the bug. Another common item, although slightly less standard, is the "fixed in" field. This indicates all releases or all branches that contain the fix.

First level of smarts.... use releases
Found in and fixed in are fields that are typically not updated that much. You set "found in" when you file the bug, update it (maybe) when you've diagnosed the bug, and add "fixed in" when you verify the fix. The biggest use of these fields comes after the bug is closed. This gets looked at when, for example, a customer issue comes up and someone wants to know if a bug existed in a given release. What you really care about capturing easily is what releases the bug can be found in, and what releases a bug is fixed in. In many development environments, this translates to branches (with one or a few releases per branch). So don't worry about build numbers - those can go in comments. Worry instead about releases.

Second level of smarts..... multi-select
In some defect tracking systems (RT, for example), found in and fixed in are single-select fields. This is wrong. You can certainly have a bug that is found in more than one branch of code. And you can have a bug that is fixed in more than one branch of code. For example, maybe a bug was fixed on head and the most recent release branch. The bug should note every release that is known to contain the problem and every release that is known to fix the problem (note that "head" means "all go forward releases branched and taken from here").

Third level of smarts... different resolution types
This is the level that I have yet to see a defect tracking system actually support. When you're working with a bug, you may find it in the 3.5 branch, the 3.7 branch and head. And then you may decide to fix it on head and the 3.7 branch, but not to fix it on the 3.5 branch since you're not intending to release 3.5 again. How do you close the bug? Is it fixed? Well, yes. Is it deliberately not fixed? Well, yes. Or did someone just screw up and accidentally not mark the "fixed in 3.5"? In this case, no, but we're all human and that's going to happen some day.

You basically need one resolution type per "found in". Think something like this:
Found In    Resolution
3.5         Won't Fix
3.7         Fixed
head        Fixed
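
Until a tracker supports that natively, the closest I can get is to record the mapping myself and dump it into a closing comment. A rough Perl sketch of the idea, using the releases from the table above:

# One resolution per "found in" release, as a plain map.
my %resolution_for = (
  '3.5'  => "Won't Fix",   # not shipping 3.5 again
  '3.7'  => "Fixed",
  'head' => "Fixed",
);

# Flatten it into a closing comment.
my $closingComment = join("\n",
                          map { "$_: $resolution_for{$_}" } sort keys %resolution_for);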

Anyone know of a defect tracking system that does this? Or is it good enough to just put this information in the comments in case you need it later?

Tuesday, December 16, 2008

Round 2: Fight!

Everything you do is a compromise. Some of these are obvious:

"Customer Foo has a big problem, and we're stopping everything to fix it!"

"Well, you asked for more X and you want faster Y, but we only have two more weeks before we cut the release. So which one do you want for this release?"

Others are not so obvious:

"Here are two faster processors. Does that help our performance?"
(HINT: Does that help our performance doing what? There are a lot of options for what to try first here.)

"Did you test this yet?"
(HINT: This means "I really hope you hit this one and decided that it was more important than the other thing you were going to test".)

It's not that you can do only one thing at a time; it's that your total capacity doesn't grow just because your customers (sales, support, product managers, your own itches) have desires. In general this is an accepted thing in software development: 

resources x time x productivity = output

The trick is that sometimes the conflicts get lost. If a customer asks for something one day, and the next day someone else asks for something, it can get hard to recognize that you're approaching (or passing) the tradeoff point.

It's all too easy to get into a game where the most recent request is a bully and pushes the other requests to the bottom of the queue.


Bad bully!

Only with transparency can you truly make the tradeoffs. Scrum solves this with the backlog. XP solves this with 3x5 cards. The trick is helping the team - all your customers - understand what they're asking for. A request is asking that you do something, and also that you not do something else. Understand that you're committing to both parts of that when you say yes to a request. With that acknowledgement you can work toward balance, and get some of everything you want.

If you're really careful, everything will work well not only on its own, but together to satisfy all your customers.

Happy compromising!

Wednesday, December 10, 2008

Candidate Woes

We got a resume for a candidate today, and the recruiter's spiel included the following statement:

[The candidate] "prides himself on doing what his bosses tell him to do."


Ouch.

Compliance with directives is really not a selling point.

Tuesday, December 9, 2008

Knowing "Normal"

One of the things I preach on is spending a lot of time looking at the system, particularly through the logs. Depending on your system, substitute something appropriate for "logs" - reports, database, network traces. The point is that you need to be looking under the covers.

Why do we look under the covers?
Immediate problems are the kind where you try something and it fails. For example, maybe you shouldn't have permission to delete a file but you can delete it anyway. These are generally fairly easy. It's really just a matter of identifying the relevant system state and then performing the failing action.

Then there are indirect problems (I've also heard them called second-order issues). These are the things that go wrong only because some other thing or things happened. So you can be in the same state multiple times and depending on what else has happened, your issue may or may not reproduce. It depends on whether you've triggered a time bomb or not. Figuring out what else has happened is the trick here, and that almost always requires that you look under the covers. Your answer, or at least the path to your answer, is in the logs (or network traces, or whatever).

Great! What are we looking for?
Here we're looking for one of two things:
  • A pattern. The pattern might be subtle - some sequence of actions or states - or not, but keep in mind that not all system actions are user actions.
  • Anything that isn't normal. Something out of the ordinary happening - garbage collection earlier than it usually does, or maybe a database index not present when it usually is - can be a big clue to the problem. These are often subtle, but tracking them down is usually worthwhile.
"Normal"?
Yes, normal. If I'm looking for things that are not normal, I have to know what normal is for my system. I'm not going to notice that a log rolled 5 minutes early if I don't know how often logs usually roll. I'm not going to notice that my checkpoint was late if I don't know they happen every hour at 1 minute 28 seconds after the hour. I'm looking for deviations from normal, and I'm only going to know what the deviations are if I know what normal really is.

This is where your notes come in. Remember how we talked about noticing with purpose? Here's where you haul out those notes you took and start looking for a pattern. We're getting under the covers and looking to find out what patterns exist in our system and what patterns matter.

The single best way I know to start to see these patterns is simply to live in the logs (or network traces, or whatever) for a while. And then talk about it:
  •  Explain what's going on in a section of the log to yourself or to someone else. Do it out loud so you can't gloss over something.
  • Follow a thread or process and figure out its periodicity. Repeat with another thread or process.
  • Do an action and describe everything that happens with that action. Do this one out loud, too.
Eventually you'll start to understand the system as it typically behaves. Once you have a feeling for this "normal", then and only then can you see deviance from pattern. So find your normal, and then go find your issues.

Monday, December 8, 2008

Selenium Grid Update - Handling Dependencies

I set up a Selenium Grid configuration a while ago. It pretty much ran stuff, pointing back to the central grid computer for its server. This was nice, but boy the dependencies!

To run the test, look at all the things we had to do:
  1. update the source code
  2. start the server (script/server)
  3. start the hub
  4. start each of the remote controls and let them register with the hub
  5. finally! run the tests
This weekend I worked on scripting those dependencies (hey, the less I have to mess up, the better).  Here's what I wrote:
  • an init script for selenium-hub that starts the Ruby server and the hub
  • a new environment for the grid (not strictly necessary, but it gave me a single place for the third-party URLs that vary by environment)
  • a rake task to actually do the work of updating source code, etc.
So how does it all work now?
  1. Log in to the hub box
  2. set RAILS_ENV to the environment I'm interested in
  3. call the rake task
  4. Wait until it prompts me
  5. Start the remote controls
  6. Press enter and let the tests run
The trick is to make rake do all your prep work for you. Here's the interesting part of my task:


unless ENV.include?("platform") && @platforms.has_key?(ENV['platform'])
raise "Command line parameters incorrect! \nusage: rake test:run_browsers platform=[win|osx]"
else
platform = ENV['platform']
case platform
when 'win' then
browsers = @win_browsers
when 'osx' then
browsers = @osx_browsers
end

# Write out our environment
puts "Our environment is #{ENV['RAILS_ENV']}"

# Set up our db
puts "Migrating database and loading fixtures"
begin
Rake::Task["db:drop"].invoke
rescue
"Database does not exist"
end
Rake::Task["db:create"].invoke
Rake::Task["db:migrate"].invoke
Rake::Task["db:fixtures:load"].invoke

# Restart the environment
prepEnvironment(browsers)

# Run tests
puts "Running tests for the #{@platforms[platform]} platform"
browsers.each do |browser|
puts "Running tests in the #{browser} browser"
ENV["Browser"] = browser
ENV["TEST_DIR"] = 'test/selenium'
puts "RUN TESTS HERE"
Rake::Task["test:selenium"].invoke #run all the selenium tests
end

def prepEnvironment(browsers)
  puts "Restarting Selenium Grid Hub and Ruby server"
  system("/etc/init.d/selenium-hub restart")
  sleep 5
  print "\n"
  print "\n"
  print "\n"
  print "Go start your Remote Controls. Then press enter to continue."
  continue = STDIN.gets

  host = "localhost"
  port = "4444"
  path = "/console"
  data = checkSeleniumHub(host, port, path) # Hub is running on port 4444
  checkNeededRCs(data, browsers)            # All needed RCs for the platform are running
  puts "Remote controls are running. Proceeding to test."
end

All I really did was take the rake task and make it do its own preparation. Saves time, and makes my tests run a lot more consistently. Hope it helps anyone else who is trying to get Selenium Grid running without human intervention.


Oh, and my favorite part of this is the STDIN.gets call (translation: wait for the human to go do the part she hasn't automated yet!).

Friday, December 5, 2008

Lots of Little Things

When you've been working for a while, you build up quite a crop of tools, utilities, scripts, etc. These are the little helpers that make things easier. Maybe one grabs all the logs from all the machines you were using and puts them in a central location. Maybe another creates a ticket from an automated test. A third might give you a list of all failures in last night's test run. A fourth might move a machine into the "I'm broken please fix me" queue so no other tests can grab it and fail.

All the little scripts are great, but it gets really easy to make a mess with them. Let's say I have a scenario:
I want to take all failures from last night's run, grab their logs, create tickets, and move the machines they were using into the "I'm broken" queue.

That's just a compendium of my little tools - awesome! Should be easy to write a script that ties each of these together...

And wind up with a mess.



All those utilities and scripts can be combined, sure, but unless you're really careful how you do it, you're going to wind up with layers of flakey, kinda crufty scripts that barely hang together.

Think of it like painting a room. You can paint one wall (call it your "accent wall" - and for the record this is one of those things I find kind of odd, but it's an analogy so we'll go with it). Then you can paint the rest of the walls a different color. Then you paint the trim. And then, well, maybe the accent wall would look better the same color as the other walls, so you paint it. And before you know it you've got a lot of thick, cracking, probably peeling paint. Not good.

The proper way to paint a room includes stripping the old paint before you put on the new. And the proper way to write code includes refactoring to accommodate new needs rather than just adding more. This applies to your quick little tools, as well.

No matter how small your script is, if you're modifying it, consider refactoring it. It'll keep your tools useful for a lot longer.

Thursday, December 4, 2008

From Here To Should

Check out this Motivation In the Workplace report. Basically, it attempts to break down people in the workplace by their types - "mother hen", "joker", "dude", "realist" etc. Then it tries to show how those types may react and should react when things aren't going well.

Grouping people into types is pretty common. You can call it "joker", or INTJ, or a blue parachute. This particular grouping is cute but not the point. What I found interesting was the assertion about how behavior is likely to change as things get tougher, and how behavior should change. Their claim is that people will moderate their behavior as things get tougher - the joker will make fewer wisecracks, the realist will get more pessimistic, etc. Further, they assert that this moderation will only make people more nervous; instead the joker should keep joking, the realist should keep pointing out both risks and opportunities, etc.

There's one thing that immediately leaps to my mind: it's really easy to say "we should". It's often a lot harder to actually do it. And the report offers no ideas for how to achieve the things we should be doing (presumably the longer private report does).

Knowing what you should do is only the first step. It's no good without knowing how you are going to achieve it.

Wednesday, December 3, 2008

Negative Acceptance

This is the story of a bug, a hack, and a fix.

We start with the bug. It's the kind of bug that gives you nightmares: nasty effects, just reproducible enough to be hard to track down, any fix is going to be pretty large and quite risky.

And then the hunt. Several weeks of finding it, missing it, finding it again, and narrowing it down and down. Finally, QA gets it. Granted, it takes two days to reproduce, but it reproduces every time.

By now we're getting close to the scheduled release date. There are a couple of ways to fix this, none of them nice - a new kernel, or writing a new NFS component.

And then... the hack. It's not pretty, no one likes it, but it's fast to implement, isolated, and doesn't change major components of the system. So management says, "let it be so" and the hack goes in. It works just fine. The problem is successfully worked around.

And then the fix. Better architecturally, more elegant, a much larger change, but The Right Thing To Do.

But wait.... a joke! Someone says, "We're going to forget to take out the hack!" General laughter, but what if we really do forget?

This is where we use negative acceptance criteria. Acceptance criteria are usually affirmative - "user can log in", or a lot of tests around writing files with certain ACLs. But you can also accept the absence or negative of something. In this case, we do not accept the fix unless the hack is gone. We can prove the hack is gone (in this case by checking that the package is not installed and that the process does not run on startup).
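
That kind of check is easy to fold into an automated acceptance test. Here's a rough sketch; the package and process names are invented stand-ins for whatever the hack actually shipped as:

# Hypothetical negative acceptance check: fail unless the hack is really gone.
my $pkg  = "nfs-workaround";          # invented package name
my $proc = "nfs-workaround-daemon";   # invented process name

# dpkg -s exits non-zero when it has no record of the package
system("dpkg -s $pkg >/dev/null 2>&1") != 0
    or die "FAIL: package $pkg is still installed\n";

# pgrep exits non-zero when no matching process is running
system("pgrep -x $proc >/dev/null 2>&1") != 0
    or die "FAIL: process $proc is still running\n";

print "PASS: the hack is gone\n";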

When you're working on accepting a story or a feature, don't forget to consider the acceptance of removals as well as of additions.

Tuesday, December 2, 2008

Self-Selecting Specialists

One of the principles of many agile (and agile-like) methodologies is the idea of generalist resources. This is a project manager's dream - anyone on the team can do any of the team's work! I've never seen this actually happen.

In practice, the people I work with are what I call self-selecting specialists. Since we are committing to work as a team, then the team gets to decide among itself who actually does what part of the work. And when you give the people on a team the ability to choose what they do, I've always seen it result in specialization.

Take some of the typical mix of tasks in my current QA team:
  • fix a bug in our automated test report
  • an enhancement to the log fetcher that grabs logs from a set of machines you specify
  • create a code interface to manipulate some third-party software (and we'll later write regression tests that use this)
  • poke at the failover feature in conjunction with writes over a certain protocol
  • run some tests comparing the performance of various writing programs on Windows (copy, drag-and-drop, Robocopy, etc)
If the team gets to pick, you can pretty much guess who's going to pick what. The guy who considers himself a software engineer (just moonlighting in QA) will want the enhancement to the log fetcher. The guy who likes designing test systems and wants to learn object-oriented perl will volunteer to create the interface to the third-party software. My more junior guy will want the performance comparisons (since they're structured and he knows he knows how to do that). Etc....

Yup, in theory we should all be generalists. But we're all human: we do what we enjoy doing, we do what we think we're going to be successful at, we do what feels like just a little bit of a stretch. We end up specializing not because someone told us to, but because we chose to do it.

And - other than our beleaguered project manager and our process zealots - I think that's okay. To me, this is a prime opportunity to show off the powers of pairing. Let the specialist do the work that he enjoys, and pair someone else with him so that the other person has exposure. You still get the speed and reliability benefits of specialization, and you also get the increased coverage and learning that generalism brings.

So here are the two heuristics we try to use for specialists:
  1. You wanna do something you "specialize" in? Great. Do it.
  2. You have to pair with someone who does things you're weaker in, and you have to drive the pair. I don't care when or what, but you have to do it for at least 8 hours a week. That way you get the fun stuff that you like and are good at, plus you're also learning.
At least for the teams I've worked on, self-specialization is a choice that pretty much every member of the team has made. I'm not going to fight it. Instead, I choose to use that self-selection as a solid base, and nudge each team member to make that base bigger, working always from the person's comfort zone.... and using pairing to just expand that comfort zone a little at a time.

Monday, December 1, 2008

Notice With Purpose

When we're testing, particularly when we're doing exploratory testing, it's a very intense level of interaction with the system. There's a lot going on with any reasonably-sized system, and we have to be watching for tiny bits of information in a sea of data. So we keep our senses on alert, watching logs,  GUIs, timing, messages in third-party systems that interact with our system....

Senses..... alert!


But noticing is only half the battle. We have to notice with purpose. Not everything that happens is a sign of a problem, so it's important to filter out things that are unimportant. For example, I can be testing a system and notice that another tester is writing to my system. Is that important? Probably not, really. So what's the workflow here?
  • Notice something. For example, someone else writing to my system. This is anything that jumps out at you - a log message, some GUI behavior, some metric.
  • Decide if you caused it. If I wrote data and then noticed that I wrote data, well, that's probably not interesting. If I was trying to cause an error to occur and it occurred, that's good to know but probably not something worth spending too much time on.
  • Decide if it's standard background tasks. Many systems have normal things they do, like log rolling, etc. That's also not likely to be interesting. Be careful here, though - a standard task at an unusual time or in an unusual way is NOT standard.
  • Write it down. It may not seem important now, but it's worth a quick note in your testing notes.
  • Pursue it if it has potential to be related. Here's the real trick. If you chase every odd thing you notice, you'll never get anywhere.  So notice, and only follow up if you know what you're going to do with the follow up. For example, if I've noticed that someone else is writing to my system, I'm not going to follow up if what I'm testing is user creation in the admin. Writing simply doesn't matter to that (at least, I'm 99.3% certain of that). I will follow up if what I'm testing is data deletes and I'm seeing odd things going on.

There's simply too much information for it all to be currently relevant. This applies across many disciplines. In exploratory testing we call it mission-based tests, or directed tests. In meetings it's often called the parking lot. In  management we call it exception reporting.

Noticing things is good.... that's how you find the subtle bugs. But noticing everything will just leave you overwhelmed; you simply can't pursue it all. So notice, decide why you noticed, and move on with or without that thing. Notice... with purpose.

Wednesday, November 26, 2008

Smarter Different, Not Smarter More

One of the standard questions I get asked by candidates is:

Why do you like to work here?

This is the candidate equivalent of "tell me your greatest weakness." Hey, all sides in interviews get our cliches!

And just like any well-prepared candidate, I'm prepared for that stock question. The answer is pretty much always:

Because I really like working with people who are smarter than I am.

Sounds pretty good, right? Flattering to the candidate (you, too, could be in such a smart group!), a cliche answer to a cliche question (keeps the candidate comfortable), and is an honest answer (I have a lot to learn from my coworkers).

Except it's wrong.

"Smarter" doesn't really mean anything. And I don't honestly want to be the dumbest person in the room (talk about a blow to the ego!). What I really enjoy is working around people who can teach me something. I also enjoy working around people to whom I can provide insight and information. I'm happiest if it's a two way street - we're all learning!

What I really want is to work around people who know different things than I do. And then I want us to teach each other.

Yeah, that's what I like about my job.

Tuesday, November 25, 2008

Tricky Time

I was looking at logs yesterday, and we noticed something odd in the syslog (I've edited this to make the problem more obvious):

Nov 18 12:03:16 portal-02 postfix/master[454]: 
Nov 18 12:03:20 portal-02 kernel: nfs
Nov 18 17:03:27 portal-02 apphbd[2763]: WARN: 
Nov 18 17:03:27 portal-02 apphbd[2763]: WARN: 
Nov 18 12:03:53 portal-02 kernel: nfs:
Nov 18 12:03:54 portal-02 kernel: nfs
Nov 18 12:04:17 portal-02 postfix/master[454]:
Nov 18 12:04:17 portal-02 postfix/master[454]:
Nov 18 17:04:26 portal-02 stunnel[24095]:
Nov 18 17:04:26 portal-02 stunnel[24122]:
Nov 18 17:04:26 portal-02 stunnel[24095]:
Nov 18 17:04:27 portal-02 stunnel[24122]:
Nov 18 17:04:27 portal-02 stunnel[24125]:
Nov 18 12:05:18 portal-02 postfix/cleanup[10163]:
Nov 18 12:05:18 portal-02 postfix/cleanup[10163]:

See it?

The time is "jumping around". It starts at noon-ish, and there are some entries at 5pm, and then some more at noon-ish. Very weird.

It took a good couple hours to track this down. And I should note that this is a Debian Linux syslog.... have you figured it out yet?

.
.
.
.
.
Hint time:
There are two things you need to know:
  • That time stamp is the local time of the process that has the event.
  • Processes set their time zone (their local time) when they start.
.
.
.
.
Got it yet?

The time zone (/etc/localtime) was changed after boot. Any processes that were restarted  - apphbd and stunnel, in our example - got the new time zone. Any processes that didn't restart stayed in the old time zone.

Seems simple once you know what's going on!

Monday, November 24, 2008

Reconciling Plan Length

I - and many of my friends in software - work in fairly short cycles. Two weeks is the most common, really. Whether they call it SCRUM, Agile, XP, or "what we do", the process works basically the same:
  • An ordered list of things to do is provided to development
  • Development estimates the work
  • Dev and the customer (or customer proxy) draw a line at two weeks and that's what dev signs up for
This works pretty well in the short term. However, translating this to a longer work schedule is more difficult. With this process, how do you know whether you're on track for a deliverable that's 5 months away?

Okay, let's start with the process zealots: Yes, your customer should be committed to the process you're using, and should be working with the two week cycles.

And now in the real world.... we have more than one customer, and these customers have requirements and cycles that are larger than the two week development cycle. Ultimately, they need to plan development and rollout cycles in months or even years. 

I haven't actually solved this problem; I don't know how to reconcile the need to have a feature in three months and rolled out in six months with the two week planning.

Things we've tried, with various degrees of success:
  1. Stick it at "about the right place" in the backlog based on velocity and adjust a bit as it gets closer. This one often winds up with starting it a bit too late due to an overoptimistic idea of how long it will take.
  2. Create an earlier task to estimate the item, then stick it in the right place based on velocity and estimates. This works a bit better but still suffers from optimistic estimates.
  3. Put the item early in the backlog so there's plenty of time. In practice this is tenuous if you have more than one client or more than one thing going on.
What else have you tried to reconcile your development cycle with a client's longer-term plans?


Friday, November 21, 2008

You're Doing It Wrong

The last refuge of the process zealot is "it's not working because you're doing it wrong".

I've read several blogs recently, and been at or had friends I trust at several companies, and they all keep coming back to: "It's not working!"... and the answer from proponents of that process (XP and SCRUM mostly) is "Well, you're doing it wrong."

That's not helpful.

If someone's "doing it wrong", tell them that. And then tell them what they're doing wrong and help them fix it. Otherwise, you're just part of the problem.

Thursday, November 20, 2008

Where You Write It

I've written before about things that are "just known". This is what we call institutional memory, but the problem is, institutional memory is transient. People come, people leave, people forget.

One of the ways to get around the problem of things that are simply "known" is to write them down. That's not the only trick, though. There are lots of places to write things down:
  • Email: This one really doesn't work very well. It requires either the sender or the recipient to still be there later and remember that the information is available.
  • Document: This is better than email, but make sure you store it somewhere centrally. Keeping it on your laptop has the same problem as email - others can't get to it.
  • Document on a Central Server: This is accessible, but not the most easily searchable.
  • Document in a CMS or Wiki: This is generally the easiest to update and to search. However, formatting isn't the best.
In the end, as long as it's accessible to the entire team, pretty much any method works. Just make sure it's consistent.

Friday, November 14, 2008

Not a Mind Reader

Late in a release cycle, there's an inevitable conversation:

"You mean you just found THAT?! That's been wrong for a long time! Oh, we've GOTTA have that one fixed."

Okay.

There are two things  that really grate about that statement:
  1. Age of bug does not correlate with defect priority
  2. If you've known about it, why isn't it in the defect tracking system?
Bug Aging and Priority
A bug's age has nothing at all to do with its priority. A really old bug can still be low priority. Conversely, a brand new bug can be high priority. Bugs can change priority through time, but that's not age specifically. Instead, a bug changes priority as usage patterns and features around the issue change.

For example, if you have a bug in your Active Directory integration and you're selling into UNIX shops, that bug is probably low priority. When your sales team starts landing major customers who have mostly Windows environments, that bug might become higher priority. Why? Because now you're more likely to hit it in the field.

If You Knew....
There are a lot of different groups that might define a particular behavior as a bug - sales, support, development, QA. Just because one group finds a bug doesn't mean another group has any clue that the defect exists. Even if the behavior is known, one group might assume that behavior is correct while another group considers that behavior absolutely ludicrous.

Enter the defect tracking system. This system is not the exclusive domain of QA, or even of development and QA. Here's the amazing thing: anyone can enter a bug! So if support feels that something is a bug, they should enter it. Same goes for sales, development, QA, anyone. From there the bug can go into a standard triage process. But if the bug never gets in, it's not going to be fixed. 

I am not a mind reader.


So if you'd like a bug fixed, great. All it takes is two simple steps:
  1. Log the bug
  2. Explain why you think it's high (or changed)  priority.
If you don't do those two things, don't expect other people to automatically know there's a problem and fix it. Be an active part of the process; your results will improve immensely.

Thursday, November 13, 2008

It Hurts When I Do This

In test there are good days when everything does pretty much what you expect it to. And there are bad days when it seems like nothing you do - even something you did just fine yesterday - works.

On those days, being in test is kind of like being that guy who goes to the doctor and says, "Doctor! It hurts when I do this!".................. all day long.

It's a recipe for frustration by noon.

So calm down. Grab logs and whatever diagnostic information you need. Go for a 5 minute break. And then re-baseline your system and start over fresh. You, my friend, have entered a bad state and continuing to try won't make it better. Oh yeah, and anything you do find in this state is highly likely to be hidden by fixing the issue that got you into this bad state. Once you've captured the root issue, continuing isn't helping. You're just repeating the thing that makes you say, "It hurts when I do this."

And don't forget the last part of the joke; it's relevant:

Don't do it any more!

Wednesday, November 12, 2008

"Word Game" Analysis

You know that word game where you change one letter at a time to turn a word into another word? Each middle step must also be a word, and you can only change one thing at a time.

Like this:

HATE
HAVE
HOVE
LOVE

Turns hate into love (yeah, I know, it's a cliche, but it was an easy one).

When you're trying to track down a problem and your initial analysis is getting you nowhere, it's time to start eliminating variables. To do this in a systematic manner, try the word game. How do you get from their config to your config? One change at a time... oh, and each one must be valid (a "word" in our analogy).

For example, dev and QA were seeing very different results on the same test. So how do we get from dev's environment to QA's? Let's lay out our "word" from dev to QA:

Version   Fullness   Type        Data           Encoding   Size
HEAD      0%         from code   10 GB file A   M          4 node
.
.
.
4.2       75%        from CD     35 GB file B   E          16 node

Now we have a map. Instead of randomly trying configurations, we're going to walk from dev to QA, changing one thing at a time and seeing what the results of our tests are. It's not about trying every configuration. It's about giving yourself a structure so you can home in on a solution with as few different steps as possible. It's a way to look for progress and identify areas that are likely to cause a problem.

Let's look at our "word" again, this time with test results:

Version   Fullness   Type        Data           Encoding   Size      Result
HEAD      0%         from code   10 GB file A   M          4 node    15
4.2       0%         from code   10 GB file A   M          4 node    15
4.2       0%         from code   10 GB file A   E          7 node    13
4.2       0%         from code   35 GB file B   E          7 node    12.5
4.2       0%         from CD     35 GB file B   E          7 node    8
4.2       75%        from CD     35 GB file B   E          7 node    7.2
4.2       75%        from CD     35 GB file B   E          16 node   7.2

Looking at our output, we now start to get an idea of what areas are actually making a difference in performance. In our example, it turned out that installing from CD gave you a bit of a different configuration than running from code. Fix the CD installer (the code was correct) and the results for CD-based configurations started matching the code-based configurations.

Use this word game technique when you're facing a lot of variables and no real indications as to what is important and what is not. You'll never have enough time to run all the tests you can think of, so get some structure and start narrowing down the list of possible problem variables.




* For the record, the results I used in the example are made up. The example itself is real, though.


Tuesday, November 11, 2008

Slow Down and Follow Up

I talked yesterday about giving too much status, but do be careful about the inverse problem - too little status. In particular, it's important to provide status when someone helped you start something and you've finished it.

For example, today I was working on bringing up a system and it just wouldn't work. The servers kept spitting errors at me instead of doing what I wanted.

So I asked for help. One of the server guys came over and very politely looked for a minute or two and pointed out that - due to my previous messing with the system - I had managed to get mismatched versions on there. He told me how to fix it and left.

(Fast forward about 45 minutes)

I got the system up, having made the recommended changes. Awesome! I'm done here, right?



Nope.

I wrote a note to the guy who helped me. It was just two lines: "Hey, that fix worked. Thanks for the pointer." And now I'm done.

The important part here is that I followed up. It took 30 seconds of my time and about 10 seconds of his (to read it), and now no one's wondering what happened, or if the problem was fixed, or if I was too busy to even notice that someone took time out of his day to help me. In two sentences, we've resolved all doubts: yes, it worked; yes, I notice and appreciate the help.

It's small, but it makes life around the office a bit more friendly.

Monday, November 10, 2008

Status! Now!

For a long time I was a member of the "overcommunicate" school. Basically, the theory was that if there was any doubt, better to say something.
  • On the trail of a bug that is likely to block a release? Say something, even if you haven't quite got it pinned down yet.
  • Running a bunch of tests for a high profile client issue? Say something, even partway through.
  • Got a boss who is constantly getting asked about state? Give him state near constantly.

In general this works pretty well. Your audience - typically developers and (for high profile issues) your boss and/or support and/or the release team - feels like they're not missing anything.

I'm starting to realize that may not be the best course of action, though. There are some serious downsides to always saying something:
  • The signal to noise ratio can get out of whack. If your updates aren't substantive, then they'll start to get ignored, and woe on you when you have something really important to say.
  • Going on vacation is a pain. Most people don't think to update others as often as you do, so disappointment with the coverage while you're gone is inevitable.
  • It takes time. You can be working or providing status, but not both simultaneously.

My new working theory is to set a time when I'll say something, and then provide updates at that frequency unless something truly major comes up. So, for that hot client issue we'll update with test status once a day; two updates in a day means that something major (hopefully a huge fix) has happened.

How do y'all balance communicating enough with saying little enough to give your communications weight?

Friday, November 7, 2008

Small Tricks

Ironically, since coming to work for a storage company I've thought more about efficiency of data storage than I think I ever have. First let me admit: I'm a bit of a pack rat when it comes to electronic data. Deleting just isn't my thing. And I'm not the only one!

So, we have a number of things on tiered storage. For example:
  • test logs are stored on the machine that runs the test for 5 days, then deleted (gasp!)
  • logs for tests that failed are stored on a network server (think NAS), then backed up to archive storage (our own product)
  • logs for issues that have happened at clients are stored on a network server, then backed up to archive storage
  • generated test data, syslogs, and other non-test artifacts are stored on a network server, then deleted
In many of these cases, we have scripts that actually do the work - monitor the fullness of the file systems and then back things up. Basically, they check every half hour to see if the primary store is more than 90% full. If it is, we email the group as a notification and then start cleaning it up.

When we originally wrote the cleanup script, we wrote it to loop through and clean things out until it got below 90% full. As a result, we were getting notified multiple times a day. It would clean to just below 90%, and then as soon as someone wrote to it, the file system would go back over 90%; we'd get notified, and the whole thing would start over.

Here's the small trick:

Notify at 90%. Clean to 80%.

We changed the script and notifications dropped from two or three times a day to once a week or so. That's a lot less email.
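For the curious, here's roughly what that looks like as a script. This is a minimal sketch of the hysteresis idea: the thresholds match the trick above, but the "/logs" path and the notify() and delete_oldest_logs() helpers are hypothetical stand-ins for whatever your monitoring script actually uses.

#!/usr/bin/perl
# Minimal sketch: notify at 90% full, but keep cleaning until 80%.
use strict;
use warnings;

my $NOTIFY_THRESHOLD = 90;   # send email when we cross this
my $CLEAN_TARGET     = 80;   # keep deleting until we're back below this

my $usage = disk_usage('/logs');
if ($usage >= $NOTIFY_THRESHOLD) {
    notify("Log store is ${usage}% full; cleaning down to ${CLEAN_TARGET}%");
    while (disk_usage('/logs') >= $CLEAN_TARGET) {
        delete_oldest_logs('/logs') or last;   # bail if there's nothing left to delete
    }
}

# Parse the "Use%" column out of df output; crude, but enough for a sketch.
sub disk_usage {
    my ($path) = @_;
    my ($line) = grep { /\d+%/ } `df -P $path`;
    my ($pct)  = $line =~ /(\d+)%/;
    return $pct;
}

sub notify             { warn "@_\n" }   # stand-in for the group email
sub delete_oldest_logs { return 0 }      # stand-in for the real cleanup logic

The gap between the two thresholds is what keeps the file system from bouncing back over the notification line every time someone writes a few more logs.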

Small change, big effect.

What small change can you make today that just might have a big effect?

Thursday, November 6, 2008

Too Big to Be a Bug

In the vast desert of "it should work this way but it currently doesn't", you have bugs and you have future features. There are a lot of different ways to distinguish between a bug and a feature, including by whether the user would expect it to work or see it as new, or by whether implementation has been attempted. One of them, though, is a bit unusual to me...

It's not unheard of for a developer to put this statement in a bug:

"Bug XXX is too big to be a bug. Please write a story for it."

Wait, what?

This is just yet another way to distinguish between a bug and a feature. What that statement really means is:

"The amount of work required to achieve the desired behavior is large enough that I would like to get credit for it and have it tracked. So please put it in the story process."

You see, like many XP shops, our development group tracks time spent working on stories fairly closely: it's part of velocity calculations, it's easily visible to our customers, and it's discussed explicitly quite often. Bugs aren't tracked that way. They're the "extra" work that's done in the background. Basically, bugs are second-class citizens.

Now, bugs found in the initial implementation of a feature are one thing; they're noticed because they prevent story acceptance and therefore get discussed. It's regressions introduced by refactoring, or bugs that are missed in initial testing, or bugs in legacy code (that predates the story process), or bugs that fall sort of between stories (and are found in more general testing) that get treated as "extras."

Bug fixes are as valuable as stories, and should be tracked as closely as stories.

The developer who wants the bug turned into a story instinctively understands this. He's asking for credit for the work he's going to do on the bug. I think he's perfectly right, as well. Fixing bugs is development work and as such should count.

Now, the reality is that I don't particularly like turning bugs into stories. After all, how long something takes is orthogonal to what it actually is (bug or feature, for example). So instead I propose that we start putting bugs in the story queue. That's right. If a bug matters, then put it in the backlog. It's more important than some features, less important than others, and it's development work that needs to be done. Sounds like the definition of a backlog item (to borrow a SCRUM term) to me.

How do y'all handle bugs in a process that rewards development work but doesn't always surface the time spent fixing problems?

Wednesday, November 5, 2008

Did I Find a Bug?

I originally wrote yesterday's post with a certain title:

Doesn't Work

I happened to check it today and noticed that the title bar of my browser says:

Doesn't Work

But the title of the post in the blog itself is:

Doesn't Work

Whoopsie! So, did I find a bug? Let's look at the arguments:

No way! Not a bug!
  • Obviously, what happened is that the blog software stripped illegal content from my title.
  • Who writes a title like that anyway? We've got a classic case of "Users would never do that!" (tm).
Totally a Bug!
  • There's an inconsistency between the title bar and the title in the page itself. They should at least be consistent.
  • There were no warnings or errors that the title was going to be changed when I hit save. Not telling the user is a bug on its own, regardless of whether the title should have been changed.

I think this one is pretty clearly a bug, mostly because of the lack of user feedback that a change is being made. It's probably not an important bug, though.

Would you log it?



Update 11/5 12:40 pm:

It gets even weirder. Here's what I wrote:


And here's what got published.


Wacky.


Tuesday, November 4, 2008

Doesn't Work

This may be the worst way I've heard to start a conversation about an issue you think you've found:

"Feature X doesn't work"

Huh? This phrasing is:
  1. Pretty unlikely. For all we tease developers sometimes, it's pretty darn rare for a feature to not work at all under any circumstances.
  2. Antagonistic. Congratulations. You've basically accused the implementors of totally screwing up, quite possibly on purpose.
  3. Really hard to do anything about. What exactly is actionable about that statement? What are you expecting the person to do?
So before you go running around being imprecise and pissing people off, stop and think. Make sure you're:
  1. Being polite.
  2. Being precise about what you did and what you saw.
  3. Expressing a desired action, whether it's a fix, some help tracking the issue down, or just a sounding board for a rant.
If you're talking to someone about an issue, please be careful. You'll have much better luck getting the attention you want if you approach the problem in a way that encourages people to help you. 

Monday, November 3, 2008

Toys

Engineering motivators can be a bit difficult. After all, sometimes you want to motivate people to do things (refactor code, create new features, pair program, etc). Sometimes you want to motivate people to not do things (break the build, write lots of bugs, check in without running some code first, etc). In a team environment there's the fun twist of motivating individuals and the entire team.

Like many dev organizations, we motivate with recognition.... in the form of toys.

Build Status
For the build, we have an ambient orb:


I think this one is quite common. Red means the last build failed; green means the last build succeeded; purple means it's currently building. This one is all about the team: it sits in the middle of the room and glows on all of us. After all, we all have to get that build fixed.

Serious Breakage
The second one is for the person who breaks the build, breaks the lab, or otherwise seriously compromises development's ability to work:

(Ours is a little different, since one of the figures has no head. We did stick a slip of paper with a smiley face in the neck hole, though!)

This one sits on the culprit's desk, large and truly ugly. Invariably someone who doesn't work in dev will ask, "what is that?", and whoever has it gets the added joy of explaining why this figurine is on his desk.

Bug Hunter
Sometimes you find a real doozy of a bug that totally takes down your system. To that finder goes the Fubar:
(Yes, this is a real tool - how awesome is Stanley Tools?)

This goes to the person who discovers the deep and subtle yet really nasty bug.  I should note that it's not always a QA Engineer who has this award; developers and support can also find major issues. And in pretty much every case, it's much better to have found it in dev than in the field!

Code Shearing
For the discerning developer, we have the code shearing award:

Yup, those would be the Bolt Cutters of Deletion. To get this one, you have to find a chunk of code that isn't being used (or shouldn't be used), refactor, and delete the junk. On a large multi-year code base like ours, knowing when to delete is worthy of recognition!

What awards do you have around the office?

Friday, October 31, 2008

Who Cares?

We're in the throes of a release cycle, which leads to all sorts of fun conversations, many starting like this:

"So, is bug 123 a blocker?"

Well, that's an interesting question. Like many organizations, we have guidelines for this sort of thing:
  • if it results in data loss or corruption, it's a blocker
  • anything that makes the system crash is a blocker
  • anything that's going to create excessive support calls is a blocker
It's more subtle than that, though. If a blocker will make you miss the release date, and you have revenue riding on that release date, is it still a blocker? How about if it won't affect the customer providing the revenue?

Ultimately, for each bug the real way to understand if it's a blocker is to ask:
  • who cares?
  • what will this entity who cares do if they hit this bug?
  • what are the consequences of fixing this bug?
  • which is worse - what happens if the bug occurs in the field, or what happens if we go ahead and fix the bug?
If it's worse to fix it, then it's not a blocker. If it's worse to hit it, then it's a blocker.

Ultimately, whether an issue is a blocker depends on who your real customer is and what they will do in the "fix it" scenario and in the "don't fix it" scenario.




* An aside for the (quite large) school of thought that says testers provide information and do not make these decisions: well, I'd rather not get into that argument. After all, we're not talking about who makes the decisions here; merely about how the decisions get made.


Thursday, October 30, 2008

Too Busy For a Solution

Go read this about being so busy dealing with a problem that you never get to fix it.

Sure, we all know we need to balance today and tomorrow, but that's one of the first things to lose sight of when you're busy. And yet you have time to read this blog.

So here's the deal I'll make with you. I'm going to stop this particular blog entry here and save you the 30 seconds more you would have spent reading this.

You breathe. Take the time you just saved and think of one thing you can do to make your tomorrow better. Then go do it.

Wednesday, October 29, 2008

Wacky Alternate Methods

Welcome to my anti-cygwin tirade.

First, let me say that cygwin has its place. It's great when you have a lot of utilities that are UNIX-based and you need to introduce Windows. For example, my company uses a reservation system for machines. That reservation system is a UNIX script (I'm oversimplifying slightly), and cygwin lets us use the same reservation system for our Windows machines.

But

Cygwin is a crutch.

Because we have cygwin...
  • we can mount drives through cygwin instead of standard Windows methods
  • we can copy files using cygwin rather than through drag-and-drop or Windows copy commands
  • we don't have to learn Windows
That crutch has made us weak, in particular because of the last part. Just because you can get a UNIX-like environment on Windows doesn't mean you should always use that environment. Just because you have a crutch doesn't mean you should always use it. Save the crutch for when your leg is broken and you have no other choice.

End anti-cygwin tirade. Promise!

The problem really isn't cygwin. The problem is choosing to use that instead of the native (Windows) OS wherever possible. The problem is that when you use cygwin you're not really doing what your users do on Windows - you're using wacky alternate methods. So you copied a file with cygwin. That's different from copying a file with Windows Explorer, and one day you're going to miss a bug because of that.

If you really do have to put something powerful and rather odd on a system (aka a crutch), that's fine. Do it. Just recognize that you're forcing something into a place it doesn't quite fit, and don't use it for anything but that one purpose. Wherever possible, use the native functionality instead. Is it more work? You betcha. Is it better in the long run? Absolutely.

It may be a bit unfamiliar, but you'll learn it, and in the end you'll be much better for having both.

Tuesday, October 28, 2008

Bugs As Records

A bug in a defect tracking system has a lifecycle:
  • it's logged
  • it's triaged
  • it's discussed
  • it's fixed
  • it's verified
  • it's (maybe) reopened
Most of the time, after a bug is fixed and verified, no one ever looks at it again.

But....

There is one other major use of a bug, and that's as a record.

When you don't know what's going on, and don't know where to start debugging an issue, the defect tracking system can be a great reference. Look up the error message you're getting and see if it sparks anything. Look at all the things you might find:
  • Maybe you have to reopen the bug (oh no!)
  • Maybe you find out that the bug was fixed a bit later than you remembered and the fix is in the next release, so you've just found another occurrence of the problem
  • Maybe it didn't happen quite that way, but it points you toward another log that has some interesting information
  • Maybe that module didn't throw the error, but another one did and the calling module is the same
  • Maybe you find out that this really has never happened before (at least that your defect tracking system knows about)
The point isn't that looking at the defect tracking system may help find a duplicate. The point is that even if it doesn't find a duplicate, looking at the defect tracking system may help you think about the bug a bit differently.  It's another way to think about solving the issue.

So when you're puzzled by a problem, don't forget that closed bugs are a resource, too.

Monday, October 27, 2008

Bug Verification Checklist

Verifying bugs is a bit of an art. It's also a time to make very very sure you're right. After all, there are only four possible scenarios:
  • You verify a bug and it's actually fixed. This is what we want to have happen.
  • You verify a bug and it's not fixed. This means you're going to find it in the field. Your customer will be unhappy AND you'll have egg on your face. Not good all around.
  • You kick a bug back to dev and it's not fixed. This is the second best scenario; a fix would have been better but, hey, at least we caught it. In the end, it's not much worse than finding the bug in the first place.
  • You kick a bug back to dev and it's actually fixed. This is where we waste time. It's a minor embarrassment and it erodes developer trust in you a bit (rather like finding a bug that's clearly not a bug).
So, with only one happy outcome, one mediocre outcome, and two chances for us testers to embarrass ourselves, let's approach defect verification carefully.

Here's what I do to verify the bug:
  1. Make sure I'm running a build that has the fix in it. In particular when there are a number of branches this is something that needs double-checking. Rely on check-ins and build tags for this, not on bug comment time stamps.
  2. Repeat the steps that reproduced the issue and make sure the behavior is what I expected. This is the obvious part. I try the thing that broke before and see if the behavior has changed. If there's zero change in behavior (i.e., the exact same thing happens), I'm really suspicious - after all, the fix attempt is likely to have at least modified the system behavior, even if the fix is incomplete.
  3. Make sure I got all the way through. I can't prove this bug is resolved unless I can prove I exercised the thing that used to cause the bug. A failure before we even get as far as the bug leaves me in a verification form of Schrödinger's Cat - I can't prove whether it was fixed or not!
  4. Look for markers of resolution. Often a bug fix will include a mark or a note that is a secondary way to know the bug was fixed. Usually this is in the form of a message that does appear in the log (XX complete) or that does not appear in the log (failure message "blah" does not appear). Look for the positive indicators of a fix - success message - in addition to the negative indicators of a fix - lack of prior error.
  5. Reread the bug. Think I got it? Great. I'm going to read the bug one more time, start to finish, especially with a really long bug. Maybe the behavior morphed over time, or maybe there is a reference to another problem that sometimes hides this problem. Maybe there's a reference to some documentation change that should be opened as a separate request. Once this bug is closed out, it's unlikely to be read again, so make sure you get everything out of it that you need.
Once you've done all that, only then can you mark the bug as verified or failed, depending on what you found.

Friday, October 24, 2008

Wabbit Hunting

There are two kinds of escalations that come in from support: those where the issue is still ongoing and those where the issue is no longer occurring but we'd like to understand what happened so we can fix whatever first caused the problem.

I like to think of these as issues that are alive and issues that are dead (and just need a postmortem).

Issues that are dead are, in their own way, simpler. You take the logs, the issue description, and any other information that has been gathered, you apply your 5 whys or other form of analysis, and you state what you believe occurred. Since the issue isn't ongoing, proof is difficult to come by; you're looking for the most likely cause and what you can do to prevent recurrence of that most likely scenario.

Issues that are alive - that are still ongoing - are different. Now we're wabbit hunting.


The issue is still occurring. Either it hasn't been fixed at all, or recovery has been attempted and the problem has happened again. Your goal here is different; it's not about finding ultimate cause now. It's about getting the customer running again.

To be sure, a lot of your analysis techniques still apply, but don't be afraid to start fixing. This isn't the time for a leisurely analysis. It's a time to balance analysis with action. Got a problem? Great, get that problem to stop. Then see if there's another problem. As long as you're nondestructive and you're actually looking for the cause of the problem rather than simply hiding it, getting the customer up trumps creating an elegant theory. 

"What could we have done better?" is a question for a dead issue. 
"What can we do now?" is a question for a live issue.

To stretch the analogy: Shoot the wabbit. THEN figure out how it got into your garden.