Friday, January 30, 2009

Control Risk Through Exposure

We've been working on a new hardware platform.

This is awesome in a number of ways:
  • New hardware can help us increase our performance
  • Decreased end-of-life worries
  • New... and shiny!... toys are always fun.
This is scary in a number of ways:
  • checking to see if hardware actually works - reboots consistently, doesn't have massive amounts of dropped packets, doesn't freeze, etc. - takes weeks and can't really be shortened
  • confirming we can get enough hardware to keep up with our development and testing needs (funny, no matter how many we get, we can always effectively use more!)
  • if something isn't right, it won't be fixed quickly - weeks or months rather than hours or days - and this will blow the whole schedule out of the water
So, we're in a situation where we have a lot of promise and a lot of risk.

Don't hit that panic button just yet.

Let's do some risk analysis. And let's write it down. It doesn't have to be much. Just let your imagination run loose and write down all the things you can think of that could go wrong. The sky's the limit - everything from "it might be too heavy and the floor in the lab might break" to "there may be some horrible subtle disk firmware bug that causes systems to randomly freeze".

And then for each one, write down what you're going to do when it comes true. What software will you write? What processes will you put in place? What checks will you do? What backup vendor will you contact?

And then for each one, write down how you are going to prevent it. What tests will you run? What agreements will you make with a vendor? What stress tools will you use?
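
If it helps to see what "writing it down" might look like, here's a tiny sketch - in Ruby only because that's handy; a wiki table works just as well. The risk, reaction, and prevention below are invented, so plug in your own:

    # One hypothetical written-down risk; the content is made up, so substitute your own.
    risk = [
      ["Risk",       "Subtle firmware bug freezes systems at random"],
      ["Reaction",   "Capture logs, open a vendor ticket, fall back to the old platform"],
      ["Prevention", "Run a multi-week soak test with our stress tools before we commit"],
    ]

    # Print it so it can go on the project page (or a wiki, or a whiteboard).
    risk.each { |label, text| puts "#{label}: #{text}" }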

Sometimes the best way to stave off the risk is to simply write it all down, and then write down exactly how you're going to conquer it.

You win.

Thursday, January 29, 2009

Choose Your Reaction

We all have bad days. Whole companies have bad days - or weeks, or months, or even years. With all the layoffs and down economy lately, it's really easy for work to become a very angsty place.

So you can give in. Get territorial and defensive. Freak out about losing your job and start hoarding your knowledge so that they need you. Work long hours and get snappy because you're getting tired and/or burned out.



Or don't. Don't get me wrong, this particular downward spiral is easy. But boy will it kill you and your career.

So decide not to.

Instead of hoarding information, find new ways to share it. 
Instead of worrying out loud, be enthusiastic.
Instead of working extra long hours, work more efficiently during your normal hours (close that RSS reader!).
Choose not to engage in battles that aren't about the product and making it better.
Choose not to engage at a personal level but only at a professional level.

Remember that sometimes you win by not being the one who has to have the last word.

And every day, find a way to get yourself and someone else to smile. It's still okay to enjoy your work.

Wednesday, January 28, 2009

Problem Types

I've been thinking about scheduling and risk lately. Taking development and test together, there are some features that are far more risky than others, and I've been trying to figure out how to classify and explain this for others - product management, project management, etc. Most of these are things we just know - adding a field to an existing form is less risky from a scheduling perspective than trying to make the whole form load three times faster. That is, we can have more confidence in an estimate for the first than for the second.

I've started breaking down new features (or work efforts, to be precise) into three separate areas:
  • Fully understood problems
  • Partially understood problems that can be worked in parallel
  • Partially understood problems that are linear
Fully Understood Problems
These are the easiest and least risky. Going into this, you know everything that needs to be done to get it out the door. It's usually in an isolated or at least well-designed piece of code, test coverage is good, etc. Often these are problems that iterate on a current feature or piece of functionality. These are well-understood because we have essentially done them before.

For these, you can do an estimate and be pretty confident that you'll make it. Your risk in these scenarios comes from non-product areas (developers getting sick, the build system dying, etc.), and those are risks that you can quantify based on your past development history.

Examples: adding a field to a form, creating the fifth skin for an application, adding a new type or class of data, etc.

Partially Understood Parallelizable Problems
These types of issues are more risky than fully understood problems. Going into this, you know that you don't have all the information you need, and lack of information increases your risk. However, these are problems that can be approached from multiple directions at once.  This means you can increase your knowledge and decrease your risk to a certain extent by throwing resources at the problem.

For these, you need to add additional padding to your estimate. You can go ahead and do an estimate based on your history of these types of problems, but be aware that optimistic estimates are not the best route here. On top of your usual risk profile, you have the "there be dragons and I have an army" factor.

Examples: a new hardware platform, a new feature written in house, etc.

Partially Understood Linear Problems
The last type of problem to address is a feature or work effort that you don't quite understand and that needs to be worked on linearly. These are the features that you can't attack from a lot of different angles - you fix a problem to uncover the next. Throwing resources at this problem is unlikely to help; it's the classic baby problem (one woman can have a baby in nine months, but nine women cannot have a baby in a month). The best mitigation strategy here is to reduce your cycle time as much as possible; the time between uncovering a problem, identifying it, and fixing it should be shrunk as much as possible, because you don't know how many of these cycles are between your team and their goal. These problems also respond well to a "war room" approach, in which you create a dedicated team to solve this problem with all needed resources - dev, test, IT, etc. - and (physically) isolate that team so this is all they work on.

These problems have the potential to be the farthest from your original estimates and the hardest to fully estimate. Again, you can look at your team's history with these types of problems, but you need to assume that it will take longer than you think. On top of your usual risk profile, you have the "there be dragons and I am but one man" factor.

Examples: performance improvements, build failures, etc.
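
If a bit of code helps make the classification concrete, here's a minimal sketch of turning the problem type into schedule padding - the multipliers are completely made up for illustration; your own project history should supply the real ones:

    # Hypothetical padding factors per problem type - pull real numbers from your history.
    PADDING = {
      :fully_understood => 1.1,  # small buffer for sick days, build breakage, etc.
      :partial_parallel => 1.5,  # "there be dragons and I have an army"
      :partial_linear   => 2.5,  # "there be dragons and I am but one man"
    }

    def padded_estimate(base_days, problem_type)
      (base_days * PADDING[problem_type]).ceil
    end

    puts padded_estimate(10, :fully_understood)  # => 11
    puts padded_estimate(10, :partial_linear)    # => 25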


So when you're creating your schedule and assessing the risk, ask yourself what kinds of problems you're facing. If you can classify the kinds of problems you can solve, you can determine how solid your estimates are and therefore how risky your schedule is.

Tuesday, January 27, 2009

The Little Things...

There's a component of our product that has its own command line. When you first access it, you are presented with the following text:

Welcome to Foo. Type help to get it.

That always puts a smile on my face. It tells you the one thing you really need to know, and it does it quickly and cheerfully. (And no, I have no idea why I find that particular phrasing so cute.)

It's the little things that make the day brighter.

Monday, January 26, 2009

Know They Know

There are three types of knowledge related to information about your project:
  • Knowledge your audience has
  • Knowledge you think your audience has
  • Knowledge no one has
The first thing to notice here is that your knowledge doesn't matter. What matters is what your audience knows and what you can tell them that they don't already know. Knowledge your audience has (and presumably that you also have) is great. Knowledge neither you nor your audience has is straightforward - you can go get it. The danger area is knowledge your audience doesn't have but that you think your audience has.

This is where that ugly "miscommunication" business word rears its head.

Remember that date you changed and thought you told everyone? That's knowledge you think your audience has. Make sure. Mention it again.

Remember that third-party component you need that isn't quite released yet but you're going to have to take the risk to get the functionality? That's knowledge you think your audience has. Make sure. Outline both the situation and the resulting risk and mitigation.

No one's saying your audience has to know where you went for your last team lunch, or who wrote up what bug. But there are things that you need to make very sure are communicated, and then communicated again. That's the knowledge you can't afford to think your audience has. You need to know your audience knows.

Friday, January 23, 2009

What You Heard First

Things change.

(Ha! There's a profound statement!)

Release dates change, hardware configurations change, performance results change. Within reason, this is all just fine - it's part of doing business (and not just in software; it happens in other industries, too!). It's certainly fine for the team that's involved in the work. They know the project in great detail and are constantly immersed in the current state, current spec, current date, current performance. But....

There's a whole other group to consider: those who are peripherally involved in a project. Think upper management, support, sales, other development groups, etc. They need to know your dates, configurations, performance results, but they're not constantly immersed in it. I've noticed that in general these groups tend to remember the first thing they hear. So things change, and you tell these groups, and then a week later, you hear, "wait, I thought it was on [the old date]?"

So how do you keep people who are partially involved up to speed?

I don't know of a foolproof way to do this, but there are some things that help:
  • No verbal commits. Make sure it gets written down somewhere.
  • Keep things in one visible place. Have a project page available to everyone and make sure it's kept visible and current. Having a meeting with upper management? Use the project page as your notes. Having a meeting with support? Walk through the project page.
  • Keep a separate history. Once things change, the second question is "why?" (right after question 1: "What is changing?"). Make sure your project page contains a discrete list of what changed and why. I've gone as far as having a page with:
                 Release Date: March 20 (why?)
    And the "why" is a link to a page showing the history of the old date, the cause of the change, and the new date. It helps show that this is change for a reason rather than change for change's sake.
  • Get your story straight. Make darn sure you know what you're talking about before you spit out a date or a number. And make sure the entire core team agrees with you. Inconsistencies will simply be rejected and not remembered, or thrown out as a probable mistake.
Ideally, your first reporting remains accurate to the end of the project. In reality, that doesn't happen very often at all. Make sure that as things change you're keeping everyone on the same page, and doing it in a way that encourages the core team and other affected people to know and remember the most recent state.

Thursday, January 22, 2009

Don't Forget to Test

It's been quite a week. There have been a lot of meetings, a lot of emails, a lot of reporting. This afternoon, I get to test again. I feel a lot better when I can keep my hand in. When I keep testing, I can:
  • help my team get through crunch periods
  • know what I'm talking about in detail when I get asked about the software
  • keep my skills fresh
Boy I'd hate to give that up. No matter how manager-y you are, don't forget to keep testing!

Wednesday, January 21, 2009

Change It Up

When we're looking at what tests to perform on a given release, one thing is obvious very quickly - we could go on forever! At some point, though, one of two things will happen: (1) the software will get released out from under us; and (2) spending more time testing will not give us the value we need to get from that time (the law of diminishing returns takes effect). So we need to order the testing we're doing according to some logical dimension. We need to do the tests that give us the so-called biggest bang for our buck first.

And how do we pick those tests? By the quantity of bugs found, and by the worth of the bugs found.

Huh?

Quantity of bugs found is a fairly straightforward one. If you have a set of tests that flush out a lot of your bugs, running those early helps. The earlier you get those bugs into dev (and management), the earlier you can react to them. So test the target-rich areas of the application first.

Worth of bugs is a bit more complex. Some bugs are simply things that your customer doesn't care about, or wouldn't see. Other bugs are "kill your business" kinds of issues - crashes, data loss, that kind of thing. We want to find high value bugs - those that are worth a lot to your customer - early. So run tests that produce high-worth bugs first.

Reconciling the two priorities is always interesting. One thing we have to do before every release is validate which tests and techniques in our arsenal are likely to produce many bugs, which will produce high-worth bugs, and which will produce both. This will change by release. If an area of the product doesn't change between releases, a technique that found lots of bugs there in the previous release is unlikely to find lots of bugs there again. So we go hunting in another area of the product, or through another technique. If there's one lesson, it's this:

Repetition is unlikely to produce equally valuable results. To maintain test effectiveness, change either the software under test or the technique used.

For example, if I ran a boundary value analysis technique against a form in the last release, and that form hasn't changed in this release, I won't use that technique first. I'll attack that particular form with another technique. I may return to the boundary value analysis technique, but I'll do it later, since I'm not likely to find as many or as valuable issues with it.
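
As a rough sketch of how I think about reconciling the two - the scoring formula, the numbers, and the technique names below are all invented, but they show the shape of the trade-off:

    # Hypothetical history per technique: expected bug count, expected worth (1-10),
    # and whether the area it targets actually changed in this release.
    techniques = [
      { :name => "boundary values on the order form", :bugs => 12, :worth => 3, :changed => false },
      { :name => "stress the import pipeline",        :bugs => 5,  :worth => 9, :changed => true  },
      { :name => "exploratory pass on the new skin",  :bugs => 8,  :worth => 6, :changed => true  },
    ]

    # Score = quantity x worth, discounted heavily when nothing changed since last release.
    scored = techniques.map do |t|
      discount = t[:changed] ? 1.0 : 0.25
      t.merge(:score => t[:bugs] * t[:worth] * discount)
    end

    # Run the highest-scoring techniques first.
    scored.sort_by { |t| -t[:score] }.each { |t| puts "#{t[:score].round}  #{t[:name]}" }
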

Don't solve the same problems with the same methods every time. Vary your problems and vary your methods, and you'll find things you never expected.

Tuesday, January 20, 2009

Chicken Little Meets Risk Analysis

No matter how hard we try, we ultimately are exposed to risk. We're writing software, and at least in some way this project/system/release is not like anything that's ever been done before. There are so many, many variables - people, code, languages, features and the understanding of those features, holidays, snow or other inclement weather, sick days, heck, even building power! - that ultimately boil down to "what can we ship when and how well will it work?"

So what do we do?

The basics of risk analysis have been written about extensively (there's even a whole society for this), and it comes down to:
  • identifying possible threats
  • determining the likelihood of each threat occurring
  • quantifying the impact of each threat
  • managing your risk - determine what can be done to minimize the likelihood and/or impact of each identified threat (there's a quick sketch of the arithmetic after this list)
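
In textbook form, the middle steps are just a little arithmetic. Here's a sketch with made-up threats and made-up numbers, purely to show the shape of the calculation:

    # Hypothetical threats, each with a likelihood (0-1) and an impact in schedule days.
    threats = [
      { :threat => "flu season takes out part of the team", :likelihood => 1.0, :impact_days => 5  },
      { :threat => "new hardware arrives two weeks late",   :likelihood => 0.3, :impact_days => 10 },
    ]

    # Exposure = likelihood x impact; the sum is a crude amount of padding to consider.
    padding = threats.inject(0) { |sum, t| sum + t[:likelihood] * t[:impact_days] }
    puts "Consider padding the schedule by about #{padding} days"  # => about 8.0 days
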
This is great. But... ultimately it comes down to your gut. We'd all like to think that we go into risk assessment meetings that sound like this:
"Our employee unanticipated time off ratio is 10% for this group of resources. So for a team of 10 people, you can guess that 9 of them will be there and working each day."
"Okay, so our threat is employees not being present and working, our likelihood of that threat is approximately 100%, and our impact is 10%. So let's add 10% to our current estimates."

Instead, we often get into risk assessment meetings that sound more like this:
"Well, it's the flu season, so let's go ahead and add 10% or so to the schedule."

We're all just working from our guts. We can prepare by looking at similar situations, but ultimately I don't think any of us has invented a time machine or mastered the art of seeing the future. Some of us are better than others, and over time experience gives us an apparently uncanny ability to make reasonable guesses quickly.

So if we can't see the future, and we don't often have the good data we would like, how do we do risk analysis in a reasonable manner? Of course each project is different, but in general there are a few things that I've seen help:
  • Make this a group problem. Different people bring different perspectives and opinions. Oh, and make sure everyone is an active participant, not just a listener.
  • Don't be afraid to be a little bit Chicken Little.  In general there will be a lot of optimists in the room, so adding a dash of pessimism can provide some balance.
  • Don't be too Chicken Little. The point of risk analysis is risk mitigation - acknowledging the risk and then doing something to prevent or minimize it. If all you do is scare everyone, then you'll never get to the mitigation step.
  • Figure out what risks don't change and learn to handle those. Although each project has different risks, some (like our employee sick day example) are the same across multiple projects. For those kinds of risks, invest in measurement of actual occurrence so that you can refine your mitigation strategy over time. That previous sentence, by the way, was mostly a fancy way to say, "learn from your mistakes".
While acknowledging that risk is not something that can be fully contained, it's incumbent on us as software professionals - developers, testers, managers - to identify and mitigate risk. We have to at least try to meet our features and our dates, and reducing risks is one of the tools in our arsenal. 

Friday, January 16, 2009

Fun Docs

Okay, so I'm on a documentation kick. I'll stop eventually.

Writing documentation is only the first part of preserving how/what/why/when/etc. The second part is using and reading the documentation. Making sure it's available to everyone  matters, indexing it well so people know what's there helps, and pointing people to this great resource you've created is essential. But once the document is in front of the person, you still have a hurdle to overcome: are they reading? Or are they snoring?

After all, which document would you rather read?

The Official Version
=================================
Ensure the following fields are completed when logging a defect:
  • Summary: This field should contain a one-sentence identification of the perceived issue. Use the active voice and describe the actual problem instead of the perceived solution.
  • Component: Select the affected areas of the product. Consider the root cause analysis that has been done as well as the initial presentation of the issue.
  • Assignee: Select the person currently responsible for triage for the primary affected component.
  • Description: Identify the perceived problem, the expected behavior, and the individual steps required to reproduce the issue. If this includes a requirement that the behavior be produced on certain environments, be sure to describe the relevant portions of the environment and note the areas that are irrelevant.
Etc...
===============================

The "We're All Friends Here" Version
===============================
When you log a bug, there are some required fields:
  • Summary: One sentence to capture the essence of what's wrong. This is likely to become the bug's nickname, so make it good.
  • Component: The team that gets the issue, at least at first. Setting more than one component is okay, just explain why in the description.
  • Assignee: The lucky winner of the "you got a new bug" prize. This is the person on triage for the team you think will have to fix something. You can only pick one.
  • Description: Explain the problem, what behavior you expect, and how to reproduce the issue, paying particular attention to times and machines (which will point to appropriate parts of the logs). Well-written descriptions will be rewarded with more prompt attention.
Etc....
===============================

Neither of them is the most interesting document ever. But the second one is more fun to read, and that will get you a lot more readers and a lot more comprehension. Completeness matters, but pomposity doesn't. Don't be afraid to be a bit more casual, and a lot more direct. Make your document the embodiment of a conversation rather than The Definitive Documentation(tm). We are, after all, engineers here, not lawyers. So have fun with it!

Thursday, January 15, 2009

Multiple Audience Documentation

Due in part to my own memory (or lack thereof), I write a fair amount of documentation. And I make my team write a lot of documentation (so I don't forget what they tell me!).  Usually the things we write are fairly quick and easy:
  • workarounds to bugs
  • bug descriptions
  • quick guides for how to run scripts or configure tests
We're working on switching defect tracking systems (the old one no longer really meets our needs well), and needed to document that. What's interesting here is we have two audiences:
  • New users (new employees, possibly customers, etc)
  • People who used the old system (developers, support, etc)
One audience needs a fair amount of information. New users need to know what an "environment" is, or how we determine priority, etc. It's not an overwhelming amount of information, but it's not short. The other audience - existing system users - doesn't need all the same information. They already know how to set priority, for example. Instead, they need to know what's changing. What's different about the old and the new system?

The obvious answer here is to write standard documentation: here's how to create a bug; here's how to resolve a bug. But... that's a turn off for your existing users.

It's insulting to be told something you already know. (After all, you're basically treating them like they're more ignorant than they actually are.) And what your documentation is doing is telling your existing users what they already know.

So how do you handle this?

The easiest way that I've found is to go on and do the documentation for your new users. Then add a section that describes the changes. This can be done effectively as a separate section ("Experienced users start here!") or as sidebars in each section. Think quick hits for your savvy audience, and then make them easy to find.

The point here is that you need to accommodate both audiences explicitly, not just write for the people who need the most information. It will make all of your users feel welcome, and that will make your documentation a lot more effective.

Wednesday, January 14, 2009

Hard Pairs

Let's start by positing that we're doing some form of pairing. Further, let's posit that we're pairing for more than just writing code. So, then, we should be able to pair on about anything.... right?

I don't actually know if that's the case. I do know that I find some things a lot harder to pair on than others. I'm good with pairing when we're: doing an exploratory test session (especially if I'm not the one driving); or teaching/learning something concrete, like how to install the software; or doing general coding. I have a really hard time pairing in some other situations, though.

This is a bit long, but bear with me through this example. We have been working on some code to interact with Jira. This is a classic thing you pair on - we had some base code and we needed to add the ability to set the "affects version" when the ticket was created. But this particular thing was extremely frustrating. It should have been a really simple problem to solve but the documentation is sketchy at best and it just wouldn't work. We wound up spending a good amount of time with Google, staring at Data::Dumper outputs for similar data structures (especially Jira's "components"), and ultimately just poking at the darn thing. Eventually it worked, but it was very hard and very frustrating. Oh, and it took both of us to solve it, so in the end it's probably good we were pairing. I won't bore you with another example, but there are a few other times when I've been working in a pair with someone and just found it more frustration than I would expect.


So what makes some pairs so hard?

I know in part what is NOT the problem:
  • The people I'm pairing with. I can pair with one person and do great, and pair with that same person another time and have it be a hair-tearing experience.
  • The setup. Again, this one is the same across pairs, even when some are easy and some are hard.
There are only two things that I have figured out that make pairing consistently hard.

Documentation
When I'm writing the first draft of a document, I can't do it while pairing. The words simply won't come, the structure is all kinds of skewed, and there is just no cohesion. This has to do with the writing flow state, I think. I simply can't type as fast as I can think, so the words in my head are a good 5-10 seconds ahead of what the other person can possibly hope to see. Stopping to let the other person (and my fingers) catch up just breaks the train of thought. And we wind up with a mishmash of short thoughts that just doesn't hang together. Those short thoughts represent the only times we were truly thinking together, and it's in spurts.

I should note that revising documentation is a whole different story. That really easily benefits from multiple eyes.

Breadcrumbs
In the example I described above, one of the things that was the most frustrating is that we were doing a lot of what I'll loosely call research. There was a lot of Googling and following threads, tickets, documentation, APIs, and then trying something based on the slim leads we found. The trouble was that there were a million ways to go with this, each legitimate. And when I do something like this on my own, I'm moving fast. (I'm a pretty good data sifter.) I'm following breadcrumbs of solutions or partial solutions or failed solutions, attacking the problem from different angles, and constructing a representation of the path to the solution in my head and revising it constantly between my code and the various resources I found. Having to take someone along with that thought process is very difficult.

This is exacerbated, I think, by the fact that when I'm in this mode it's because we're working on a problem that's harder than I expected it to be, so I'm already a little frustrated ("I should be able to solve this already!").

I don't know how to make these pairs easier. Practice, in part, I think. Learning how other people solve these kinds of problems might help. Slowing down and walking other people through the techniques I use is probably a useful step.

Definitely more thought required here...

Tuesday, January 13, 2009

Not My Job

I'm a QA Lead. My job description involves things like hiring and maintaining a team of testers to meet the company needs, acting as an information source for the team that determines whether and how to ship releases, providing estimates for test work involved in everything from new features to performance improvements to new hardware platforms, etc.

I have several QA engineers working with me. Their job descriptions include performing exploratory tests, analyzing the output of automated nightly test runs, creating tools and utilities to generate data and analyze logs, etc.

That leaves a vast array of things that are not our job.

My first question is, "who cares?". We work in a small company. We're here because we want to do exciting and interesting things. We're not here to make fiefdoms. And that means that job definitions are a bit incomplete. If it needs to be done, and you know how to do it, then it's your job. Congratulations!

I don't have a lot of patience with people who declare that something is not their job.  If you have the time to make that statement, then you have the time to say something more helpful, like, "I'll do it" or "I don't know how to do that. Hey, Fred, can you help?".

So quit telling people what isn't your job, and start saying and doing something useful.

Monday, January 12, 2009

Leave It Alone

I tend to think of testing as a very active thing: I am doing something to the system. Maybe I'm installing it, maybe I'm writing data to it, maybe I'm logging in to it, who knows... the point is that I'm doing something.

This misses an entire class of tests.... tests in which I do nothing.

Most systems, including the one I work on, are not ever really idle. Sure, the end user may be doing nothing at all, but the system is still doing things internally. It's doing integrity checks, heartbeat is running, it's rebalancing data, it's logging system information, etc. Your system probably does something else, so substitute your own background goings-on here.

So what happens to my system when I just leave it alone for a day? A week?

Things that I look for here (there's a small monitoring sketch after the list):
  • Memory usage. You find a lot of slow memory leaks here because you can't dismiss the increased memory utilization as a consequence of system usage.
  • Space usage. Like memory, but for disk space.
  • Transition to idle. As the system goes idle - winds down from the last thing you did - what does it do? Do queues go to zero? Does memory utilization stop? Does the system start throwing errors because it's depending on utilization for monitoring?
  • Flushing. There are often queues and reserved space, and actions that are triggered when you have "2% used space" or "100 items in the queue". What happens when you stop adding new things? Do those processes still work on the last few items, or does the system fail to handle its last requests?
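
For the first two, even a dumb little sampler goes a long way. Here's a rough sketch - the commands are Linux-flavored, and the interval, log name, and fields are arbitrary choices, so adjust to taste:

    #!/usr/bin/env ruby
    # idle_watch.rb - sample memory and disk usage on an otherwise idle system.
    # A sketch only: paths, interval, and output format are assumptions.

    LOG = "idle_watch.log"

    loop do
      mem  = `free -m | grep Mem`.split   # ["Mem:", total, used, free, ...] in MB
      disk = `df -P / | tail -1`.split    # [filesystem, blocks, used, available, use%, mount]

      File.open(LOG, "a") do |f|
        f.puts "#{Time.now}\tmem_used_mb=#{mem[2]}\tdisk_used=#{disk[4]}"
      end

      sleep 300  # every five minutes; let it run for a day or a week, then graph the log
    end
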
Sometimes when you're not using a system, you really are doing a test. Go ahead, leave your system alone for a bit.

Friday, January 9, 2009

Everyday Heroics

I come home from work most days and talk with my husband or my friends about how my day was. You know, the typical, "how was your day, honey?" kind of thing. Regardless of whether I'm the one asking or the one answering, the response follows a pattern - typically a story or a few stories about funny incidents, or things that "I just couldn't believe it!".

And there's a theme....

You are the hero of your own story.

Most of the time, you'll tell stories in which you're the hero:
- A bug you found that would have caused huge problems in the field. (phew!)
- That innovative duct tape sling to get the fan at the right angle so the servers stopped overheating. (very MacGyver of you)
- Your first board presentation, even though that's a job that usually goes to people at a much higher pay grade. (wear a nice outfit and know when to stop talking)
- The witty comment you made at lunch. (okay, so it was a slow day)

And there's a further theme... none of these heroics are large. We're not talking burning buildings and damsels in distress. We're talking everyday situations in which you did the right thing, or maybe went just a bit above and beyond. These are everyday heroics. These are the things that we can do every day to make ourselves, our product, our teammates better.

So when you go to work today, think about the story you want to tell tonight. What do you want today's everyday heroics to be?

Thursday, January 8, 2009

Double Check

The systems we work with every day are multi-machine systems - up to 40 machines in one system. We have several of these around, and keep some of them running for weeks or months. In particular, we have one set of systems that basically always stays up and keeps getting upgraded as we move through releases. And we have a list of machines that are reserved for that purpose.

We went looking through the machines that are reserved for that purpose today and found several machines that we thought we were using.... and we weren't actually using. This is actually good for us, because we thought we were out of machines. But boy does it pay to go check what your lists are telling you! Just like we double check bug fixes, we also need to double check resources we think we're using.
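
These days I try to make that check cheap enough that we actually do it. A rough sketch - the file name and the "is anyone actually on it?" heuristic are invented, and yours will differ:

    # reserved.txt: one hostname per line - the machines we believe are reserved and in use.
    # Ask each machine who's actually logged in; silence is suspicious.
    File.readlines("reserved.txt").each do |line|
      host = line.strip
      next if host.empty?
      sessions = `ssh #{host} who 2>/dev/null`.split("\n")
      status = sessions.empty? ? "nobody logged in - really in use?" : "in use (#{sessions.length} sessions)"
      puts "#{host}: #{status}"
    end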

Today's lesson:
Lists, even updated lists, aren't always accurate. Test those, too!

Wednesday, January 7, 2009

Parsing Questions

Today I asked the following question:

"What is the outcome you're looking for with this bug?"

Let's parse this for a bit.

How the Question Was Asked
Notice that the question wasn't about what needed to change. It wasn't about what the fix was. It was about the customer's expectations.

Ask the question in such a way that the user gets to choose from a variety of answers. Sometimes an explanation or a document can solve the issue as well or better than code. Other times, the user really does want a fix.

Outcome
I specifically asked what the outcome will be. Occasionally, a resolution (aka a stop to the bug) is something that will be really hard to figure out. Sometimes the customer just wants to complain. Venting and complaining are fine, but a bug is not a good place to do that. There needs to eventually be some resolution, even if it's just, "I hear you. We're not going to do anything about it right now, but I understand that it bothers you." (This is typically the resolution type "won't fix").

You're Looking For
It doesn't matter what the recipient of the bug wants or needs. It matters what the customer wants or needs. It sounds straightforward, but often a bug will trigger the memory or idea of another bug or a feature request. A customer may log a bug that they can't download files of certain names and you discover on the way to looking at it that all files over size X fail to download; those are two different bugs. This is just a quick reminder that you're talking about your customer's request, not your request.

With this bug
It's easier to call it a bug or an issue. Don't get hung up on parsing whether it's a bug or not; that's just a rat hole. It's something that you're working on. Call it "fred" if it makes you feel better!


Asking the question is great, but how you ask the question can have an effect on the response you get. So think first, then ask.

Tuesday, January 6, 2009

Selenium Saga - Part III

This is the last in the series of posts loosely called "So you wanna run Selenium Grid and a Ruby app using VMWare...". I'm putting this all in one post, but I've linked to the older ones for reference.

This is how you do it:
  1. Create Selenium tests in the manner of your choosing. How you get there is not important, as long as it responds to something like rake test:selenium.
  2. Go configure your machines, including the remote control and the hub.
  3. Set up your rake task to run the selenium test against each browser.
  4. Set up a runner script to update your code and call the task.
Create Selenium Tests
I'll leave this as an exercise to the reader. I use the selenium gem.

Configure Machines
There are a number of machines to configure: the test runner, the VM server, and the VM clients. Note that these instructions are for a Linux machine.
Configure the Test Runner
The test runner machine runs the Selenium Hub, the script that controls the overall grid run, and the project code.
  1. Designate a user that will run the grid scripts. This user is going to need privileges on other machines, so choose accordingly.
  2. Install Java (be sure it's the JDK). Confirm your installation by opening a terminal and ensuring that typing "java -version" gives you version information.
  3. Install Apache Ant (including setting JAVA_HOME).  Confirm your installation by opening a terminal and ensuring that typing "ant" gives you an error about missing build.xml and no other errors.
  4. Install Selenium Grid. Unzip it and place it in a logical location (/usr/local/bin is where I put it). Be sure that your user has execute privileges on this location. In these instructions, we'll call this ${GRID_HOME}
  5. To be sure everything is running correctly, open up a terminal, cd to ${GRID_HOME} and type: ant sanity-check. You should see text indicating that the build was successful.
  6. Generate an ssh key for your user. Make sure the key has an empty passphrase.
  7. Add the following line to your user's .bashrc so that the key is added when a base session is started: ssh-add ~/.ssh/id_rsa
  8. Create a folder to hold your grid artifacts. This can be anywhere on the system as long as the user has rwx privileges on this folder. I generally put it in the user's home directory (e.g., /home/user/grid). We'll call this ${ARTIFACTS}.
  9. Place the following things inside ${ARTIFACTS}:
  • run.rb
  • A folder called "code"
  • controls.yml
Configure the Remote Control Server
  1. Install the ssh public key for your user from the test runner. Do this by scping the id_rsa.pub key to the machine, and catting it to authorized_keys in that user's home directory.
  2. Ensure that you have permission to run vmware-vim-cmd (you may need sudo for this)
  3. Configure as many clients as you think you will need. Each client needs at least 512MB RAM (and preferably 1 GB) and at least 15GB of disk space beyond what the OS needs.
Configure the Remote Control Client
This should be done on each remote control client individually. Note that I'm assuming a Windows client.
  1. Install the OS and ensure that it is NOT a member of a domain.
  2. Install Java (be sure it's the JDK). Confirm your installation by opening a terminal and ensuring that typing "java -version" gives you version information.
  3. Install Apache Ant (including setting JAVA_HOME). Confirm your installation by opening a terminal and ensuring that typing "ant" gives you an error about missing build.xml and no other errors.
  4. Install Selenium Grid. Unzip it and place it in a logical location (c:\ is where I put it). Be sure that your user has execute privileges on this location. In these instructions, we'll call this ${GRID_HOME}
  5. Create a batch file that cds to the ${GRID_HOME} and runs:
    ant -Dhost=foo.example.com -Dport=5556 -Denvironment=*firefox -DhubUrl=http://myhub.example.com:4444/ launch-remote-control
  6. Put that batch file in a folder on your computer (I put it in c:\logonscripts)
  7. Share the folder containing your batch file as NETLOGON (using Properties->Sharing... on the folder). Name matters here; you must use NETLOGON.
  8. Set up automatic script running during logon
    • Go to Start -> Run and type “gpedit.msc”
    • In the Group Policy Editor that opens, select Computer Configuration -> Administrative Templates -> System -> Logon
    • Double click on “Always wait for the network at computer startup and logon” and choose “Enabled”
    • Double click on “Run these Programs at user logon” and select Enabled.
    • In the same “Run these Programs…” dialog, click the “Show…” button, then “Add…”. Put in the full path to the batch file.
  9. Set up Windows to automatically log the user in on boot. Be sure the user that is logged in is the one that runs the batch file at logon. The total effect here is that when you start the machine it will automatically log in and start the Selenium Remote Control (cool, huh?). I use TweakUI to do this but there are several ways, so choose the one you like best.
Set Up Rake Task
Set up the rake task like you did in my earlier post. This will get you 90% of the way there.

For the last 10%, if you're using physical machines for your Remote Controls, well, you're going to have to solve that one yourself (hope you have controllable power!). If you're using VMWare (or Xen or the like), then just power the appropriate machine on from the back end. Here's my trick to do it.
  1. Create a file called controls.yml (remember this? We put it in our ${ARTIFACTS} directory.) For each VM, it should say which browsers are present, give the name of the VM (as it's called by the VM software), and give the platform it's running; this is the platform you will feed the rake task. This example is a Windows machine with Firefox 2 and IE 7:
    one:
      vmname: foo
      OS: win
      Firefox 2: true
      Firefox 3: false
      IE 6: false
      IE 7: true
  2. Add this method to your rake task and call it to start your remote control. Note that this is for VMWare; your syntax will vary for other products.
    # The rake task needs YAML to read controls.yml.
    require 'yaml'

    def startRC(platform, browser)
      vm_hash = YAML.load_file("controls.yml")
      # Walk the list of VMs looking for one with the right OS and browser
      vm_hash.each { |key, value|
        if value["OS"] == platform && value[browser] == true
          # We found a machine that fits our criteria
          cmdBase = "ssh user@vm_server sudo vmware-vim-cmd -U user -P password"
          # Look up the VM's id by its name (strip the trailing newline)
          vmId = %x[#{cmdBase} vmsvc/getallvms | grep #{value["vmname"]} | awk '{print $1}'].strip
          isOn = %x[#{cmdBase} vmsvc/power.getstate #{vmId} | grep Powered]  # get the power state
          if /on/ =~ isOn
            # Already powered on - in use by someone else, so keep looking
          else
            # Start it and give it a bit to come up and log in to the hub
            system("#{cmdBase} vmsvc/power.on #{vmId}")
            sleep(90)
            break
          end
        end
      }
    end
  3. Reverse the method to stop the VM when you're done with it (see the sketch below).
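    Since the reverse is mostly mechanical, here's roughly what it might look like - same assumptions as startRC above (the ssh user, the password, the controls.yml format), plus the assumption that vmsvc/power.off is the counterpart to power.on. Treat it as a sketch, not gospel:

    def stopRC(platform, browser)
      vm_hash = YAML.load_file("controls.yml")
      vm_hash.each { |key, value|
        if value["OS"] == platform && value[browser] == true
          cmdBase = "ssh user@vm_server sudo vmware-vim-cmd -U user -P password"
          vmId = %x[#{cmdBase} vmsvc/getallvms | grep #{value["vmname"]} | awk '{print $1}'].strip
          isOn = %x[#{cmdBase} vmsvc/power.getstate #{vmId} | grep Powered]
          if /on/ =~ isOn
            # Power it back off now that the run is done
            system("#{cmdBase} vmsvc/power.off #{vmId}")
            break
          end
        end
      }
    end
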
Set up Runner Script
Update your code, start the server, restart the hub, and then call the rake task. This is going to be quite specific to your OS, source code repository, etc., so I'll leave this bit for you. 
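
Just to show the shape of it, here's a very rough sketch of such a runner - the checkout command, paths, hub location, and the way the rake task takes its platform are all assumptions about my setup, so substitute your own pieces:

    #!/usr/bin/env ruby
    # run.rb - rough runner sketch; every path and command here is an assumption.

    GRID_HOME = "/usr/local/bin/selenium-grid"  # wherever you unzipped the grid
    CODE      = "/home/user/grid/code"          # the "code" folder under ${ARTIFACTS}

    # 1. Update the code under test.
    system("svn update #{CODE}") or abort("code update failed")

    # 2. Start the application under test - completely app-specific, so this is
    #    only a placeholder.
    # system("cd #{CODE} && ./script/server -d")

    # 3. (Re)start the hub so it comes up with a clean list of remote controls.
    #    This assumes any previous hub has already been shut down.
    system("cd #{GRID_HOME} && nohup ant launch-hub > hub.log 2>&1 &")
    sleep(30)  # give the hub a moment to start listening

    # 4. Kick off the grid run once per platform we care about.
    ["win"].each do |platform|   # add more platforms as you configure them
      system("cd #{CODE} && rake test:selenium PLATFORM=#{platform}")
    end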

Gotchas
There are about a million gotchas here - this isn't a small project to undertake. Some of the biggies are:
  • Make sure your SSH keys are on all the machines; there's a lot of ssh-ing back and forth.
  • If you're sudoing, make sure it does not prompt for a password.
  • Be careful of running this in parallel. It looks easy - just add "fork" - but there are problems particularly around changing test data used in multiple tests.
  • The Selenium Hub basically can't recover from Remote Controls hanging or otherwise "going away". So if you have a test run that fails you're going to want to restart the hub and all your remote controls before continuing.

All of this was culled from documentation, blog posts, forums, discussion threads, you name it; I'd like to send a big thank you to the internet at large for helping me get all of this going. 

And good luck with your project!


ETA: By the way, if you need help or have updates, please feel free to comment or contact me privately. I don't promise to know the answers, but maybe we can work through it together.

Monday, January 5, 2009

Leap Second

Look what I found in the kernel log of a Linux system:

Dec 31 18:59:59 portal-01 kernel: Clock: inserting leap second 23:59:60 UTC

I knew we added a leap second this year, but I didn't know it would get recorded as such.

That's awesome!

Friday, January 2, 2009

Bug Bash

Sometimes, your bug list gets away from you. Maybe you're at the end of a rough sprint and you've got a lot of things that are... well, close to working. Or maybe dev's been in bug fixing mode and there are a lot of bugs to verify. Or maybe there has been a lot of .... umm... feedback from the field, all of which came in as bugs. However you got there, you're behind on your bugs.

Time for a bug bash.

What's a bug bash? Well, it's pretty simple. We the team - and it's always the team, not a person - have a lot of bugs to get through. And nothing major is wrong; there's just a lot of little cleanup work to be done. So we're all going to put aside that major feature we're working on, or that background task we're doing, and we're all of us going to work on bugs.

Forget, just for a moment, your defect tracking system. Forget your task list. Forget your backlog.

Grab a bug, and fix it. Repeat.

This is a timeboxed activity. You're going to have a two hour bug bash, or a one day bug bash, no more than that. And you won't sort, you won't think, you won't categorize. Pick up a bug, do your thing (fix, verify, whatever), and put it down.

That's a bug bash.

To do this, you'll need:
  • a runner. This is the person filling your backlog, usually a project manager or other admin type. This is the person with a pool of bugs. You stick your hand up for another one, and the runner hands you the next thing. Any prioritization is done by the runner, and is entirely outside the concern of the team working on the issues.
  • your team. No matter how much people are working on, everyone gets to work on the bug bash. You guys are the ones doing the work. No guessing on priority, no choosing areas that are good to work on, no guessing whether this really is important. Just take what you're given and fix it.
It helps if you order in lunch, too. Being together - preferably in one room - is important. It keeps you honest: no one can slip off for a meeting, or go grab lunch, or just do "one little thing" on a feature. You're all together, and you're doing bugs. It helps, too, if you don't look at the defect tracking system summary page. Just ask for a number and look at that bug only. It's less distracting that way.

Technically this can apply to any discrete task, but it works really well for bugs, since they tend to accumulate.

Good luck!