Wednesday, November 26, 2008

Smarter Different, Not Smarter More

One of the standard questions I get asked by candidates is:

Why do you like to work here?

This is the candidate equivalent of "tell me your greatest weakness." Hey, both sides of an interview get their cliches!

And just like any well-prepared candidate, I'm prepared for that stock question. The answer is pretty much always:

Because I really like working with people who are smarter than I am.

Sounds pretty good, right? Flattering to the candidate (you, too, could be in such a smart group!), a cliche answer to a cliche question (keeps the candidate comfortable), and is an honest answer (I have a lot to learn from my coworkers).

Except it's wrong.

"Smarter" doesn't really mean anything. And I don't honestly want to be the dumbest person in the room (talk about a blow to the ego!). What I really enjoy is working around people who can teach me something. I also enjoy working around people to whom I can provide insight and information. I'm happiest if it's a two way street - we're all learning!

What I really want is to work around people who know different things than I do. And then I want us to teach each other.

Yeah, that's what I like about my job.

Tuesday, November 25, 2008

Tricky Time

I was looking at logs yesterday and noticed something odd in the syslog (I've edited this to make the problem more obvious):

Nov 18 12:03:16 portal-02 postfix/master[454]: 
Nov 18 12:03:20 portal-02 kernel: nfs
Nov 18 17:03:27 portal-02 apphbd[2763]: WARN: 
Nov 18 17:03:27 portal-02 apphbd[2763]: WARN: 
Nov 18 12:03:53 portal-02 kernel: nfs:
Nov 18 12:03:54 portal-02 kernel: nfs
Nov 18 12:04:17 portal-02 postfix/master[454]:
Nov 18 12:04:17 portal-02 postfix/master[454]:
Nov 18 17:04:26 portal-02 stunnel[24095]:
Nov 18 17:04:26 portal-02 stunnel[24122]:
Nov 18 17:04:26 portal-02 stunnel[24095]:
Nov 18 17:04:27 portal-02 stunnel[24122]:
Nov 18 17:04:27 portal-02 stunnel[24125]:
Nov 18 12:05:18 portal-02 postfix/cleanup[10163]:
Nov 18 12:05:18 portal-02 postfix/cleanup[10163]:

See it?

The time is "jumping around". It starts at noon-ish, and there are some entries at 5pm, and then some more at noon-ish. Very weird.

It took a good couple hours to track this down. And I should note that this is a Debian Linux syslog.... have you figured it out yet?

.
.
.
.
.
Hint time:
There are two things you need to know:
  • That timestamp is in the local time of the process that logged the event.
  • Processes set their time zone (their local time) when they start.
.
.
.
.
Got it yet?

The time zone (/etc/localtime) was changed after boot. Any process that was restarted - apphbd and stunnel, in our example - picked up the new time zone. Any process that wasn't restarted stayed in the old time zone.

Seems simple once you know what's going on!
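The underlying behavior is easy to demonstrate. Here's a minimal Python sketch (the instant and zone names are arbitrary picks of mine, not from our actual logs): each process renders timestamps in whatever local time it picked up at startup, so two "processes" with different zones log the same instant differently.

```python
import os
import time

EPOCH = 1227024196  # an arbitrary instant in November 2008

# Simulate a process that started before the zone change...
os.environ["TZ"] = "US/Eastern"
time.tzset()  # re-read TZ, as a freshly started process would
before = time.strftime("%b %d %H:%M:%S", time.localtime(EPOCH))

# ...and one that started after it.
os.environ["TZ"] = "UTC"
time.tzset()
after = time.strftime("%b %d %H:%M:%S", time.localtime(EPOCH))

# Same instant, two different syslog-style timestamps.
print(before)
print(after)
```

Note that time.tzset() is Unix-only, and a long-running daemon that never restarts just keeps formatting timestamps in the zone it started with - which is exactly why postfix and the kernel stayed at noon-ish while the restarted processes jumped to 5pm.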

Monday, November 24, 2008

Reconciling Plan Length

I - and many of my friends in software - work in fairly short cycles. Two weeks is the most common, really. Whether they call it SCRUM, Agile, XP, or "what we do", the process works basically the same:
  • An ordered list of things to do is provided to development
  • Development estimates the work
  • Dev and the customer (or customer proxy) draw a line at two weeks and that's what dev signs up for
This works pretty well in the short term. However, translating this to a longer work schedule is more difficult. With this process, how do you know whether you're on track for a deliverable that's 5 months away?

Okay, let's start with the process zealots: Yes, your customer should be committed to the process you're using, and should be working with the two week cycles.

And now in the real world.... we have more than one customer, and these customers have requirements and cycles that are larger than the two week development cycle. Ultimately, they need to plan development and rollout cycles in months or even years. 

I haven't actually solved this problem; I don't know how to reconcile the need to have a feature in three months and rolled out in six months with the two week planning.

Things we've tried, with various degrees of success:
  1. Stick it at "about the right place" in the backlog based on velocity and adjust a bit as it gets closer. This one often winds up starting the work a bit too late, thanks to an overoptimistic idea of how long it will take.
  2. Create an earlier task to estimate the item, then stick it in the right place based on velocity and estimates. This works a bit better but still suffers from optimistic estimates.
  3. Put the item early in the backlog so there's plenty of time. In practice this falls apart if you have more than one client or more than one thing going on.
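For what it's worth, the arithmetic behind option 1 is simple enough to sketch. This is illustrative Python only (the function name and the numbers are mine, not from any real planning tool): given the points queued ahead of an item and the team's velocity, you can guess which sprint it lands in - and see how badly optimistic estimates move that guess.

```python
def sprints_until_done(points_ahead, item_points, velocity_per_sprint):
    """Rough sprint count until an item ships: everything queued
    ahead of it plus the item itself, divided by velocity,
    rounded up (a partial sprint still costs a whole sprint)."""
    total = points_ahead + item_points
    return -(-total // velocity_per_sprint)  # ceiling division

# 120 points queued ahead of a 15-point feature at 20 points/sprint:
print(sprints_until_done(120, 15, 20))   # 7 sprints, about 14 weeks

# The same feature with everything underestimated by ~25%:
print(sprints_until_done(150, 19, 20))   # 9 sprints - a month later
```

That one-month slip from a 25% estimation miss is the whole problem with option 1 in a nutshell.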
What else have you tried to reconcile your development cycle with a client's longer-term plans?


Friday, November 21, 2008

You're Doing It Wrong

The last refuge of the process zealot is "it's not working because you're doing it wrong".

I've read several blogs recently, and I've been at - or have friends I trust at - several companies, and they all keep coming back to the same complaint: "It's not working!" And the answer from proponents of the process (XP and SCRUM mostly) is "Well, you're doing it wrong."

That's not helpful.

If someone's "doing it wrong", tell them that. And then tell them what they're doing wrong and help them fix it. Otherwise, you're just part of the problem.

Thursday, November 20, 2008

Where You Write It

I've written before about things that are "just known". This is what we call institutional memory, but the problem is, institutional memory is transient. People come, people leave, people forget.

One of the ways to get around the problem of things that are simply "known" is to write them down. That's not the only trick, though. There are lots of places to write things down:
  • Email: This one really doesn't work very well. It requires either the sender or the recipient to still be there later and remember that the information is available.
  • Document: This is better than email, but make sure you store it somewhere central. Keeping it on your laptop has the same problem as email - others can't get to it.
  • Document on a Central Server: This is accessible, but not the most easily searchable.
  • Document in a CMS or Wiki: This is generally the easiest to update and to search. However, formatting isn't the best.
In the end, as long as it's accessible to the entire team, pretty much any method works. Just make sure it's consistent.

Friday, November 14, 2008

Not a Mind Reader

Late in a release cycle, there's an inevitable conversation:

"You mean you just found THAT?! That's been wrong for a long time! Oh, we've GOTTA have that one fixed."

Okay.

There are two things that really grate about that statement:
  1. Age of bug does not correlate with defect priority
  2. If you've known about it, why isn't it in the defect tracking system?
Bug Aging and Priority
A bug's age has nothing at all to do with its priority. A really old bug can still be low priority; conversely, a brand new bug can be critical. Bugs can change priority over time, but that's not about age specifically. Instead, a bug changes priority as the usage patterns and features around the issue change.

For example, if you have a bug in your Active Directory integration and you're selling into UNIX shops, that bug is probably low priority. When your sales team starts landing major customers who have mostly Windows environments, that bug might become higher priority. Why? Because now you're more likely to hit it in the field.

If You Knew....
There are a lot of different groups that might define a particular behavior as a bug - sales, support, development, QA. Just because one group finds a bug doesn't mean another group has any clue that the defect exists. Even if the behavior is known, one group might assume that behavior is correct while another group considers that behavior absolutely ludicrous.

Enter the defect tracking system. This system is not the exclusive domain of QA, or even of development and QA. Here's the amazing thing: anyone can enter a bug! So if support feels that something is a bug, they should enter it. Same goes for sales, development, QA, anyone. From there the bug can go into a standard triage process. But if the bug never gets in, it's not going to be fixed. 

I am not a mind reader.


So if you'd like a bug fixed, great. All it takes is two simple steps:
  1. Log the bug
  2. Explain why you think it's high (or changed) priority.
If you don't do those two things, don't expect other people to automatically know there's a problem and fix it. Be an active part of the process; your results will improve immensely.

Thursday, November 13, 2008

It Hurts When I Do This

In test there are good days, when everything does pretty much what you expect it to. And there are bad days, when it seems like no matter what you do - even something you did just fine yesterday - it just doesn't work.

On those days, being in test is kind of like being that guy who goes to the doctor and says, "Doctor! It hurts when I do this!"... all day long.

It's a recipe for frustration by noon.

So calm down. Grab logs and whatever diagnostic information you need. Go for a 5 minute break. And then re-baseline your system and start over fresh. You, my friend, have entered a bad state, and continuing to try won't make it better. Oh yeah, and anything you do find in this state is highly likely to be hidden by fixing the issue that got you into the bad state in the first place. If you've already got your root issue captured, you're not helping. You're just repeating the thing that makes you say, "It hurts when I do this."

And don't forget the last part of the joke; it's relevant:

Don't do it any more!

Wednesday, November 12, 2008

"Word Game" Analysis

You know that word game where you change one letter at a time to turn a word into another word? Each middle step must also be a word, and you can only change one thing at a time.

Like this:

HATE
HAVE
HOVE
LOVE

Turns hate into love (yeah, I know, it's a cliche, but it was an easy one).

When you're trying to track down a problem and your initial analysis is getting you nowhere, it's time to start eliminating variables. To do this in a systematic manner, try the word game. How do you get from their config to your config? One change at a time... oh, and each one must be valid (a "word" in our analogy).

For example, dev and QA were seeing very different results on the same test. So how do we get from dev's environment to QA's? Let's lay out our "word" from dev to QA:

Version   Fullness   Type        Data           Encoding   Size
HEAD      0%         from code   10 GB file A   M          4 node
.
.
.
.
.
4.2       75%        from CD     35 GB file B   E          16 node

Now we have a map. Instead of randomly trying configurations, we're going to walk from dev to QA, changing one thing at a time and seeing what the results of our tests are. It's not about trying every configuration. It's about giving yourself a structure so you can home in on a solution in as few steps as possible. It's a way to look for progress and identify areas that are likely to cause a problem.

Let's look at our "word" again, this time with test results:

Version   Fullness   Type        Data           Encoding   Size      Result
HEAD      0%         from code   10 GB file A   M          4 node    15
4.2       0%         from code   10 GB file A   M          4 node    15
4.2       0%         from code   10 GB file A   E          7 node    13
4.2       0%         from code   35 GB file B   E          7 node    12.5
4.2       0%         from CD     35 GB file B   E          7 node    8
4.2       75%        from CD     35 GB file B   E          7 node    7.2
4.2       75%        from CD     35 GB file B   E          16 node   7.2

Looking at our output, we now start to get an idea of what areas are actually making a difference in performance. In our example, it turned out that installing from CD gave you a bit of a different configuration than running from code. Fix the CD installer (the code was correct) and the results for CD-based configurations started matching the code-based configurations.

Use this word game technique when you're facing a lot of variables and no real indications as to what is important and what is not. You'll never have enough time to run all the tests you can think of, so get some structure and start narrowing down the list of possible problem variables.
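You can even mechanize the walk. A small Python sketch (walk_configs, the sample configs, and the fake test function are all hypothetical, just to show the shape of the technique): start from one config, flip one variable at a time toward the other, and record the result at each step.

```python
def walk_configs(start, target, run_test):
    """Walk from start to target one variable at a time,
    running the test after each change. Returns a list of
    (config, result) pairs so you can see where results shift."""
    config = dict(start)
    results = [(dict(config), run_test(config))]
    for key, value in target.items():
        if config[key] != value:
            config[key] = value  # change exactly one thing
            results.append((dict(config), run_test(config)))
    return results

# Hypothetical example: the result shifts when the install type changes.
dev = {"version": "HEAD", "install": "from code", "size": "4 node"}
qa = {"version": "4.2", "install": "from CD", "size": "16 node"}
fake_test = lambda c: 8 if c["install"] == "from CD" else 15
for config, result in walk_configs(dev, qa, fake_test):
    print(config, result)
```

In a real investigation run_test would kick off an actual test run, and the variable ordering matters - put the cheapest changes first so the early "letters" of the word are quick to check.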




* For the record, the results I used in the example are made up. The example itself is real, though.


Tuesday, November 11, 2008

Slow Down and Follow Up

I talked yesterday about giving too much status, but do be careful about the inverse problem - too little status. In particular, when someone helps you start something, it's important to let them know when you've finished it.

For example, today I was working on bringing up a system and it just wouldn't work. The servers kept spitting errors at me instead of doing what I wanted.

So I asked for help. One of the server guys came over and very politely looked for a minute or two and pointed out that - due to my previous messing with the system - I had managed to get mismatched versions on there. He told me how to fix it and left.

(Fast forward about 45 minutes)

I got the system up, having made the recommended changes. Awesome! I'm done here, right?



Nope.

I wrote a note to the guy who helped me. Just two lines: "Hey, that fix worked. Thanks for the pointer." And now I'm done.

The important part here is that I followed up. It took 30 seconds of my time and about 10 seconds of his (to read it), and now no one's wondering what happened, or if the problem was fixed, or if I was too busy to even notice that someone took time out of his day to help me. In two sentences, we've resolved all doubts: yes, it worked; yes, I noticed and appreciate the help.

It's small, but it makes life around the office a bit more friendly.

Monday, November 10, 2008

Status! Now!

For a long time I was a member of the "overcommunicate" school. Basically, the theory was that if there was any doubt, better to say something.
  • On the trail of a bug that is likely to block a release? Say something, even if you haven't quite got it pinned down yet.
  • Running a bunch of tests for a high profile client issue? Say something, even partway through.
  • Got a boss who is constantly getting asked about state? Give him state near constantly.

In general this works pretty well. Your audience - typically developers and (for high profile issues) your boss and/or support and/or the release team - feels like they're not missing anything.

I'm starting to realize that may not be the best course of action, though. There are some serious downsides to always saying something:
  • The signal to noise ratio can get out of whack. If your updates aren't substantive, then they'll start to get ignored, and woe on you when you have something really important to say.
  • Going on vacation is a pain. Most people don't think to update others as often as you do, so disappointment with the coverage while you're gone is inevitable.
  • It takes time. You can be working or providing status, but not both simultaneously.

My new working theory is to set a time when I'll say something, and then provide updates at that frequency unless something truly major comes up. So, for that hot client issue we'll update with test status once a day; two updates in a day means that something major (hopefully a huge fix) has happened.

How do y'all balance communicating enough with saying little enough to give your communications weight?

Friday, November 7, 2008

Small Tricks

Ironically, since coming to work for a storage company I've thought more about efficiency of data storage than I think I ever have. First let me admit: I'm a bit of a pack rat when it comes to electronic data. Deleting just isn't my thing. And I'm not the only one!

So, we have a number of things on tiered storage. For example:
  • test logs are stored on the machine that runs the test for 5 days, then deleted (gasp!)
  • logs for tests that failed are stored on a network server (think NAS), then backed up to archive storage (our own product)
  • logs for issues that have happened at clients are stored on a network server, then backed up to archive storage
  • generated test data, syslogs, and other non-test artifacts are stored on a network server, then deleted
In many of these cases, we have scripts that actually do the work - monitor the fullness of the file systems and then back things up. Basically, they check every half hour to see if the primary store is more than 90% full. If it is, we email the group as a notification and then start cleaning it up.

When we originally wrote the cleanup script, we wrote it to loop through and clean things out until it got below 90% full. As a result we were getting notified multiple times a day. It would clean to just below 90% and then as soon as someone wrote to it, the file system would go back over 90%, we'd get notified and the whole thing would start over.

Here's the small trick:

Notify at 90%. Clean to 80%.

We changed the script and notifications dropped from two or three times a day to once a week or so. That's a lot less email.
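The fix is classic hysteresis - two thresholds instead of one. A toy Python simulation (the thresholds match our script, but the "5% freed per pass" is a made-up number for illustration) shows why the gap matters:

```python
def clean(usage, notify_at=0.90, clean_to=0.80, freed_per_pass=0.05):
    """One monitoring cycle: if usage has crossed the notify
    threshold, email once, then keep deleting batches until we're
    below clean_to, leaving a buffer before the next email fires."""
    passes = 0
    if usage < notify_at:
        return usage, passes  # nothing to do, no email
    while usage >= clean_to:
        usage -= freed_per_pass  # delete a batch of old logs
        passes += 1
    return usage, passes

# New behavior: one notification, then clean well clear of the line.
usage, passes = clean(0.92)
print(round(usage, 2), passes)  # ends around 0.77 after 3 passes

# Old behavior: clean_to equal to notify_at, so we stop at ~0.87
# and the next day's logs push us straight back over 0.90.
usage, passes = clean(0.92, clean_to=0.90)
print(round(usage, 2), passes)
```

The 10% gap between the thresholds is what buys you the week of quiet between notifications.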

Small change, big effect.

What small change can you make today that just might have a big effect?

Thursday, November 6, 2008

Too Big to Be a Bug

In the vast desert of "it should work this way but it currently doesn't", you have bugs and you have future features. There are a lot of different ways to distinguish between a bug and a feature, including by whether the user would expect it to work or see it as new, or by whether implementation has been attempted. One of them, though, is a bit unusual to me...

It's not unheard of for a developer to put this statement in a bug:

"Bug XXX is too big to be a bug. Please write a story for it."

Wait, what?

This is just yet another way to distinguish between a bug and a feature. What that statement really means is:

"The amount of work required to achieve the desired behavior is large enough that I would like to get credit for it and have it tracked. So please put it in the story process."

You see, like many XP shops, our development group tracks time spent working on stories fairly closely. It's part of velocity calculations, it's easily visible to our customers, and it's discussed explicitly quite often. Bugs aren't. They're the "extra" work that's done in the background. Basically, bugs are second class citizens.

Now, bugs found in the initial implementation of a feature are one thing; they're noticed because they prevent story acceptance and therefore get discussed. It's regressions introduced by refactoring, or bugs that are missed in initial testing, or bugs in legacy code (that predates the story process), or bugs that fall sort of between stories (and are found in more general testing) that get treated as "extras."

Bug fixes are as valuable as stories, and should be tracked as closely as stories.

The developer who wants the bug turned into a story instinctively understands this. He's asking for credit for the work he's going to do on the bug. I think he's perfectly right, as well. Fixing bugs is development work and as such should count.

Now, the reality is that I don't particularly like turning bugs into stories. After all, how long something takes is orthogonal to what it actually is (bug or feature, for example). So instead I propose that we start putting bugs in the story queue. That's right. If a bug matters, then put it in the backlog. It's more important than some features, less important than others, and it's development work that needs to be done. Sounds like the definition of a backlog item (to borrow a SCRUM term) to me.

How do y'all handle bugs in a process that rewards development work but doesn't always surface the time spent fixing problems?

Wednesday, November 5, 2008

Did I Find a Bug?

I originally wrote yesterday's post with a certain title:

Doesn't Work

I happened to check it today and noticed that the title bar of my browser says:

Doesn't Work

But the title of the post in the blog itself is:

Doesn't Work

Whoopsie! So, did I find a bug? Let's look at the arguments:

No way! Not a bug!
  • Obviously, what happened is that the blog software stripped illegal content from my title.
  • Who writes a title like that anyway? We've got a classic case of "Users would never do that!" (tm).
Totally a Bug!
  • There's an inconsistency between the title bar and the title in the page itself. They should at least be consistent.
  • There were no warnings or errors that the title was going to be changed when I hit save. Not telling the user alone is a bug, regardless of whether it should have changed.

I think this one is pretty clearly a bug, mostly because of the lack of user feedback that a change is being made. It's probably not an important bug, though.

Would you log it?



Update 11/5 12:40 pm:

It gets even weirder. Here's what I wrote:


And here's what got published.


Wacky.


Tuesday, November 4, 2008

Doesn't Work

This may be the worst way I've heard to start a conversation about an issue you think you've found:

"Feature X doesn't work"

Huh? This phrasing is:
  1. Pretty unlikely. For all we tease developers sometimes, it's pretty darn rare for a feature to not work at all under any circumstances.
  2. Antagonistic. Congratulations. You've basically accused the implementors of totally screwing up, quite possibly on purpose.
  3. Really hard to do anything about. What exactly is actionable about that statement? What are you expecting the person to do?
So before you go running around being imprecise and pissing people off, stop and think. Make sure you're:
  1. Being polite.
  2. Being precise about what you did and what you saw.
  3. Expressing a desired action, whether it's a fix, some help tracking the issue down, or just a sounding board for a rant.
If you're talking to someone about an issue, please be careful. You'll have much better luck getting the attention you want if you approach the problem in a way that encourages people to help you. 

Monday, November 3, 2008

Toys

Engineering motivators can be a bit difficult. After all, sometimes you want to motivate people to do things (refactor code, create new features, pair program, etc). Sometimes you want to motivate people to not do things (break the build, write lots of bugs, check in without running some code first, etc). In a team environment there's the fun twist of motivating individuals and the entire team.

Like many dev organizations, we motivate with recognition.... in the form of toys.

Build Status
For the build, we have an ambient orb:


I think this one is quite common. Red means the last build failed; green means the last build succeeded; purple means it's currently building. This one is all about the team: it sits in the middle of the room and glows on all of us. After all, we all have to get that build fixed.

Serious Breakage
The second one is for the person who breaks the build, breaks the lab, or otherwise seriously compromises development's ability to work:

(Ours is a little different, since one of the figures has no head. We did stick a slip of paper with a smiley face in the neck hole, though!)

This one sits on the culprit's desk, large and truly ugly. Invariably someone who doesn't work in dev will ask, "what is that?", and whoever has it gets the added joy of explaining why this figurine is on his desk.

Bug Hunter
Sometimes you find a real doozy of a bug that totally takes down your system. To that finder goes the Fubar:
(Yes, this is a real tool - how awesome is Stanley Tools?)

This goes to the person who discovers the deep and subtle yet really nasty bug.  I should note that it's not always a QA Engineer who has this award; developers and support can also find major issues. And in pretty much every case, it's much better to have found it in dev than in the field!

Code Shearing
For the discerning developer, we have the code shearing award:

Yup, those would be the Bolt Cutters of Deletion. To get this one, you have to find a chunk of code that isn't being used (or shouldn't be used), refactor, and delete the junk. On a large multi-year code base like ours, knowing when to delete is worthy of recognition!

What awards do you have around the office?