Thursday, December 15, 2011

Dependency and Redundancy

My software is, of course, perfect. (HA! but go with it). The trouble is that the software I write uses other software; I have external dependencies. For example, Textaurant uses Heroku for hosting, Twilio for text messaging, New Relic for monitoring, and Airbrake for error aggregation.

There are several good reasons to have external dependencies like these, including:
  • Development speed. Using providers for pieces of the solution lets us focus on the core value we provide rather than on the commodity pieces everyone needs.
  • Better reliability. People who specialize in monitoring, for example (like New Relic), are going to be a lot better at monitoring than someone who doesn't think about this all the time.
But... there is a big downside.

When your external provider goes away, you're in big trouble. If Heroku goes down, my app is unavailable. If Twilio goes away, I won't be sending any text messages. It doesn't matter that the outage is on the service provider end - to my customers, it's just the application they use not doing what it's supposed to. And that's my problem.

So we have dependencies, which are really useful and also introduce risk. What can we do about it?


Let's take a simple example. My hosting provider had an outage on December 3rd that took down our application. What could have prevented us from being unavailable, even as Heroku was having a No Good Very Bad Day? We could:
  1. Add a second hosting provider, writing into a shared database (or db cluster). Properly load balanced, the hosting provider who was not affected could simply have taken over all hosting duties.
  2. Fail over to a secondary hosting provider as soon as we realize we're down, again using a shared database or database cluster.
  3. Use local data storage in the browser to allow users to keep working. It wouldn't provide full functionality, but it would have given us 85%+ of our features, which is a lot better than simply being down.
There are two common themes running through these options: redundancy and cost. We can increase redundancy.... as long as we're willing to pay for it. How far you go toward ensuring redundancy is tempered by how much time and money you're willing to spend. In the end, it's up to you and to your particular needs. Just consider your dependencies... before they go down and make you consider them in a panic!
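To make option 2 a little more concrete, here is a minimal sketch in Python of the "notice we're down, then switch" decision. It assumes each hosting provider exposes a health-check URL; the URLs and the names http_health_check and choose_host are made up for illustration and are not anything Textaurant actually runs.

```python
from typing import Callable, Optional, Sequence
import urllib.error
import urllib.request


def http_health_check(url: str, timeout: float = 2.0) -> bool:
    """Return True if the URL answers with a 2xx status within the timeout."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return 200 <= response.status < 300
    except (urllib.error.URLError, OSError, ValueError):
        # Connection refused, DNS failure, timeout, HTTP error status,
        # or a malformed URL: all of these count as "down" here.
        return False


def choose_host(
    hosts: Sequence[str],
    is_healthy: Callable[[str], bool] = http_health_check,
) -> Optional[str]:
    """Return the first healthy host in priority order, or None if all are down."""
    for host in hosts:
        if is_healthy(host):
            return host
    return None


if __name__ == "__main__":
    # Hypothetical health-check URLs, listed primary first.
    hosts = [
        "https://primary.example.com/health",
        "https://secondary.example.com/health",
    ]
    active = choose_host(hosts)
    print(active if active else "all providers are down")
```

This only decides where traffic should go. Actually moving it (a DNS change, a load balancer setting) and keeping the shared database reachable from both providers are the expensive parts, and they are not covered here.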


  1. Catherine - big business may be able to afford secondary and tertiary systems of redundancy, but small companies usually cannot. We hope our external service providers are taking those steps on our behalf, perhaps for a few more $$ a month. But even the big customers are missing the point by blindly throwing redundancy into the platform and calling it good.

    The truth is, end-users will tolerate some amount of outage - particularly if the service is free, or deemed a convenience (non-vital). We should ask our customers - "what is your expectation of service availability?" Is it BLINDLY 100%? (The world isn't perfect, as you state.) Let's not forget the qualitative requirement for availability. Sometimes we miss out on determining the end-user perception and impact of such an outage only because we never bother to ask a real customer. It's amazing to me how often we forget that the majority of our customers are human beings who also can read, write, think and respond - they're not just blips of data in the analytic logs.

    Another idea to consider is taking a video camera (or just Qik on your iPhone) to a real customer location and doing some live interviews of real customers using the Textaurant service - ask them real questions and capture their real answers. Make it a part of your promotional campaign. Give away a t-shirt.

    People love free t-shirts. :)

  2. More valuable stuff Catherine -- and, as you often do, you point out a gaping hole at the product management level that trickles down to the project, dev, test, support, & analyst level.

    At the end of the day, there's a risk-based business decision that needs to be made & (in my experience) even when that decision *is* made, the follow-on decision is forgotten, which is "regardless of how much I'm going to spend to minimize this risk (a.k.a. risk mitigation), how do we want to handle the situation where [bad thing] happens anyway?"

    This is a risk control question.

    In the situation you present, I suggest finding *some* way to inform users of what is going on -- whether that be a custom error page (hosted elsewhere) triggered when the site is unavailable or displaying error X, an automagic email to users, or something wickedly more complex and expensive.

    Turns out, informed users are (almost) *always* less grumpy than uninformed ones.