UF Postings Past: Five 9s You Probably Do Not Need
November 11, 2007 – 11:34 pm
Brad Feld talked about a problem quite a while back that sparked my interest. He asked whether there was a better way to describe “acceptable downtime” than the five 9s, that is, 99.999% uptime. I saved the whole post to my flash drive and kept the question with me, intending to come back to the problem. I have struggled with this myself on innumerable web projects over the years, and I know there’s a better way.
Let’s begin by considering why uptime is so important. Users love reliability in computers. If their PC crashes, they get upset. If it crashes too often, they’ll change operating systems, buy tuning products, hire service techs, even buy whole new computers. If local applications crash, they will get rid of them and buy new ones. The same thing applies to web applications and websites, even free ones. If it is unreliable, users throw it away. The logic is that if the service is worthwhile, someone else more reliable will provide it (and they’re right). Why on earth do users dispose of websites, computer programs, and even whole computers so easily over reliability? The reason is simple: computers are replacement tools. There is little that end users do with computers that they cannot do some other way. I can write by hand, do spreadsheets by hand, databases by hand (it’s called a filing cabinet), get music by hand (CDs), get video by hand (television), mail by hand (snail mail), talk to people by hand (telephone, face to face, etc.), research by hand (books, libraries). Computers, for the most part, replace other processes because they are faster and easier. If the computer is not as reliable as the process it replaces, though, then as a user I might just go back to the tools I had before. As a more experienced user, if something is not reliable, I just wait for something that is, because I know by now someone else will take the idea, run with it, and build something more reliable. Unreliability is a major way to lose a customer or user.
This illustrates my first point: uptime in and of itself does not matter. Being up when the user wants the service to be up does matter. So I ask this question: In today’s era of web analytics, why do we believe that our systems need perfect uptime? Using web analytics, I can see how many customers are on my site at each time of day. I know that if there are five user sessions between 1 am and 2 am, I have five customers in that time. I should treat these as five separate customers whether or not they are unique users because a repeat customer is, quite frankly, as good as a new customer. Anyone who uses my service five separate times in an hour loves it enough that they represent not only themselves as a customer but also the future customers they will recommend my service to and bring me. This is especially true in a startup world where you are trying to build market share. I also do not want to base the measurement on page hits instead of sessions. If one customer is a heavy user, I kind of hate to make him mad, but on the other hand, a heavy user is a dedicated user. He is more likely to wait for the system to come back online. Better to annoy one dedicated user who clicks a lot than fifty customers who were just checking out the site. Unique customer experience sessions are the key to measuring customer service.
Using this knowledge of my users’ average habits over time, I can count sessions and plot them on a 24-hour graph (or by week, or month, or whatever is best for your business model). Now that I know this, I can base my uptime expectations on that chart. Armed with this, I can expect my uptime to cover a specific percentage of user sessions, such as “The system must be available for 99% of average user sessions in a given day”. It doesn’t matter nearly as much if my servers are down for seven hours if I know that less than 1% of all user traffic logs on during those hours. I care a lot more about five minutes of downtime during an hour where 78% of my website’s traffic occurs. Using this kind of logic gives real meaning to my uptime planning. It also is very practical in a worldwide economy. I know people who think it’s okay for their website to be down at 2 am. It is if your customers are in the same time zone, but what if you have a sizeable foreign userbase? For that matter, what if your userbase happens to be a lot of night owls? You need to know when your site is getting hit, and be up during those times.
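To make the arithmetic concrete, here is a rough Python sketch of the session-weighted idea. Everything in it is an assumption for illustration: the function name, the hourly session counts (made up so that seven overnight hours carry under 1% of traffic and one hour carries about 78%), and the simplification that sessions are spread evenly within each hour.

```python
from typing import Sequence


def session_weighted_availability(
    sessions_per_hour: Sequence[float],
    downtime_minutes: Sequence[float],
) -> float:
    """Fraction of average daily sessions that land while the site is up.

    Assumes sessions are spread evenly within each hour, so an outage of
    d minutes in an hour loses d/60 of that hour's sessions.
    """
    if len(sessions_per_hour) != len(downtime_minutes):
        raise ValueError("both sequences must cover the same hours")
    total = sum(sessions_per_hour)
    if total == 0:
        return 1.0  # no traffic at all, so nobody was affected
    lost = sum(s * (d / 60.0) for s, d in zip(sessions_per_hour, downtime_minutes))
    return 1.0 - lost / total


# Hypothetical average day: 1 session per hour from midnight to 7 am,
# 780 sessions in the noon hour, 13 per hour the rest of the day.
hypothetical_sessions = [1] * 7 + [13] * 5 + [780] + [13] * 11

# Scenario A: down for the whole seven overnight hours.
overnight_outage = [60] * 7 + [0] * 17
# Scenario B: down for five minutes during the peak hour.
peak_blip = [0] * 12 + [5] + [0] * 11

availability_a = session_weighted_availability(hypothetical_sessions, overnight_outage)
availability_b = session_weighted_availability(hypothetical_sessions, peak_blip)

print(f"seven hours down overnight: {availability_a:.2%} of sessions served")
print(f"five minutes down at peak:  {availability_b:.2%} of sessions served")
```

With these made-up numbers, the seven-hour overnight outage still serves roughly 99.3% of sessions and meets a 99% target, while the five-minute peak outage serves only about 93.5% and misses it, even though by the clock the second day had far more uptime.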
There is a second thing to consider in calculating ‘uptime’. Quality of Service must be considered as well. All bad customer experiences count against you. To an end user, an unacceptably slow website experience is just as bad as true downtime. I recommend setting a maximum acceptable response time for your website or web application and, if you have the capabilities, monitoring its response times through automated tools. Poor response times should count against your uptime statistics just as true downtime does. If you do not have access to these sorts of tools (and they’re not simple to implement and can be expensive), then stick to pure downtime for your equations.
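As a sketch of how slow responses might be folded into the downtime figures, the Python below treats any monitoring probe that failed or exceeded a response-time threshold as "bad" time. The `Probe` record, the two-second threshold, and the five-minute probe interval are all assumptions made up for this example, not the output of any particular monitoring tool.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional


@dataclass
class Probe:
    """One automated check of the site (hypothetical record format)."""
    hour: int                          # hour of day, 0-23
    response_seconds: Optional[float]  # None means the request failed outright


def bad_minutes_per_hour(
    probes: Iterable[Probe],
    max_response_seconds: float,
    probe_interval_minutes: float,
) -> List[float]:
    """Minutes per hour counted as downtime, treating slow as down.

    Each probe stands in for the interval until the next probe, so one
    bad probe counts as probe_interval_minutes of bad experience.
    """
    minutes = [0.0] * 24
    for probe in probes:
        failed = probe.response_seconds is None
        too_slow = not failed and probe.response_seconds > max_response_seconds
        if failed or too_slow:
            minutes[probe.hour] += probe_interval_minutes
    # An hour cannot be more than 60 minutes bad.
    return [min(m, 60.0) for m in minutes]


# Hypothetical noon hour probed every five minutes:
# nine healthy responses, two slow ones, and one outright failure.
noon_probes = (
    [Probe(12, 0.4)] * 9
    + [Probe(12, 4.2), Probe(12, 6.0)]
    + [Probe(12, None)]
)

bad_minutes = bad_minutes_per_hour(
    noon_probes, max_response_seconds=2.0, probe_interval_minutes=5.0
)
print(f"bad minutes in the noon hour: {bad_minutes[12]}")
```

Here only five minutes were a true outage, but the hour is charged fifteen bad minutes because two more probes were too slow. The per-hour figures can then be weighed against session counts in the same way as plain downtime.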
This system is, of course, not perfect. If you are planning a major marketing push, you must adjust for the increased web traffic and plan your marketing information releases around your server traffic. Don’t announce a major release of new features during peak usage, for example, because the increased traffic may tank your servers. If you plan a major release during a lull in your usage, you need to adjust your chart for predicting usage so that your admins realize uptime will be measured differently while the push is on. You also have no control over some press: say, for example, Digg or Slashdot suddenly tells the world to go look at your neat new service and sends you 10,000 hits in an hour.
This idea may still need some adjustment and tweaking. It’s somewhat of a shot from the hip right after inspiration struck. Please, if you know of ways to refine it, comment. I’d love to see a discussion started that fleshes out what is hopefully a good alternative to the burdensome “five nines” way of doing web business today.