Ian penned a nice little follow-up to this scalability / reliability / availability / uptime issue: What is 99% Uptime Anyway?

In it he gives a brief view of what the different “9s” of uptime mean:

Uptime     Time lost in a year
98%        7.3 days
99%        3.7 days
99.9%      8.8 hours
99.99%     53 minutes
99.999%    5 minutes
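For anyone who wants to check the arithmetic, here is a minimal Python sketch that derives the time lost from an uptime percentage. It assumes a 365-day year, and the function name `downtime_hours_per_year` is mine, not Ian's; the figures in the table are rounded versions of what it prints.

```python
HOURS_PER_YEAR = 365 * 24  # assumption: a non-leap year, 8760 hours


def downtime_hours_per_year(uptime_percent: float) -> float:
    """Return the hours of downtime per year allowed by an uptime percentage."""
    return (100 - uptime_percent) / 100 * HOURS_PER_YEAR


for pct in (98, 99, 99.9, 99.99, 99.999):
    hours = downtime_hours_per_year(pct)
    # Show the same loss in days, hours, and minutes for easy comparison.
    print(f"{pct}%: {hours / 24:.2f} days = {hours:.2f} hours = {hours * 60:.1f} minutes")
```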

He goes on to say that “uptime” is a pretty crappy metric: it doesn’t take into account responsiveness, dropped pages, when something was down, and so on. He proposes a new metric for companies, one that I totally agree with:

I think companies should define a metric more along the lines of: the time taken to complete XXXX operation, between the hours of 9AM and 9PM, and then combine these timings into a weighted average, the weights being how important that operation is to your core business.

measure & monitor that. not uptime.
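Here is a rough Python sketch of how a metric like that could be computed. Everything in it is my own assumption rather than Ian's method: the operation names, the weights, the sample timings, and the choice to average each operation's timings first and then weight those averages.

```python
from collections import defaultdict
from datetime import datetime, time


def weighted_response_time(samples, weights, start=time(9), end=time(21)):
    """Weighted average of per-operation mean timings inside a daily window.

    samples: iterable of (operation, timestamp, seconds) tuples
    weights: dict mapping operation -> importance to the core business
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for operation, timestamp, seconds in samples:
        # Only count measurements taken between 9AM and 9PM.
        if start <= timestamp.time() < end:
            totals[operation] += seconds
            counts[operation] += 1

    means = {op: totals[op] / counts[op] for op in totals}
    total_weight = sum(weights[op] for op in means)
    if total_weight == 0:
        raise ValueError("no weighted samples inside the time window")
    return sum(means[op] * weights[op] for op in means) / total_weight


# Hypothetical data: checkout matters three times as much as search.
weights = {"checkout": 3, "search": 1}
samples = [
    ("checkout", datetime(2024, 1, 8, 10, 0), 2.0),
    ("checkout", datetime(2024, 1, 8, 15, 0), 4.0),
    ("search", datetime(2024, 1, 8, 11, 0), 0.5),
    ("search", datetime(2024, 1, 8, 23, 0), 30.0),  # outside the window, ignored
]
score = weighted_response_time(samples, weights)
print(f"weighted response time: {score:.3f} seconds")
```

With a timing-based number like this, lower is better, and a slow but technically "up" site shows up right away.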

If some of the companies I’ve worked with recently had done the above, you can bet they would have scored terribly, even though their “real uptime” would have been fairly high.

After all, Google Analytics was “up” during its recent rush. Servers were responding, pages were being served, stats were being tracked… But it wasn’t very responsive or useful.

Good job, Ian.