David, from 37signals (who I have nothing but respect for), takes me to task for saying people need 99.999% uptime.

Problem is I said nothing of the sort. I simply ranted about companies that have hit popularity and don’t have a damned clue how to keep their systems up once they’re there.

Some people have commented that “adding an extra server might push the problem away for 3-6 months.” But we’re talking about services that are doubling every 4-10 weeks. The cost of maintaining that kind of patchwork growth can go from 5K/month to 250K/month very, very quickly.
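To put rough numbers on “very, very quickly”, here is a minimal sketch. It assumes cost tracks load one-for-one, which is a simplification for illustration rather than anything measured, and the helper name `weeks_to_grow` is made up for this example.

```python
import math

def weeks_to_grow(start_cost, end_cost, doubling_weeks):
    """Weeks for a monthly cost to grow from start_cost to end_cost,
    assuming it doubles every doubling_weeks (hypothetical model)."""
    doublings = math.log2(end_cost / start_cost)
    return doublings * doubling_weeks

# 5K/month to 250K/month is a 50x increase, about 5.6 doublings.
fast = weeks_to_grow(5_000, 250_000, doubling_weeks=4)
slow = weeks_to_grow(5_000, 250_000, doubling_weeks=10)
print(f"between {fast:.0f} and {slow:.0f} weeks")
```

Under that assumption the jump takes somewhere between roughly 23 and 56 weeks, so an extra server that buys 3-6 months covers only a slice of it.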

Which is why designing for scale is so important. I don’t believe any startup “needs” to achieve anything more than around two 9s of uptime (99%), which is what a properly configured server should do for you. However, even at the beginning, you need to be coding and planning for growth.
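For a sense of what a given number of 9s allows, the arithmetic can be sketched like this. The function name is hypothetical and a 365-day year is assumed.

```python
HOURS_PER_YEAR = 365 * 24  # assumes a non-leap year

def downtime_hours_per_year(nines):
    """Yearly downtime budget for an availability of `nines` 9s,
    e.g. nines=2 means 99%, nines=5 means 99.999%."""
    unavailability = 10 ** (-nines)
    return unavailability * HOURS_PER_YEAR

for n in (2, 3, 4, 5):
    print(f"{n} nines: {downtime_hours_per_year(n):.2f} hours/year")
```

Two 9s works out to about 87.6 hours of downtime a year, while five 9s leaves a bit over five minutes.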

Small things like managing how transactions occur, having separate database connections for reading and writing, and making your app able to handle variable-state sessions are key.
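As one illustration of the separate-connections idea, here is a minimal sketch using Python’s built-in `sqlite3`. The `SplitConnections` class and its method names are hypothetical, and both connections point at the same file here only so the example runs on its own. The assumption is that, in a real deployment, the read side could later be pointed at a replica without touching application code.

```python
import os
import sqlite3
import tempfile

class SplitConnections:
    """Hypothetical helper: one connection for writes, one for reads."""

    def __init__(self, write_dsn, read_dsn):
        self.writer = sqlite3.connect(write_dsn)
        self.reader = sqlite3.connect(read_dsn)

    def execute_write(self, sql, params=()):
        # The connection context manager commits on success
        # and rolls back if the statement raises.
        with self.writer:
            self.writer.execute(sql, params)

    def execute_read(self, sql, params=()):
        return self.reader.execute(sql, params).fetchall()

    def close(self):
        self.reader.close()
        self.writer.close()

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "app.db")
    # Same file for both sides in this demo; a replica is the assumed end goal.
    db = SplitConnections(write_dsn=path, read_dsn=path)
    db.execute_write("CREATE TABLE users (name TEXT)")
    db.execute_write("INSERT INTO users (name) VALUES (?)", ("alice",))
    rows = db.execute_read("SELECT name FROM users")
    db.close()

print(rows)
```

The point of the split is that every call site already declares whether it reads or writes, so moving reads elsewhere later is a configuration change rather than a rewrite.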

One of the cores of my post was the “ladder to high availability”. Let me repeat it here:

Backups, Redundant, Failover, Cluster, Distributed, Grid and finally Mesh

Your first step should be backups. That way if something goes down, you can have it up again in 20-200 minutes. For most businesses, that’s “acceptable” downtime, as long as it doesn’t happen a lot. Redundant should mean that you can bring a new box up in 5-10 minutes. This obviously means you need a second box, though, which doubles the cost of your original setup.

Failover tends to be more expensive, as it often requires a way of replicating not only data but also static files in near-realtime. But your downtime per incident with this is in the seconds range. Clustering drops that downtime further (while adding cost by a factor of 1-3), distribution makes the cost sky high but gets you into that four-ish 9s of availability, and grid, mesh and other advanced tools will put you over the top.
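The first few rungs can be turned into rough availability figures. This is a sketch under loud assumptions: the per-incident downtimes are the worst-case numbers quoted above, “seconds range” is taken to mean one minute, and four incidents a year is an invented figure purely for illustration.

```python
MINUTES_PER_YEAR = 365 * 24 * 60

# Worst-case minutes of downtime per incident for each rung.
# "failover" at 1 minute is an assumed reading of "seconds range".
RUNGS = {"backups": 200, "redundant": 10, "failover": 1}

def availability(minutes_per_incident, incidents_per_year):
    """Fraction of the year the service is up (hypothetical model)."""
    downtime = minutes_per_incident * incidents_per_year
    return 1 - downtime / MINUTES_PER_YEAR

INCIDENTS_PER_YEAR = 4  # assumed, not from any real data

for rung, minutes in RUNGS.items():
    print(f"{rung}: {availability(minutes, INCIDENTS_PER_YEAR):.4%}")
```

With those inputs, backups alone already clear two 9s comfortably, and each rung up the ladder buys roughly another 9 per incident. How often incidents happen matters as much as how fast you recover.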

At no point did I say startups needed to even have backups in place. I simply said if your business requires you to be up, you’d damned well better be up. And if you know from the start of your business that its survival requires you to be up, you’d damned well better be planning for it from the start. Otherwise it will not only be more expensive to add this technology later, it’ll also sideline you for weeks at a time while you redo that technology, bring up test systems, run them in parallel and then do switchovers.

Run your company however you see best, but be forewarned that if your business relies on availability, and you weren’t able to deliver it because you didn’t think about it or plan for it … well, your users might just have a nice little revolt.