A Personal Blog
Technorati's Downtime
Seems like the Technorati team is finally getting a chance to come up for air.
What happened?
At about 9:30PM PST Friday, there was an electrical fire on the power main inside our colocation center, where our entire server infrastructure is housed. This caused our battery backup power supplies to kick in, but the independent power generator at the colo never kicked in – possibly because the problem was a fire inside the building rather than a general power blackout of the neighborhood. Well, the fire was only problem #1. We weren’t expecting or planning for an outage of that kind. It caused a cascade of other problems that made the rest of the weekend a huge PITA. Problem #2 was that we didn’t have a good enough emergency plan in place that would shut our systems down cleanly when power ran out like that. Unfortunately, that meant that when the batteries died, our server farm went down quite ungracefully – causing problem #3, which was data corruption due to the unclean shutdown.
Boy, can I relate. We maintain some fairly critical systems here (we’re about to implement our first systems which deal directly with patient care, instead of just being backend reporting).
Uptime, clean failovers and quick recoveries are things we do here (hey Dave, need an extra body? heh). The problem is that everyone wants huge uptime, but nobody wants to pay for it.
What is the cost?
That’s always a good question. A general approximation is “whatever you were planning for, times 2.5”. What does that allow for?
Hardware: Typically when you’re looking for high availability you’re looking at 3 key components in your main server farm: critical system failover / redundancy, active / passive clustering (to fail over to) and system management (like graceful shutdowns). You may want to use the active / passive cluster for your DB and then a “cluster” (software) of app servers on the frontend, as well as a DB warehouse way back.
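Since graceful shutdowns are exactly what bit Technorati, here’s a rough sketch of the kind of decision that system management piece has to make while you’re running on battery. Everything in it is made up for illustration: the tier names, the shutdown order, the timings and the safety margin are my assumptions, not anyone’s real setup or any UPS vendor’s API.

```python
# Hypothetical tiers and timings, for illustration only.
# Assumed order: stop the frontends before the data they depend on.
SHUTDOWN_ORDER = ["app", "db", "storage"]
SHUTDOWN_SECONDS = {"app": 60, "db": 300, "storage": 120}


def time_needed(order=SHUTDOWN_ORDER, timings=SHUTDOWN_SECONDS):
    """Total seconds to shut every tier down cleanly, one tier after another."""
    return sum(timings[tier] for tier in order)


def should_start_shutdown(on_battery, runtime_left_s, margin_s=120):
    """Start shutting down once battery runtime no longer covers a clean
    shutdown plus a safety margin. Never triggers while on mains power."""
    if not on_battery:
        return False
    return runtime_left_s <= time_needed() + margin_s
```

The point is just that the batteries only help if something is watching them and pulls the trigger while there’s still enough runtime left to finish cleanly.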
Storage: Speaking of warehousing… Storage is always one of the biggest issues. High availability typically means SAN of some kind. You could go with a NAS appliance as well, but you really need non-local storage so you can manage that all by itself. So you’ll definitely need primary online storage, and possibly some nearline SATA type storage as well, in case of reference failures or whatever which don’t require you to go to your backup system.
Backups: Gah, backups are a nightmare these days. Seriously, in an environment this big you almost forget about backups and simply use snapshots to your nearline storage, because backing up 15TB of data just isn’t easy or fast enough and you can’t tie up any system (even if you’re backing up from the passive clusters). Then again, you need to evaluate if you even need to do “backups”. If you have a main SAN, nearline snapshots and a backup infrastructure, what would arise that would require them? Always an interesting question and one for people far smarter than me to answer.
Backup System: This is always the fun bit, and is ultimately where that “.5” comes from. This is a non-redundant mirror of your main system. Say you had a pair of warehousing servers, 25 database servers, 10 app servers and 3 interface servers in your main infrastructure. Your backup system would have a single warehouse server, a dozen DB servers, half a dozen app servers and a pair of interface boxes (unless you want to match them 1:1, in which case you can get 3). You also will only have an active cluster here. This truly is your “emergency” system. It’s offsite. It’s mirrored daily, hourly, incrementally, whatever. There are a thousand ways to get this mirror happening. It has a mini-SAN. This is not meant to bear the load for more than a few days, and performance will ultimately suffer while you’re doing it. You MAY want a backup solution (or nearline storage) in here purely in case you have an issue while your main system is down. Maybe.
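To put rough numbers on that “.5”, here’s the example above as a quick bit of Python. It only counts boxes, which is my simplification: real cost depends on a lot more than server counts (storage, licensing, the offsite facility), so treat the ratio as a ballpark and nothing more.

```python
# Server counts taken from the example in this post.
main_farm = {"warehouse": 2, "db": 25, "app": 10, "interface": 3}
backup_farm = {"warehouse": 1, "db": 12, "app": 6, "interface": 2}


def total_servers(farm):
    """Number of boxes in a farm, ignoring how much each one costs."""
    return sum(farm.values())


def backup_ratio(main, backup):
    """Backup farm size as a fraction of the main farm, by box count."""
    return total_servers(backup) / total_servers(main)


print(total_servers(main_farm))    # 40
print(total_servers(backup_farm))  # 21
print(backup_ratio(main_farm, backup_farm))
```

That’s 21 boxes against 40, so a touch over half the size of the main farm.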
What does all this mean for Technorati? Not a whole hell of a lot. But, it’s good to remember what types of systems and issues you’re looking at when you are designing “higher” availability systems. I’m not a high availability systems expert. This is purely “in my humble experience” type of stuff. Ultimately you don’t simply throw an extra handful of servers at a solution and expect it to stay up a significant amount of time more. You need to architect the solution from the ground up.
So, if I had one piece of advice to Dave and the Technorati crew it’d be simple: don’t react to this. Let it be a catalyst, but make sure you spell out your business needs (and funds available) first and foremost and then design a solution around that. Be prepared to have consultant costs and be prepared to have training costs (these often get forgotten). And make sure your human resources are prepared to carry the burden both of the transition to any new infrastructure and of the maintenance of that infrastructure. A key error many companies make is building out their infrastructure without allowing for more human resources.
Good luck to the team, and hopefully they get a chance to sleep soon enough.
This entry was posted by Jeremy Wright on September 27, 2004 at 7:50 am, and is filed under Blogging, IT Thoughts, Work.
