Late at night last weekend, as Hurricane Sandy was beating the crap out of the eastern seaboard, I received an e-mail message from lower Manhattan. You may have received this message, too, or one just like it. It felt to me like getting a radiogram from the sinking Titanic. An Internet company was running out of diesel fuel for its generator and would shortly be dropping off the net. The identity of the company doesn’t matter. What matters is what we can learn from their experience.
The company had weathered power outages before and had four days of diesel fuel stored onsite. They had felt ready for Sandy. But most of their fuel wasn’t at the generator; it was stored in tanks in the building’s basement — a basement that was soon flooded, its transfer pumps destroyed by incoming seawater. It was like a miniature Fukushima Daiichi, not far from Wall Street.
The company felt prepared but wasn’t. There was no way to get fuel from the basement to the generator, so they were staying online as long as possible and then shutting down.
There are obvious lessons here like don’t put the pumps below ground level. Good pumps could draw well enough to be placed above the maximum historical flood stage. But that’s not all that’s wrong with this scenario. Why was the company dependent on a single data center?
It makes little sense for any Internet business to be dependent on a single data center. With server virtualization it is possible to put images of your server here and there to cover almost any failover problem. Not just multiple servers but multiple servers on multiple backbones in multiple cities supported by multiple power companies and backed by multiple generators. We do that even here at I, Cringely and we’re known to be idiots.
Or you could rely on the cloud. The simple idea here is that your application is deployed across tens or hundreds or thousands of server instances in data centers all over the world and even a nuclear attack on Wall Street would have little to no effect on un-irradiated users.
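As a rough illustration of the idea (not how any particular provider does it), here is a minimal client-side sketch in Python; the region URLs are hypothetical, and a real deployment would lean on DNS failover or a global load balancer rather than probing from the client:

```python
# Minimal sketch: the same service deployed in several regions, with a probe
# that falls back to whichever copy still answers. URLs are hypothetical.
import urllib.request

REGIONS = [
    "https://us-east.example.com/health",
    "https://us-west.example.com/health",
    "https://eu-west.example.com/health",
]

def first_healthy(regions, timeout=3):
    """Return the first region whose health endpoint responds with HTTP 200."""
    for url in regions:
        try:
            with urllib.request.urlopen(url, timeout=timeout) as resp:
                if resp.status == 200:
                    return url
        except OSError:
            continue   # that region is dark (flooded basement, dead generator...)
    raise RuntimeError("no healthy region found")
```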
But are you sure about that?
The problem with clouds is that some of them aren’t very cloudy at all. Cloud computing is for some providers more of a marketing term than anything else. What if your cloud is really a single data center in a single city and the pumps in the basement have failed? How cloudy is that?
Last summer Amazon had a major EC2 (cloud services) outage that affected many customers, including Netflix. There are long and convoluted stories about how this outage came to happen, involving fires and generators and load balancers and mishandled database updates — stories that sound a lot to me like the dog ate my homework — but whatever went wrong, it was isolated to a single data center yet somehow took down Amazon’s entire cloud for the U.S. east coast.
That cloud wasn’t a cloud at all but a data center — a single point of failure.
For some companies Sandy was a validation of good network design and emergency planning. For others it was a data disaster. And for the rest it was probably a stroke of luck that they, too, didn’t go down.
Even if nothing happened at your company as Sandy blew through, if you don’t know exactly why you were unscathed, now might be a good time to investigate.
The basics were around in the 1990s: multiple servers in multiple locations across the US, on different carrier networks. Databases, if needed, were massively paralleled with a sync tie line. The way this one worked in the 1990s was multiple DNS records in the routing system, like biggiedata01 at 10.200.200.10, biggiedata02 at 142.98.98.33, and so on. If a request can’t find 01 in NYC, the next request will go to 02 in Atlanta, and so on. Eventually the net gets smart; until then, customers hitting reload try the next location in the list.
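A rough sketch of that client-side failover idea, in Python for illustration only. The hostname here is hypothetical; a real setup of this kind relies on DNS publishing several address records, with clients (or reload-happy users) walking the list:

```python
# Sketch: try every address published for a name until one data center answers.
# Hypothetical hostname; error handling kept to the bare minimum.
import socket

def fetch_with_failover(hostname, port=80, timeout=5):
    """Connect to the first responsive address among those DNS returns."""
    last_error = None
    for *_, sockaddr in socket.getaddrinfo(hostname, port, type=socket.SOCK_STREAM):
        try:
            with socket.create_connection(sockaddr[:2], timeout=timeout) as sock:
                sock.sendall(b"HEAD / HTTP/1.0\r\nHost: " + hostname.encode() + b"\r\n\r\n")
                return sock.recv(1024)      # first location that answers wins
        except OSError as err:
            last_error = err                # that location is dark; try the next one
    raise ConnectionError(f"every address for {hostname} failed") from last_error

# Hypothetical usage: fetch_with_failover("biggiedata.example.com")
```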
But you always have blooming idiots who nod at the fire chief and put the fuel tanks in the basement, along with the power distribution panels, because appealing a fscking idiotic declaration in a fscking flood zone takes several months extra and a bunch of lawyers, all to make the sensible argument of a six-year-old: basements flood.
Another lesson: don’t build a data centre on a flood plain (or earthquake zone, or hurricane zone, etc.) Yes, it is impossible to avoid every imaginable catastrophe, but there are plenty of places much less prone to natural disasters than others. Manhattan is not one of them.
Yeah, but it doesn’t matter how well you plan/legislate/micromanage, something else always comes along to screw things up. Years ago, a friend of mine organised a small rock concert. He hired the bands, PA, catering staff, and a nice venue near Blackfriars bridge in London. It was on the ground floor of a mid-size office complex where, just as we were beginning the set-up, some workmen cut through the mains supply with a mini-digger. All the power went off. And then came on again when the generator cut in. And then went off again half an hour later when the generator failed.
I went down to the plant room to investigate. There was a very large V12 diesel installation, good for at least a megawatt, with elaborate ventilation ducts and a big fan to draw air through a radiator which formed one wall of the room. Everything was working except the fan. The engine was overheating and shutting itself down. It couldn’t be fixed in time, the gig had to be cancelled, and everyone turned away and refunded. My friend had to sell his car and beg his parents for a loan to bail him out.
I think you just gave an example of a case where someone failed to micromanage enough. I have a rarely needed backup generator, yet every six months I check or change the fluids and the startup battery and run it for several minutes to make sure it’s working. If the fan weren’t working it would be immediately obvious.
Every six months? Try a complete failover exercise once per month.
The other factor that makes it reliable is that there is no “automatic fail-over” to fail. When the power goes out, the generator must be started manually and also the transfer switch. Admittedly, this is not suitable for unmanned situations that cannot tolerate any downtime at all. But it’s ideal for most residential and business situations, where power isn’t needed if no one is around.
Reminded me of this:
http://xkcd.com/908/
🙂
How mission critical are most websites anyway? The same thing happens when, for instance, Anonymous takes down the Department of Justice website. Who exactly surfs there, and for what?
If you think of a business website as a robotic 24/7/365 storefront, it comes down to dollars of income per hour of uptime. But if there is a catastrophic flood or, as mentioned, a nuclear strike on Wall Street, maybe you have something else on your mind than keeping your virtual business running. I could imagine a system running amok generating thousands of orders while you’re bailing out your factory with no means of meeting the demand anyway. In that case, if the system goes down, maybe that’s a good thing.
In my opinion the most important services to keep going are radio and telephone: radio because it works off batteries, which you probably have, and telephone because most phones don’t need power to work. I’m old school; I don’t put much stock in cell phones.
Phones do require power, just not necessarily at your home. If the power to the CO goes down and they have the fuel tanks in a flooded basement, you will lose your traditional phone service too.
More and more landline phones these days are VOIP rather than POTS. FiOS customers typically have an 8 hour battery backup in their home for their phone.
I work for a company that has offices in San Francisco and New York. Our office in Manhattan never lost power and we had network connectivity with the New York office the entire time. However, both of these offices use a cloud based IP phone system, and they had a lot of their equipment in one of the data centers that went down. Our phones were still able to dial in and out during the outage, but our voicemail was down (though you couldn’t tell that when you called, the voicemail just went into the ether), and our conference bridges were unusable. Also we couldn’t log into the web portal for the provider or connect to their website that gave status updates.
I was just amazed to hear that people have data centres in lower Manhattan. Space there is expensive, as is power, and access can’t be easy. How does that make any business sense at all?
So while US companies are throwing local staff on the scrap heap by outsourcing jobs overseas, the machines are kept living the high life in some of the country’s most exclusive zip codes. Go figure, as you guys say …
It must’ve been Bob’s automated stock trading server that he leases ultra-close to Wall Street. It enables him to eke out those extra fractions of a cent from his daily billions in stock trades!
Actually, data center space is about the only thing other than a bagel that IS cheaper in Manhattan. Everyone fled Manhattan data centers after 9/11, and for a while people couldn’t build enough raised floors in New Jersey and Pennsylvania to meet the demand. I’m in the financial biz, and if I told clients or prospects that I maintained data center space in Manhattan I would fail every due diligence check done on me.
(As it was, one of my data centers near the Meadowlands almost had flooding issues when a levee failed, and had to fall back to generators only. But yes, I did have another one that had no issues at all.)
That aside, while some things can be replicated “in the cloud”, if your system involves several terabytes of data, a large pile of compute nodes and application servers, and an involved and expensive job control system, then no, it isn’t easy to just put something “in the cloud”.
If you’re building raised floors in your datacenters, you’re hopelessly mired in the last century before you’ve even started.
It ain’t easy building a cloud. We built redundant data centers a few years ago, only to find services dead when one went down because the fail-overs weren’t implemented properly. Took about 3 disasters to get all the bugs out, and we’re a pretty small operation. My point is that even if your cloud provider has redundant sites, unless they’ve been rigorously tested they may not be all that redundant. And testing live systems with paying clients on them can be pretty scary.
So, the only tests were the actual disasters?
Humm . . . has your cloud been audited?
Response A: It’s a competitive world. Disclosure of the information for a useful audit would have far too great a risk of leaking to the competition to be allowed. Similar to disclosure of the materials used in fracking for auditing of safety concerns.
I think that wins over Response B: It’s a competitive world. You can audit my system because I know it’s so bulletproof that you’ll sign right up and pay the higher prices that such quality demands.
Never mind that technical blather, the accounting department selected Joe.singlespace.chn for a lot better price.
Only Response: It’s a cooperative world made rotten by silly, selfish games, i.e. competition.
Even worse, precious lines of mice used in research were drowned. It will take many years to recreate them.
https://www.slate.com/articles/health_and_science/science/2012/11/animals_drowned_in_sandy_nyu_medical_research_is_set_back_years_by_dead.html
I guess companies like Gawker Media have learned their lesson. They were down for a day or two and only now are back up, posting to Tumblr. They must be losing a fair bit of dosh not being able to run ads on their sites.
On a non-cloud note, could Mr. Cringely comment on the nuclear power plant shutdowns due to Sandy? How close to panic stations is losing 4 of the 6 water circulation pumps at the Salem Unit 1 site?
Spreading unrelated items around (e.g. random websites) is no big deal. Spreading the same service amongst multiple centers is considerably more difficult. (If it weren’t, everyone would just do it.) For example, when a write happens, do you wait for all data centres to acknowledge before proceeding? If the answer is yes, then you just added a huge amount of latency. If your answer is no, then you just added a whole bunch more things that can go wrong even when things are fine (e.g. network partitioning).
Here is the computer science behind it: http://en.wikipedia.org/wiki/CAP_theorem
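To make that tradeoff concrete, here is a toy sketch in Python. The Replica class and the latencies are made up and stand in for real database replication, which involves far more machinery:

```python
# Toy model of synchronous vs. asynchronous writes to two data centres.
import threading
import time

class Replica:
    def __init__(self, name, latency_s):
        self.name = name
        self.latency_s = latency_s      # simulated WAN round trip
        self.log = []

    def write(self, record):
        time.sleep(self.latency_s)      # pretend to cross the network
        self.log.append(record)

def sync_write(replicas, record):
    """Wait for every data centre to acknowledge: consistent, but slow."""
    for r in replicas:
        r.write(record)                 # caller blocks on the slowest replica

def async_write(replicas, record):
    """Acknowledge at once and replicate in the background: fast, but a
    failure mid-flight can leave the copies disagreeing."""
    for r in replicas:
        threading.Thread(target=r.write, args=(record,), daemon=True).start()

replicas = [Replica("nyc", 0.002), Replica("atlanta", 0.070)]
start = time.time()
sync_write(replicas, "order-1")
print("sync write took %.0f ms" % ((time.time() - start) * 1000))
start = time.time()
async_write(replicas, "order-2")
print("async write returned in %.0f ms" % ((time.time() - start) * 1000))
```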
In practice, about the only time you ever get to realistically test your distributed setup is when something actually goes wrong. Given that all software has at least one bug, that there are so many different systems involved, from low-level IP routing on up, and that there is lots more of it connected over flaky links, it is unsurprising that problems happen.
Statistically, you have fewer failures the fewer systems and connections you have. The consequence of a failure is far higher, though.
Bang on. Proper redundancy is far harder than most people seem to assume.
Many smaller firms will have a higher overall uptime by just putting all their eggs in that one basket. That perhaps offends some, but they’re probably more busy arguing about the siting of data centres and whether raised floors are the way to go or not (from reading the other comments here).
Having spoken to a few CTO types at start-ups, the problem isn’t that they have decided to site their entire business in ‘the cloud’ – it’s that they’ve done it thinking ‘the cloud’ doesn’t fail.
Single-engine planes experience engine trouble at half the rate of two-engine planes.
My car has 2 in-line sixes. When one engine fails, I can limp along on the other.
Is it true that Romney is going to outsource the Electoral College to China?
This is why there are DR and COOP consultants. You walk by your server several times a day; we’ll notice that it’s on the floor or under a sprinkler head. And there -are- standards, it’s possible to audit against the standards, and it’s possible to demand the latest audit report from your proposed cloud provider.
For life-safety applications such as backup power in a hospital there are no excuses, none, and the responsible parties at NYU Medical Center are lucky no one died and that they escaped certain criminal prosecution.
Hire one of us; the mice you save may be your own.
I read of volunteers carrying diesel fuel up 18 stories by bucket brigade to keep servers running, and I’m thinking, wtf? This storm wasn’t a sudden surprise; it was huge, and everyone was saying for days it was likely to be devastating.
Are those servers really that important?
If so, shouldn’t they have been backed up off-site?
If they weren’t already backed up, shouldn’t they be doing that now and planning to shut down until services are restored?
Makes the old adage “an ounce of prevention is worth a pound of cure” take on a very literal meaning.
I’m also shocked that SquareSpace’s setup is exclusively in Manhattan, but the lengths they went to to keep things going are amazing. It makes me want to be a customer of theirs (if they’ve learned their lesson and invest in a redundant facility).
Another thing: Test your systems AND PLANS before you need them! I work for a contractor on a government contract out in Chantilly, VA, the Westfields area. The client has a major data center which is Extremely Mission Critical. So it has big generator backup power, and failover to an identical facility in another county.
One day the fire marshal went through and, as part of his inspection, hit the Big Red Switch which dropped power for the entire building. The generators wouldn’t start. The failover switches failed. The Extremely Mission Critical System went down, hard.
Turned out that no one had developed a Plan for bringing the entire system up from a cold start. How do you sequence a software system that throws errors and shuts back down if the services that are designed to be run simultaneously have to be brought up one at a time?
Took 2 days to get back up.
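For what it’s worth, the bones of such a bring-up plan can be captured as a dependency graph walked in topological order, so nothing starts before the things it needs. A minimal sketch with made-up service names (Python 3.9+ for graphlib):

```python
# Hypothetical cold-start bring-up plan: each service lists what must already
# be running before it can start, and the plan is just a topological order.
from graphlib import TopologicalSorter  # standard library in Python 3.9+

dependencies = {
    "database": set(),
    "cache":    {"database"},
    "auth":     {"database"},
    "api":      {"database", "cache", "auth"},
    "web":      {"api"},
}

def start(service):
    # A real plan would block here on a health check before moving on.
    print(f"starting {service} ...")

for service in TopologicalSorter(dependencies).static_order():
    start(service)
```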
First law of information security: Thou shalt not forget about Murphy and his laws.
The big problem here is that 100-year engineering events are now going to happen every 10 years. It’s not that the designs are wrong; the specifications on which the designs are based are wrong. Different tradeoffs have to be made. We thought these events would happen rarely enough that the cost of mitigation exceeded the benefits. But now we are beginning to realize they will happen more frequently, and we get more value from the mitigation effort.
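A back-of-envelope calculation shows why the return period matters so much, assuming independent years and an illustrative 20-year facility lifetime:

```python
# Chance of seeing at least one "T-year" event over a facility's lifetime.
def p_at_least_one(return_period_years, lifetime_years):
    annual_p = 1.0 / return_period_years
    return 1.0 - (1.0 - annual_p) ** lifetime_years

for T in (100, 10):
    print(f"{T:3d}-year event over 20 years: "
          f"{p_at_least_one(T, 20):.0%} chance of at least one")
# 100-year event over 20 years: 18% chance of at least one
#  10-year event over 20 years: 88% chance of at least one
```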
Very insightful! I think you are correct on this one.
In addition to the more frequent occurrence of tail-end events (a migration to fat tails), additional data points will be added to the consideration and audit lists. As an example, having pumps in the basement will become a flag, fans will be checked, and the supply chain for data and information architecture will be scrutinized more closely.
My guess is regulators will also push for this, if nothing else driven by the force of operational risk and reserves for Sarbanes-Oxley and other reporting.
Your points sound reasonable and wise, but not everyone has a billion dollars to throw at infrastructure. As a small business owner myself, I don’t have thousands of dollars to throw at hosting every month, so I use cheap hosting at local datacenters that I know are not redundant or disaster-ready. Datacenter profit margins are thinner than razor-thin; not everyone can be Amazon or Rackspace.
If my website or my email goes down for a few days because of a backhoe accident, I’m not happy but I have to remember “I’m getting what I pay for.” I’m paying $40/month to colo my servers. I can’t expect 5-star service or support at that price level. I have backups; I could rebuild the servers within a couple of days and that’s good enough. The money I save on gold-plated hosting outweighs the money I’d lose from the downtime.
Of course, I’m not a bank or a hospital.
Very much agreed.
Not to mention that even having two data centers requires keeping them in sync with each other – thus you need more manpower, and for a small business that may or may not be an option.
It’s not a matter of whether the SMB wants to. It’s what they can afford to do or not to do. For some, they can’t afford not to have multiple data centers, but for many just getting the first is a big deal.
What Sandy really exposed was how bad our economy really is. In times of prosperity, it’s easy for companies to invest in redundant infrastructure. In bad times, I’m sure a lot of “skimping” was done with regard to disaster preparedness. Everything not immediately necessary suffers in bad economic times.
B.J.
BlaneJackson.com
” In bad times, I’m sure a lot of “skimping” was done with regard to disaster preparedness.”
Everything that I’ve seen indicates that most IT shops have been skimping on important things like infrastructure for years, regardless of how the economy was doing. Instead of taking advantage of the good times to improve their operations, they gave their executives bigger golden parachutes.
Then the bad times come and they blame the bad times for their own stupidity.
Penny wise, ton foolish.
Netflix, before the Amazon EC2 outage:
“The best way to avoid failure is to fail constantly”
http://techblog.netflix.com/2010/12/5-lessons-weve-learned-using-aws.html?m=1
Netflix after the outage:
http://techblog.netflix.com/2012/10/post-mortem-of-october-222012-aws.html?m=1
To spare people the read: Netflix did implement their stuff right, so this October outage caused only about 20 minutes of downtime for some customers. That’s less than a typical maintenance downtime.
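For readers who skip the links: the “fail constantly” idea means deliberately killing instances so the recovery path is exercised all the time, not just during a hurricane. A toy sketch with a hypothetical Instance class, not Netflix’s actual Chaos Monkey (which drives real cloud APIs):

```python
# Toy "fail constantly" loop: randomly terminate one instance per round so the
# surviving instances (and the fail-over logic) must pick up the traffic.
import random
import time

class Instance:
    def __init__(self, name):
        self.name = name
        self.alive = True

    def terminate(self):
        self.alive = False
        print(f"killed {self.name}; traffic should shift to the survivors")

def chaos_loop(pool, interval_s=1.0, rounds=3):
    for _ in range(rounds):
        candidates = [i for i in pool if i.alive]
        if len(candidates) <= 1:
            break                       # always leave at least one instance up
        random.choice(candidates).terminate()
        time.sleep(interval_s)

chaos_loop([Instance(f"web-{n}") for n in range(4)])
```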
Thanks, a great read.
As to the main article: to grok redundancy you have to understand your single points of failure. Easy cloud redundancy means redundant providers and geography, i.e. get Amazon in Arlington and in Asia, Google in California and North Carolina. Cloud computing makes this cheap.
To be fair to Amazon: they did advise clients to spread their software over multiple zones. It’s not easy to do that, but Amazon did at least advise it. Netflix did and so they survived.
What I don’t understand is why Apple put a data center in Reno when every map of backbone providers I found showed a dearth of them in or around Reno but quite a few going through the Salt Lake Valley, since it’s a route through the Rocky Mountains. My only conclusion is that Reno was willing to bend over and take it in the taxes whereas Salt Lake was not.
Apparently there was some secrecy surrounding the deal: ” “I wanted to say something so bad,” Weber said Wednesday. “I knew about it for the last couple of months…but I wanted to tell my 17-year-old son who wants a job with Apple so bad.” ” https://www.rgj.com/article/20120628/NEWS/306270137/Reno-OK-s-fast-tracked-Apple-deal
The irony is that TCP/IP, etc., was engineered to survive a nuclear attack and take advantage of multiple routes.
Lotta good that does if everything else is single point of failure.
Nothing, except cost, is stopping me from having 2 ISPs, computers, and routers. But I like to live dangerously.
I live in New Jersey, 10 miles from Manhattan. Sandy was a storm the likes of which I’ve never seen before. This storm was so powerful that sea water was pushed miles upstream. A local town, 20 miles from the open ocean, was flooded with sea water.
At the height of the storm, I saw what I first thought were lightning flashes but quickly realized were flashes from electrical arcing as the overhead power lines were being pulled down. These electric arc flashes went on for hours. Our power went out around 6:30pm Monday evening. Power was restored on our street the following Thursday, but some of our neighbors were without power for 10 days. As I write this, there are still some buildings in the NY/NJ area that are without power. Interestingly, a small section of the town where I live, one that was inundated during Hurricane Floyd, had no flooding and retained power the entire time during Sandy.
The house where I live was built in 1927, and I still have copper phone lines and DSL even though fiber has been available in this neighborhood for years…..yes, I’m cheap. Our landline phone functioned the entire time…..from what I understand, the prehistoric copper network held up remarkably well. I also had cell service the entire time. As soon as power was restored, I immediately fired up the DSL and all was well. I’m assuming that DSL service never went out. At my neighbors, most of whom have FIOS, it was a different story. FIOS requires electricity to work, that’s why there is a UPS unit included with a FIOS install. My neighbors had phone and net until the UPS battery ran down and then nothing, no landline phone, net or cable.
Since Sandy blew through, I’ve talked to a dozen power company and Verizon people. What I’ve learned is this: Most of the power/communications networks in the US are a dilapidated patchwork that barely functions on a good day. Decades of avoiding upgrades and cheaping-out on repair and maintenance have left us with a third world power/communication grid, the ultimate result of putting immediate profits above all else. I’m sure the solution is to deregulate everything. Remove those pesky gov regulations and Donald Trump and Bain Capital will jump right in and build a shiny new grid like they have in ol’ Yurp…..socialist ol’ Yurp.
Yes, I can remember back in the 50s, Ronald Reagan, the actor, making TV ads for the power company promoting the advantages of all electric homes. That was back when the power company was allowed to charge for its services. Now, the law seems to be “justify every rate increase with an immediate urgent need”. And “spend your income on TV ads telling people to not use your product.”
Every state in the US, with the exception of Delaware, had utility regulations in place by 1930. Sorry, but utility companies couldn’t charge whatever they wanted for services back in the 1950s. Fact is that American utilities were the envy of the world until the Reagan era “deregulation” schemes took hold in the 1980s.
BTW, Romney lost, you really need to pull your head out of his ass.
Of course, I know they were regulated just like the phone company at the time. I never said they could charge anything they liked. But they could charge for their product, including a fair profit, with ads to encourage the use of electricity, and with an allowance for expansion and maintenance costs. As the costs rose, consumer groups with the “help” of government started to tell them how to destroy their business. The right amount of regulation is necessary for so-called natural monopolies. Too much can be detrimental.
Nice Post…..