Big data center down now too…

San Francisco’s 365 Main datacenter went down with a power outage, which knocked out Six Apart, Technorati, and Craigslist, among others. O’Reilly’s Radar is tracking the situation. This might have something to do with the Netflix datacenter having trouble too.

What I wonder is just how this happened. Datacenters usually have their own power systems and backups (the ones I’ve been in have both huge uninterruptible power supplies, which are literally huge batteries, as well as generators that they can fire up if it looks like power won’t come back on soon). Sounds like either someone really screwed up or the infrastructure isn’t getting the attention it needs, and both are bad for the tech industry. This is especially bad for a datacenter located just miles from two major earthquake faults. If we have a big earthquake here, it’s conceivable that power would be out for days if the right lines got cut.
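
For anyone who hasn’t seen one of these setups, here is a minimal, purely illustrative sketch of the battery-plus-generator failover described above. The 15-second generator start delay, the function name, and the rest of the details are invented for the example; real facilities use automatic transfer switches and far more involved control logic.

```python
# Illustrative sketch only: which layer of a datacenter's backup power should
# be carrying the load at a given moment. All names and thresholds here are
# invented for this example.

def active_power_source(utility_up: bool,
                        outage_seconds: float,
                        battery_minutes_left: float,
                        generator_online: bool,
                        generator_start_delay_s: float = 15.0) -> str:
    """Return which power source should be carrying the load right now."""
    if utility_up:
        return "utility"
    if generator_online:
        return "diesel generator"
    if outage_seconds < generator_start_delay_s or battery_minutes_left > 0:
        # The UPS batteries bridge the gap while the generators spin up,
        # or keep carrying the load if the generators fail to start.
        return "battery UPS"
    return "dark"  # every layer has failed

# Example: 30 seconds into an outage, generators not yet online, 10 minutes
# of battery left: the UPS should still be carrying the load.
print(active_power_source(False, 30.0, 10.0, False))  # -> battery UPS
```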

45 thoughts on “Big data center down now too…”

  1. 🙂 This is exactly why Microsoft is building a new data center in our fairly cheap-power, no-earthquake, no-hurricane city of San Antonio!

    Oh – and Google – they are already here, but with a much lower profile than the new MS datacenter.

    Rob

  2. Somebody screwed the pooch, bigtime.

    ServePath is on battery backup but still operational.

  3. I’m in SoMa right now. There were a series of outages between 2 and 3 PM PDT, each lasting a minute or two.

  4. Even San Antonio is subject to backhoe fade, which is the likely culprit in today’s outages. According to the IT staff at my employer (which was also affected), the datacenters are fine; it’s the network switches connecting them to the outside world that are having problems.

  5. Uhmmm… actually, a properly designed datacenter has numerous connections to the backbone all coming in from different places. If all the inputs were run in one bundle, then that’s a bonehead problem, not a backhoe problem!

    Rob

  6. So, I suppose that when disaster strikes and takes out a key data center, we can update via Twitter and Facebook, right? Oh, wait, never mind.

  7. I was wondering if Netflix’s problem was related, but they’ve been down for hours and the datacenter hasn’t, has it?

  8. Unfortunately, this kind of thing does happen occasionally. I had the same experience with our tier-one data center provider last year. Fully redundant power, tested monthly, but an unexpected sequence of events and a mistimed override brought the whole thing down.
    Interested to hear what happened at 365, though. Sitting here in SoMa we lost power about 5 times, for 30 seconds each time.

  9. Here’s what really went down at 365main:

    365main, like all facilities built by Above.net back in the day, doesn’t have a battery-backup UPS. Instead, they have these things called “CPS”, or continuous power systems. These are very large flywheels that sit between electric motors and generators, so the power from PG&E never directly touches 365main. PG&E power drives the motors, which turn the flywheels, which then turn the generators (or alternators, I don’t remember the exact details), which in turn power the facility. There are 10 of these on their roof.

    The flywheels (the CPS system) can run the generator at full load for up to 60 seconds according to the specs.

    There are also 10 large diesel engines up on the roof, connected to these flywheels. If the power is out for more than 15 seconds, the generators start up, clutch in, and drive the flywheels. There are no generators in the basement. (There is a large fuel tank in the basement, and the fuel is pumped up to the roof. There are smaller fuel tanks on the roof as well.)

    Here’s what I think happened. Since there were several brief outages in a row before the power went out for good, it seems that the CPS (flywheel) systems weren’t fully back up to speed when the next outage occurred. Because these grid interruptions came in quick succession and each was shorter than the time required to trigger generator startup, the generators were never automatically started, but the CPS also didn’t have time to get back up to full capacity. By the 6th power glitch, there wasn’t enough energy stored in the flywheels to keep the system going long enough for the diesel generators to start up and come up to speed before switching over (a rough sketch of this arithmetic follows the comment).

    Why they didn’t just manually switch on the generators at that point is beyond me.

    So they had a brief power outage. By our logs, it looks like it was at most 2 minutes, but probably closer to 20 seconds or so.

    Here’s the letter they sent to their customers about this:

    This afternoon a power outage in San Francisco affected the 365 Main St. data
    center. In the process of 6 cascading outages, one of the outages was not
    protected and reset systems in many of the colo facilities of that building.
    This resulted in the following:

    – Some of our routers were momentarily down, causing network issues. These
    were resolved within minutes. Network issues would have been noticed in our
    San Francisco, San Jose, and Oakland facilities.

    – DNS servers lost power and did not properly come back up. This has been
    resolved after about an hour of downtime and may have caused issues for many
    GNi customers that would appear as network issues

    – Blades in the BC environment were reset as a result of the power loss.
    While all boxes seem to be back up we are investigating issues as they come in

    – One of our SAN systems may have been affected. This is being checked on
    right now

    If you have been experiencing network or DNS issues, please test your
    connections again. Note that blades in the DVB environment were not affected.

    We apologize for this inconvenience. Once the current issues at hand are
    resolved, we will be investigating why the redundancy in our colocation power
    did not work as it should have, and we will be producing a postmortem report.

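If it helps to see the hypothesis in the comment above as arithmetic, here is a minimal sketch of it. The 60-second ride-through and the 15-second generator-start trigger come from that comment; the recharge rate and the glitch pattern are assumptions chosen only so the numbers land on a failure at the sixth glitch, and none of this reflects 365 Main’s actual control system.

```python
# Back-of-the-envelope sketch of the hypothesized flywheel (CPS) failure mode.
# RIDE_THROUGH_S and GEN_TRIGGER_S come from the comment above; RECHARGE_RATE
# and the glitch pattern are assumptions chosen purely for illustration.

RIDE_THROUGH_S = 60.0  # seconds a fully spun-up flywheel can carry full load
GEN_TRIGGER_S = 15.0   # an outage must last this long to auto-start the diesels
RECHARGE_RATE = 0.02   # assumed: seconds of ride-through regained per second on utility

def simulate(glitches):
    """glitches: list of (seconds_on_utility_beforehand, outage_length_s)."""
    stored = RIDE_THROUGH_S
    for i, (uptime, outage) in enumerate(glitches, 1):
        stored = min(RIDE_THROUGH_S, stored + uptime * RECHARGE_RATE)
        if outage >= GEN_TRIGGER_S:
            print(f"glitch {i}: long enough to auto-start the generators")
            continue
        stored -= outage
        if stored <= 0:
            print(f"glitch {i}: flywheels exhausted, load drops")
            return
        print(f"glitch {i}: ridden through, {stored:.0f}s of spin left")
    print("survived all glitches")

# Six short dips, each under the generator-start threshold, about 100 seconds
# apart -- too short to start the diesels, too frequent for the flywheels to
# fully recover. The sixth one drops the load.
simulate([(100.0, 12.0)] * 6)
```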

  10. Pingback: The Feed Bag
  11. “Power out for days”? If the big one hits, try weeks. How long were parts of Seattle down recently? Up to ten days. And that was with little damage done to other regions and states.

    I hope everyone reading has food and water for at least ten days. Officially, FEMA tells you three days; unofficially, ten days, according to a presentation I attended last year.

  12. Yeah, that’s why I said “maybe it’s true, and maybe it’s ValleyWag!” I thought the story was way too funny to ignore… I see they attack you all the time, so I knew you would get a kick out of their reporting…

    TTFN

  13. Echoing some of the earlier comments: each summer there are brownouts around the country. Smart companies plan for this and have multiple data centers. i/o Data Centers is another excellent alternate data center choice, in Phoenix, AZ, where there are few of California’s weather or other risks. They have the generators and backup to make sure this doesn’t happen.

  14. When are you people gonna move to Vancouver? No brownouts here. Also no government spyware (at least, none that runs well enough to be useful!).

    Here’s the laughing squid story:
    http://laughingsquid.com/massive-power-outages-hit-san-franciscos-soma-district/

    It’s strange: this happened, what, two days after the two largest bandwidth providers in the US also failed, causing some WordPress.com blogs (like mine!) to be inaccessible for up to several hours.

    It’s just…weird. I’m getting my tinfoil helmet.

Comments are closed.