Author Topic: Site Outage On Sept 12 @ ~ 12:30 Pm Pst (gmt+8)  (Read 12026 times)

offroadgeek

  • Administrator
  • Hero Member
  • *****
  • Posts: 1419
    • View Profile
    • https://www.oesf.org
Site Outage On Sept 12 @ ~ 12:30 Pm Pst (gmt+8)
« on: September 12, 2005, 11:18:41 pm »
Due to a power outage in Los Angeles, DreamHost had a complete outage in their hosting facility, and we of course lost our sites.

For some reason, ZUG came back up much sooner than the OESF site or the forums, and the OESF site still seems to be down (yet the forums are back up... with DB connectivity issues here and there).

Who knows if the sites will be stable in the next few hours/days.  I suspect that if the lights stay on then we'll be OK...
« Last Edit: September 12, 2005, 11:20:21 pm by offroadgeek »
Search the OESF Wiki
C1000 w/Cacko 1.23 beta (from Streamline) / 760 pdaxrom rc9 / 6000L (thanks Santa's elf!) / 5500 - OZ 3.3.5 / SIMpad SL4
1GB, 256mb SanDisk CF / 2x 1GB, 512mb, 256mb, 128mb SanDisk SD
Ambicom WL100C-CF wifi / Socket 56k CF modem / AmbiCom BT2000-CF (x2)
Pocketop keyboard, Piel Frama case (1000 & 5500), PDAir case (760 & 1000)
sip:536093@fwd.pulver.com
| OESF | ELSI | Zaurus User Group | ZaurusThemes |

offroadgeek

  • Administrator
  • Hero Member
  • *****
  • Posts: 1419
    • View Profile
    • https://www.oesf.org
Site Outage On Sept 12 @ ~ 12:30 Pm Pst (gmt+8)
« Reply #1 on: September 14, 2005, 10:57:13 am »
Is it just me, or has the responsiveness of the forums gone down significantly since the power outage?
Search the OESF Wiki
C1000 w/Cacko 1.23 beta (from Streamline) / 760 pdaxrom rc9 / 6000L (thanks Santa's elf!) / 5500 - OZ 3.3.5 / SIMpad SL4
1GB, 256mb SanDisk CF / 2x 1GB, 512mb, 256mb, 128mb SanDisk SD
Ambicom WL100C-CF wifi / Socket 56k CF modem / AmbiCom BT2000-CF (x2)
Pocketop keyboard, Piel Frama case (1000 & 5500), PDAir case (760 & 1000)
sip:536093@fwd.pulver.com
| OESF | ELSI | Zaurus User Group | ZaurusThemes |

albertr

  • Hero Member
  • *****
  • Posts: 535
    • View Profile
    • http://
Site Outage On Sept 12 @ ~ 12:30 Pm Pst (gmt+8)
« Reply #2 on: September 14, 2005, 11:35:55 am »
It seems to be slow for me too.
-albertr

raybert

  • Full Member
  • ***
  • Posts: 233
    • View Profile
Site Outage On Sept 12 @ ~ 12:30 Pm Pst (gmt+8)
« Reply #3 on: September 14, 2005, 12:44:00 pm »
I've noticed that my IMAP server (hosted on DreamHost) has also been slow since the outage.  I suspect they're still working out issues.

~ray

icruise

  • Sr. Member
  • ****
  • Posts: 292
    • View Profile
Site Outage On Sept 12 @ ~ 12:30 Pm Pst (gmt+8)
« Reply #4 on: September 14, 2005, 04:06:28 pm »
The forums are VERY slow.

doseas

  • Full Member
  • ***
  • Posts: 207
    • View Profile
    • http://
Site Outage On Sept 12 @ ~ 12:30 Pm Pst (gmt+8)
« Reply #5 on: September 14, 2005, 08:56:04 pm »
Very slow.  And it's definitely not network congestion (I'm located in L.A., by the way):

Code:
tracert www.oesf.org

Tracing route to www.oesf.org [205.196.211.135] over a maximum of 30 hops:

  1    45 ms   <10 ms   <10 ms  10.9.37.203
  2     4 ms     3 ms     2 ms  router [10.9.40.99]
  3     *        *        *     Request timed out.
  4    73 ms     6 ms     3 ms  64.215.88.89
  5     8 ms     6 ms     4 ms  s9-0-1.ar2.LAX1.gblx.net [64.215.184.29]
  6     6 ms     6 ms     6 ms  so0-3-0-622M.ar1.LAX3.gblx.net [67.17.64.49]
  7    22 ms    24 ms    22 ms  GE1-GX.dreamhost.com [67.17.162.162]
  8    23 ms    24 ms    24 ms  oesf.org [205.196.211.135]

Trace complete.
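
Those round-trip times (about 23-24 ms to oesf.org on the last hops) back up the point: the network path looks healthy, so the sluggishness is almost certainly on the server itself. One rough way to see the same thing without a traceroute is to time a bare TCP connect against a full page fetch; if the connect is quick but the fetch drags, the wait is server-side. This is only a Python sketch, not something anyone in the thread actually ran:

Code:
# Compare network latency (TCP connect) with total server response time.
# If the connect is ~tens of ms but the full fetch takes seconds, the delay
# is on the server rather than in the network path.
import socket
import time
import urllib.request

HOST = "www.oesf.org"

start = time.time()
socket.create_connection((HOST, 80), timeout=10).close()
print(f"TCP connect:     {time.time() - start:.3f}s")

start = time.time()
urllib.request.urlopen(f"http://{HOST}/", timeout=60).read()
print(f"Full HTTP fetch: {time.time() - start:.3f}s")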

albertr

  • Hero Member
  • *****
  • Posts: 535
    • View Profile
    • http://
Site Outage On Sept 12 @ ~ 12:30 Pm Pst (gmt+8)
« Reply #6 on: September 14, 2005, 09:20:43 pm »
I'm not a network engineer, but my trace doesn't look that good either:
Code:
7  tbr1-p012301.phlpa.ip.att.net (12.123.137.62)  61.112 ms (244)  80.987 ms (244)  125.417 ms (244)
 8  tbr1-cl8.n54ny.ip.att.net (12.122.2.17)  65.515 ms (245)  100.357 ms (245)  98.675 ms (245)
 9  ggr1-p360.n54ny.ip.att.net (12.123.1.121)  83.972 ms (247)  506.34 ms (247)  307.646 ms (247)
10  so1-1-0-622M.ar1.NYC1.gblx.net (208.51.134.5)  154.849 ms (245)  62.967 ms (245)  25.692 ms (245)
11  so0-0-0-622M.ar1.LAX3.gblx.net (67.17.73.38)  142.590 ms (241)  133.561 ms (241)  233.319 ms (241)
12  GE1-GX.dreamhost.com (67.17.162.162)  393.529 ms (241)  481.75 ms (241)  250.506 ms (241)
13  oesf.org (205.196.211.135)  101.460 ms (49)  149.926 ms (49)  129.429 ms (49)
-bash-2.05b$

Look how the TTL changed from 241 to 49!
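
A reply's TTL only means much relative to the initial TTL the sender started with: backbone routers commonly start at 255, while a Linux host such as oesf.org typically starts at 64, so the two numbers can still describe roughly the same distance. Here's that arithmetic as a small Python sketch; the initial-TTL values are common defaults I'm assuming, not anything the trace itself reports:

Code:
# Estimate hop distance from a reply's TTL by assuming the sender started
# at one of the common initial TTLs (64, 128, 255). The values are copied
# from the trace above; nothing here touches the network.
COMMON_INITIAL_TTLS = (64, 128, 255)

def hops_from_reply_ttl(reply_ttl):
    """Guess (initial TTL, hops traveled) for a packet that arrived with reply_ttl."""
    initial = min(t for t in COMMON_INITIAL_TTLS if t >= reply_ttl)
    return initial, initial - reply_ttl

for label, ttl in (("GE1-GX.dreamhost.com reply", 241), ("oesf.org reply", 49)):
    initial, hops = hops_from_reply_ttl(ttl)
    print(f"{label}: TTL {ttl} -> assumed initial {initial}, ~{hops} hops traveled")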
-albertr

albertr

  • Hero Member
  • *****
  • Posts: 535
    • View Profile
    • http://
Site Outage On Sept 12 @ ~ 12:30 Pm Pst (gmt+8)
« Reply #7 on: September 15, 2005, 02:34:04 pm »
As of this morning, the speed is back to normal for me.
-albertr

offroadgeek

  • Administrator
  • Hero Member
  • *****
  • Posts: 1419
    • View Profile
    • https://www.oesf.org
Site Outage On Sept 12 @ ~ 12:30 Pm Pst (gmt+8)
« Reply #8 on: September 15, 2005, 11:02:28 pm »
I got a nice email from DreamHost earlier today... good details, and they specifically call out that they're working on the speed issues...

Quote
On Monday, September 12 the greater Los Angeles area experienced a major power outage affecting large sections of the city, including our main data center. The power outage began shortly before 1pm PST and continued until about 4:30pm PST. Our data center is equipped with a redundant backup power system with both battery UPS systems and diesel generators, but the backup failed and our entire data center was powered down.

We have previously covered much of this information on our official weblog (http://blog.dreamhost.com/) but many of you have not seen that information so we will summarize the events here.

When the grid power to our building was cut, the UPS system kicked in and kept everything in the building up and running. The five generators also fired up and began providing power. The building needs four generators to operate at full power so the system is designed to tolerate a single failure. Unfortunately, two of the five generators failed within minutes of each other. We receive our power from the building housing our data center and they also manage the redundant power system. We do not know the exact reason for the generator failures at this time. We have received some vague explanations that we have not found to be satisfactory. Regardless, the remaining three generators were not sufficient to meet the building's power needs and that caused the emergency electrical systems to transfer into a “load shedding mode” and the building's UPS system to turn itself off, thus preventing permanent UPS and related equipment damage. That shut everything down, including emergency lighting, and the building was evacuated.

About 15 minutes later, one of the generators was started up to power emergency lighting and a couple of our senior technicians made their way into the (still evacuated) building and down to our data center to assess the damage. Since the backup power had failed, our own data center power remained off until the main grid power came back. We then proceeded to slowly power up our equipment. Servers (and all computers) consume significantly more power when booting up than when up and running so there is some risk of overloading the power circuits if too many of them are flipped on at once. Keeping that in mind, we powered everything on as quickly as possible. At that time the majority of our services were fully back up and running but some services were still down and we began the process of systematically verifying all services and making any necessary repairs and adjustments. Whenever a large number of servers suddenly loses power a certain small percentage of them will not come back up properly and when you have several hundred servers it takes awhile to verify all of them.

Once our own access to our servers was restored our staff continued working into the night to restore as much service as possible and to respond to as many of your support cases as possible. Some of our staff continued working all the way through the night and we were able to restore almost everything that first night.

Tuesday (September 13) started off early with all of us addressing the residual issues. At around noon that day one of our core routers experienced an internal failure stemming from damage previously sustained during the power outage. Our routers handle all of the Internet traffic coming in and out of our network and they are set up in a redundant way to minimize network disruption when a failure does occur. In this case, the main cpu of the router (called the 'supervisor') died and the secondary one took over. Everything continued working almost as it should have, but there is a remaining router issue that we are still working with Cisco support on. That issue is responsible for the slower than normal performance of our network and it will be resolved absolutely as soon as possible.

During this outage, our off-network Emergency Status Page (http://status.dreamhost.com/) proved to be an invaluable resource for disseminating information among our customers. That status page remained up throughout the power outage and was updated regularly as we received new information. Unfortunately, not everyone knows about it and we will be working to improve that situation in the coming days. Those bloggers among you that did know to check the status page were extra helpful in passing along the information to other dreamhosters who were still in the dark. Thank you to everyone who helped out with that!

This announcement will be followed by another explaining what went wrong with our processes and what we plan to do to address them. That will come in the next few days.

We will be continuing to provide more detailed information on our official weblog found here: http://blog.dreamhost.com/

Also, everyone who has not bookmarked our Emergency Status Page should do so now. That page is found here: http://status.dreamhost.com/

We will be improving on the basic page we have there to provide as useful of an avenue of information as possible.

If you have any additional questions about this outage, please let us know. We will be happy to address all of your questions or concerns.


The Un-Happy DreamHost Powerless Team
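
The staged power-up described above is basically a batching problem: bring machines back a few at a time so the extra draw at boot never overloads a circuit. A minimal sketch of the idea; the batch size, delay, host names, and power_on() call are all hypothetical and have nothing to do with DreamHost's actual tooling:

Code:
# Toy model of a staged power-up: power servers on in small batches and let
# the boot-time surge settle before starting the next batch.
import time

BATCH_SIZE = 20      # assumed circuit headroom: machines that can boot at once
BATCH_DELAY_S = 120  # assumed time for the boot load to settle

def power_on(host):
    # Placeholder for a real out-of-band power-on (IPMI, managed PDU, etc.).
    print(f"powering on {host}")

def staged_power_up(hosts):
    for i in range(0, len(hosts), BATCH_SIZE):
        for host in hosts[i:i + BATCH_SIZE]:
            power_on(host)
        if i + BATCH_SIZE < len(hosts):
            time.sleep(BATCH_DELAY_S)

staged_power_up([f"server{n:03d}" for n in range(1, 101)])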
Search the OESF Wiki
C1000 w/Cacko 1.23 beta (from Streamline) / 760 pdaxrom rc9 / 6000L (thanks Santa's elf!) / 5500 - OZ 3.3.5 / SIMpad SL4
1GB, 256mb SanDisk CF / 2x 1GB, 512mb, 256mb, 128mb SanDisk SD
Ambicom WL100C-CF wifi / Socket 56k CF modem / AmbiCom BT2000-CF (x2)
Pocketop keyboard, Piel Frama case (1000 & 5500), PDAir case (760 & 1000)
sip:536093@fwd.pulver.com
| OESF | ELSI | Zaurus User Group | ZaurusThemes |