- Post Mortem for Maintenance on 5/19/2008
We had scheduled maintenance tonight that was planned to last only 15 minutes. However, much to our dismay, some customers were down for close to 2 hours. Here's the skinny on what took place...
- After all the servers were moved, our data center folks switched our IP blocks over to the new rack. However, the switch was not done correctly, and not all of the routes became available. This was remedied within a few minutes.
- Next, server e46 would not boot. The dreaded "Operating System Not Found" message appeared. It appears that the 3ware RAID controller card failed. After a few hair-raising minutes, we took the RAID drives out of that machine and put them into a brand new server we had on standby. Presto, e46 booted up. The customers on e46 had to deal with some extra downtime, but got a performance boost in the process (from 2 dual-core Xeons to 2 quad-core Xeons).
- Server m3 had a bad disk at boot, which prevented the machine from booting up quickly. We replaced the bad disk in the RAID and it booted up.
- Routing between our data center and our servers was not working properly. What happened was that each customer VPS started up before the routes were in place. After the routes came up, the VPSs did not recognize the new routes, and thus were unreachable. It turned out that restarting each VPS solved the problem.
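For the technically curious, the ordering problem in the last item can be sketched in a few lines of Python. This is only an illustration of the idea of holding VPS startup until the routes are up, not the tooling we actually run: `routes_ready` and `start_vps` are hypothetical callables standing in for whatever route check and VPS start command a host would use, and the timeout values are made up.

```python
import time
from typing import Callable, Iterable, List


def wait_for_routes(routes_ready: Callable[[], bool],
                    timeout_s: float = 300.0,
                    poll_s: float = 5.0,
                    sleep: Callable[[float], None] = time.sleep,
                    clock: Callable[[], float] = time.monotonic) -> bool:
    """Poll routes_ready() until it returns True or timeout_s elapses."""
    deadline = clock() + timeout_s
    while True:
        if routes_ready():
            return True
        if clock() >= deadline:
            return False
        sleep(poll_s)


def start_vps_after_routes(vps_names: Iterable[str],
                           routes_ready: Callable[[], bool],
                           start_vps: Callable[[str], None],
                           **wait_kwargs) -> List[str]:
    """Start each VPS only once the routes are confirmed to be in place.

    routes_ready and start_vps are hypothetical hooks supplied by the caller.
    Returns the names that were started; an empty list means the routes
    never came up within the timeout, so nothing was started.
    """
    if not wait_for_routes(routes_ready, **wait_kwargs):
        return []
    started = []
    for name in vps_names:
        start_vps(name)
        started.append(name)
    return started
```

With a check like this in front of the startup step, a VPS would never come up ahead of its routes, so the restart we had to do tonight would not be needed.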
SilverRack always takes downtime very seriously. We are not happy with the events of today. What was scheduled for 15 minutes of downtime turned into close to 2 hours. We will review our procedures and make necessary adjustments so that this does not happen again.
Please accept our apologies. Please email us if you have any questions or concerns.
- Posted May 19, 2008 21:06 by dave