Server Outage?

I was in the middle of doing an rsync backup of the server when I lost communications with it. I did a few traceroutes and filed a trouble ticket with the colocation service. Follow along to see what happened.

Here's the ticket I filed:

IPs in question: 69.60.124.47-52, 69.60.124.169-170

I was interactively using 69.60.124.47 and I lost my connection. I did a traceroute to the IP over the course of about 15 minutes. Early on the traceroute would die pretty high up but after a while it died fairly close to the server. Here is my latest traceroute:

traceroute to 69.60.124.47, 30 hops max, 40 byte packets
1 153.90.199.254 (153.90.199.254) 0.284 ms 0.323 ms 0.367 ms
2 192.105.205.41 (192.105.205.41) 3.122 ms 3.114 ms 3.176 ms
3 192.105.205.49 (192.105.205.49) 3.368 ms 3.499 ms 3.419 ms
4 icar-sttlwa01-so-2-0-1--16.infra.pnw-gigapop.net (209.124.188.144) 21.748 ms 21.865 ms 21.649 ms
5 pnwgp-cust.tr01-sttlwa01.transitrail.net (137.164.131.186) 21.840 ms 21.836 ms 21.919 ms
6 te4-3--301.tr01-sttlwa01.transitrail.net (137.164.131.185) 21.905 ms 21.666 ms 21.640 ms
7 te4-1--160.tr01-plalca01.transitrail.net (137.164.129.34) 39.209 ms 39.287 ms 39.360 ms
8 gi2-3.mpd01.sjc04.atlas.cogentco.com (154.54.11.17) 39.556 ms 39.638 ms 39.619 ms
9 te8-2.ccr02.sfo01.atlas.cogentco.com (154.54.7.173) 40.407 ms vl3490.mpd01.sfo01.atlas.cogentco.com (154.54.2.165) 40.601 ms 41.088 ms
10 te4-4.ccr02.sjc01.atlas.cogentco.com (154.54.2.138) 41.256 ms 41.357 ms te7-4.mpd01.sjc01.atlas.cogentco.com (154.54.6.134) 41.641 ms
11 te3-2.mpd01.lax01.atlas.cogentco.com (154.54.5.182) 79.107 ms te7-2.ccr02.lax01.atlas.cogentco.com (154.54.5.70) 49.863 ms 49.818 ms
12 te4-2.ccr01.sna02.atlas.cogentco.com (154.54.3.37) 50.264 ms te2-4.mpd01.iah01.atlas.cogentco.com (154.54.5.101) 92.798 ms te8-3.ccr02.iah01.atlas.cogentco.com (154.54.3.185) 92.363 ms
13 te4-2.ccr01.mia01.atlas.cogentco.com (154.54.24.198) 114.558 ms * *
14 te4-2.ccr01.phx02.atlas.cogentco.com (154.54.7.85) 61.037 ms vl3512.na21.b015452-0.mia01.atlas.cogentco.com (66.250.14.182) 115.361 ms vl3812.na21.b015452-0.mia01.atlas.cogentco.com (66.250.14.186) 116.206 ms
15 infolink.demarc.cogentco.com (38.112.4.126) 115.960 ms te4-2.ccr01.aus01.atlas.cogentco.com (154.54.1.165) 92.522 ms infolink.demarc.cogentco.com (38.112.4.126) 114.831 ms
16 64.251.7.99 (64.251.7.99) 115.675 ms 116.297 ms 115.903 ms
17 te3-2.ccr01.mia01.atlas.cogentco.com (154.54.24.194) 114.714 ms * *
18 * vl3512.na21.b015452-0.mia01.atlas.cogentco.com (66.250.14.182) 115.635 ms *
19 * * *
20 * * *
21 * * *
22 * * *
23 * * *
24 * * *
25 * * *
26 * * *
27 * * *
28 * * *
29 * * *
30 * * *

Can you reach http://www.montanalinux.org/ ? If so, the server is fine. Please do NOT restart the server if it is functioning. I'm curious whether it is a temporary routing issue or if the server indeed needs to be restarted.

Here's the response I got back:

The server's mirrors are failing to mount and the boot process ends in a "Kernel Panic".

Oh no... here's how I replied:

Hmmm, that is not good. We have three 250GB drives in it with Linux software RAID1. Two drives are a RAID1 and the third drive is a spare. If it won't boot, I can't fix it from here. Any way I could have one of your techs look at it?

Here's information about the server and how the hard drive was setup:

- - - - -

/dev/md0 /boot ext3 defaults 1 2
/dev/md1 / ext3 defaults 1 1
/dev/md2 swap swap defaults 0 0
/dev/md3 /vz ext3 defaults 1 2

/dev/md0 = /boot RAID1 made from /dev/sda1 and /dev/sdb1. /dev/sdc1 is a spare.
/dev/md1 = / RAID1 made from /dev/sda2 and /dev/sdb2. /dev/sdc2 is a spare.
/dev/md2 = swap RAID1 made from /dev/sda3 and /dev/sdb3. /dev/sdc3 is a spare.
/dev/md3 = /vz RAID1 made from /dev/sda4 and /dev/sdb4. /dev/sdc4 is a spare.
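For reference, arrays laid out like the ones above are built with mdadm. This is only an illustrative sketch (the original creation commands weren't recorded); the partition names come from the listing above:

```shell
# Sketch: create four RAID1 arrays, each with two active
# mirrors and one hot spare. Must be run as root against
# real, empty partitions -- shown here for illustration only.
mdadm --create /dev/md0 --level=1 --raid-devices=2 --spare-devices=1 \
    /dev/sda1 /dev/sdb1 /dev/sdc1
mdadm --create /dev/md1 --level=1 --raid-devices=2 --spare-devices=1 \
    /dev/sda2 /dev/sdb2 /dev/sdc2
mdadm --create /dev/md2 --level=1 --raid-devices=2 --spare-devices=1 \
    /dev/sda3 /dev/sdb3 /dev/sdc3
mdadm --create /dev/md3 --level=1 --raid-devices=2 --spare-devices=1 \
    /dev/sda4 /dev/sdb4 /dev/sdc4
```

With `--spare-devices=1`, the third partition sits idle until an active member fails, at which point the kernel promotes it and resyncs automatically.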

- - - - -

Disk /dev/sda: 250.0 GB, 250059350016 bytes
255 heads, 63 sectors/track, 30401 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes

Device Boot Start End Blocks Id System
/dev/sda1 * 1 32 257008+ fd Linux raid autodetect
/dev/sda2 33 2643 20972857+ fd Linux raid autodetect
/dev/sda3 2644 3165 4192965 fd Linux raid autodetect
/dev/sda4 3166 30401 218773170 fd Linux raid autodetect

Disk /dev/sdb: 250.0 GB, 250059350016 bytes
255 heads, 63 sectors/track, 30401 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes

Device Boot Start End Blocks Id System
/dev/sdb1 * 1 32 257008+ fd Linux raid autodetect
/dev/sdb2 33 2643 20972857+ fd Linux raid autodetect
/dev/sdb3 2644 3165 4192965 fd Linux raid autodetect
/dev/sdb4 3166 30401 218773170 fd Linux raid autodetect

Disk /dev/sdc: 250.0 GB, 250059350016 bytes
255 heads, 63 sectors/track, 30401 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes

Device Boot Start End Blocks Id System
/dev/sdc1 1 32 257008+ fd Linux raid autodetect
/dev/sdc2 33 2643 20972857+ fd Linux raid autodetect
/dev/sdc3 2644 3165 4192965 fd Linux raid autodetect
/dev/sdc4 3166 30401 218773170 fd Linux raid autodetect

- - - - -

[root@new ~]# mdadm --detail /dev/md0
/dev/md0:
Version : 00.90.03
Creation Time : Thu Mar 2 06:14:38 2006
Raid Level : raid1
Array Size : 256896 (250.88 MiB 263.06 MB)
Device Size : 256896 (250.88 MiB 263.06 MB)
Raid Devices : 2
Total Devices : 3
Preferred Minor : 0
Persistence : Superblock is persistent

Update Time : Sun Nov 19 13:21:53 2006
State : clean
Active Devices : 2
Working Devices : 3
Failed Devices : 0
Spare Devices : 1

Number Major Minor RaidDevice State
0 8 1 0 active sync /dev/sda1
1 8 17 1 active sync /dev/sdb1
2 8 33 -1 spare /dev/sdc1
UUID : 7076972a:584d42ab:e65b598c:7735026e
Events : 0.2462

[root@new ~]# mdadm --detail /dev/md1
/dev/md1:
Version : 00.90.03
Creation Time : Thu Mar 2 06:14:23 2006
Raid Level : raid1
Array Size : 20972736 (20.00 GiB 21.48 GB)
Device Size : 20972736 (20.00 GiB 21.48 GB)
Raid Devices : 2
Total Devices : 3
Preferred Minor : 1
Persistence : Superblock is persistent

Update Time : Sun Nov 19 14:09:37 2006
State : clean
Active Devices : 2
Working Devices : 3
Failed Devices : 0
Spare Devices : 1

Number Major Minor RaidDevice State
0 8 2 0 active sync /dev/sda2
1 8 18 1 active sync /dev/sdb2
2 8 34 -1 spare /dev/sdc2
UUID : 5f39f872:38f3fa5e:afdf488d:9e02b7b1
Events : 0.3343442

[root@new ~]# mdadm --detail /dev/md3
/dev/md3:
Version : 00.90.03
Creation Time : Thu Mar 2 06:14:41 2006
Raid Level : raid1
Array Size : 218773056 (208.64 GiB 224.02 GB)
Device Size : 218773056 (208.64 GiB 224.02 GB)
Raid Devices : 2
Total Devices : 3
Preferred Minor : 3
Persistence : Superblock is persistent

Update Time : Sun Nov 19 14:10:09 2006
State : clean
Active Devices : 2
Working Devices : 3
Failed Devices : 0
Spare Devices : 1

Number Major Minor RaidDevice State
0 8 4 0 active sync /dev/sda4
1 8 20 1 active sync /dev/sdb4
2 8 36 -1 spare /dev/sdc4
UUID : 9be3d5f2:2729f5d0:10af8bc1:43f01c65
Events : 0.9107996

- - - - -

I'm assuming the first drive in the RAID1 failed and that's why it won't boot. The bad drive needs to be removed from the software RAID so the spare can take over. Hopefully someone there is familiar with Linux software RAID. The server is running CentOS 4.6.
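Had I been able to reach the machine, the cleanup would look roughly like this. A sketch only, assuming /dev/sda is the failed disk (the actual failed drive was never confirmed from software):

```shell
# Mark the failed disk's partitions faulty and remove them from
# each array; mdadm then activates the hot spare and resyncs.
# Requires root and a running (even degraded) system.
mdadm /dev/md0 --fail /dev/sda1 --remove /dev/sda1
mdadm /dev/md1 --fail /dev/sda2 --remove /dev/sda2
mdadm /dev/md2 --fail /dev/sda3 --remove /dev/sda3
mdadm /dev/md3 --fail /dev/sda4 --remove /dev/sda4

# Watch the rebuild progress:
cat /proc/mdstat
```

The catch, of course, is that none of this helps when the box won't boot at all, which is why hands at the datacenter mattered here.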

If not, I'll have to have the machine shipped back to me, fix it myself, and then ship it back. More on that in continuing communications, though.

Then I got back:

We will do what we can. Please stand by.

Then about 5 minutes later I got:

From right to left lets call the drive bays R M L. I moved the HDD in Bay R to Bay L. I moved the HDD from Bay M to Bay R and the HDD from Bay L to Bay M.

The server booted.

And I replied:

Thank you. I am in now. Doing a complete rsync backup. Stopped all OpenVZ containers. I happened to have been in the middle of a backup when it died on me.

YOU GUYS ARE TOPS!

Their final reply was:

Good to hear. Let us know.

Thank you.

I checked the server. The kernel noticed the drive changes, added the spare to the arrays, and rebuilt them. Everything was back to normal except there is no longer a spare. I responded to let them know that everything was under control, so I was going to close the ticket.

I shut down all of the containers, did a complete backup, and then started them back up again. I back up each OpenVZ container separately, and I also back up the host node separately.
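The routine amounts to something like the following sketch. The container IDs and backup destination are made up for illustration; the real script differs:

```shell
#!/bin/sh
# Sketch: stop each OpenVZ container, rsync its private area
# to a backup host, then restart it. CTIDs and the destination
# hostname below are hypothetical.
BACKUP=backup.example.com

for CT in 101 102 103; do
    vzctl stop "$CT"
    rsync -aH --delete "/vz/private/$CT/" \
        "$BACKUP:/backups/containers/$CT/"
    vzctl start "$CT"
done

# Back up the host node itself, leaving out the container area
# (it was just copied) and pseudo-filesystems.
rsync -aH --delete --exclude=/vz --exclude=/proc --exclude=/sys \
    / "$BACKUP:/backups/hostnode/"
```

Stopping a container before copying it keeps the rsync'd filesystem consistent, at the cost of a short outage per container.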

I'd like to mention that we have been using Colopronto since March of 2006, and their service has been great even if it is a budget place. We pay $26.65 per month total, and that includes 8 IP addresses. Total downtime, as best as I can figure, was about 3 hours. Not bad for a software RAID failure on a remote machine on a Saturday, eh?

It would *REALLY* be nice if we had a second server there for redundancy. Maybe someday. Doing live migrations of OpenVZ containers is easy.
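With a second box, moving a container over is close to a one-liner. A sketch, with a made-up CTID and hostname:

```shell
# Live-migrate container 101 to a second OpenVZ host.
# --online keeps the container running during the move
# (requires passwordless SSH between the two host nodes).
vzmigrate --online server2.example.com 101
```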