I’m going to give the short version first:
At 8:34pm CDT I start compiling a security update for the server.
At 8:57pm I get a call from my mother, who needs to go to the ER.
At 9:02pm a data center tech logs into the server in direct violation of my contract and tries to install the same update as a binary patch (which won’t work for many reasons).
At 9:08pm the dumbass tech reboots the server. Much bitching from various networking daemons ensues when the server comes up.
At 10:05pm I finally become aware of the issue when someone calls me.
From 10:05pm 10/25 to 8:39pm 10/26: reinstall the OS, restore backups of data, and begin reinstalling applications (yes, the update fucked things up enough that I had to reinstall the OS…)
So that’s the short version. Now for the longer version:
So a little background is important here: To minimize the OS’s memory usage and maximize performance, I custom compile everything. Part of that optimization is only compiling the parts of the OS and kernel that are actually needed and used. Well, okay, in the kernel’s case it still compiles everything, but only the parts that are actually used are statically linked into the kernel. The rest are compiled as loadable modules.
So, what happens when you try to binary update the OS and kernel? Well, in theory, nothing bad. Theory being the key word. A binary update swaps my custom kernel for the default one, and a very odd quirk of the hardware itself (which is well documented, I might add) causes the second network card to not initialize properly under the default kernel. This is why the reboot made things go tits up Saturday night. The second network card is the PUBLIC interface of the server, not the private interface.
So basically, with a few exceptions, everything binds to the public interface on startup. See where this is going? Yeah, everything, including Postfix, failed to come up because it couldn’t bind to the interface that should have been there but wasn’t. Now, if you’ll remember from the list above, I was called away with a medical emergency involving my mom and didn’t see the server go down. It should be obvious, but the server was the least of my issues while at the ER, so I wasn’t trying to do anything with it and never saw it was down until someone finally called me.
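For anyone curious what that failure looks like from a daemon's point of view, here's a minimal Python sketch. To be clear, this is not how Postfix or any of my services are actually implemented; it just makes the same underlying bind call. The helper name `can_bind` is made up for illustration, and 192.0.2.1 is a documentation-only address standing in for the public IP on the card that never initialized.

```python
import errno
import socket


def can_bind(address: str, port: int = 0) -> bool:
    """Return True if a TCP socket can bind to `address` on this host.

    Port 0 asks the OS for any free port, so the check needs no privileges.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        try:
            sock.bind((address, port))
        except OSError as exc:
            # EADDRNOTAVAIL: no configured interface owns this address,
            # which is the situation when the NIC fails to initialize.
            if exc.errno == errno.EADDRNOTAVAIL:
                return False
            raise  # any other error is a different problem; don't hide it
        return True


if __name__ == "__main__":
    # The wildcard address needs no particular interface; the stand-in
    # "public" address is (by assumption) not configured on this machine.
    for addr in ("0.0.0.0", "192.0.2.1"):
        print(addr, "bindable" if can_bind(addr) else "NOT bindable")
```

A daemon configured to listen on one specific address has nothing to fall back to when that address is missing, so it just exits on startup. That's why one dead network card took out nearly everything at once.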
So, let’s start with what I found when I attempted to log in to the server using the VPN and the private interface:
The SSH service failed to start. Thus I could not log in that way.
There is a Serial-Over-LAN port via the IPMI interface, which is how I got a working shell. It did not take me long to figure out just how fucked the server was. NOTHING that should have been running was. I checked the logs, saw what the idiot from the data center had done, and figured out why I couldn’t get anything to come back up properly.
So, first thought: recompile and fix the kernel. Nope. In the process the dumbass managed to set the secure level to 2, meaning I couldn’t reinstall the kernel, and since I couldn’t use the other method of getting a “local” console I was left with no way to reboot into single user mode and compile/install that way (which is mandatory to install the kernel in secure level 2).
Now, the good news is the restarded keyboard monkey did this right after one of the nightly backups so I was able to use the data center’s provisioning tools to reinstall the OS remotely without risking losing any data. So that is where things sat as I was pulling back into my driveway about 2:40am on Sunday morning.
I spent the next several hours reinstalling software and restoring data from backups. This, and having to recompile all the software, is what caused the slowness that continued for several hours after basic services were restored.
Oh, and the best part? They wanted to charge me for “helping” with the security patch…
You can probably guess what I told them my opinion of that was. I actually spoke to the CEO about this and he’s assured me the situation will be dealt with. Unfortunately, past experience tells me otherwise, since this has happened multiple times before.
Oh, and why wasn’t he supposed to do what he did? I specifically pay extra to have an unmanaged server. Techs are never supposed to log in unless I specifically request it. Hell, they’re not supposed to lay a finger on the physical box without getting my authorization first.