Downtime/Slowness on 10/25/2014 and 10/26/2014 lasting approximately 24 hours.

I’m going to give the short version first:

At 8:34pm CDT I start compiling a security update for the server.
At 8:57pm I get a call from my mother, who needs to go to the ER.
At 9:02pm a data center tech logs into the server in direct violation of my contract and tries to install the same update as a binary patch (which won’t work for many reasons).
At 9:08pm dumbass tech reboots the server. Much bitching from various networking daemons ensues when the server comes up.
At 10:05pm I finally become aware of the issue when someone calls me.
10:05pm 10/25 to 8:39pm 10/26: Reinstall OS, restore backups of data, and begin reinstalling applications (yes, the update fucked things up enough that I had to reinstall the OS…)

So that’s the short version. Now for the longer version:

So a little background is important here: To minimize the OS’s memory usage and maximize performance I custom compile everything. Part of those optimizations includes only compiling the parts of the OS and kernel that are actually needed and used. Well, okay, in the kernel’s case it still compiles everything, but only the parts that are actually used are statically linked into the kernel. The rest are compiled as loadable modules.

So, what happens when you try to binary update the OS and kernel? Well, in theory nothing bad. Theory being the key word. However, a very odd quirk of the hardware itself (which is well documented, I might add) causes the second network card to not initialize properly under the default kernel. This is why the reboot made things go tits up Saturday night. The second network card is the PUBLIC interface of the server, not the private interface.

So basically, with a few exceptions, everything binds to the public interface on startup. See where this is going? Yeah, everything, including Postfix, failed to come up because it couldn’t bind to the interface that should have been there but wasn’t. Now, if you’ll remember from the list above, I was called away by a medical emergency involving my mom and didn’t see the server go down. It should be obvious, but the server was the least of my issues while at the ER, so I wasn’t trying to do anything with it and never saw it was down until someone finally called me.
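For anyone wondering what "couldn’t bind" looks like from a daemon’s point of view, here is a minimal Python sketch. It is not what Postfix or any other daemon actually runs; the function name `try_bind` is made up for illustration, and 192.0.2.1 is just a documentation-range address assumed not to be configured on the machine running it.

```python
import errno
import socket


def try_bind(address: str, port: int = 0) -> str:
    """Try to bind a TCP socket to address:port and describe what happened.

    Port 0 asks the OS to pick any free port, so only the address matters here.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        try:
            sock.bind((address, port))
        except OSError as exc:
            if exc.errno == errno.EADDRNOTAVAIL:
                # The address is not assigned to any interface that is up,
                # which is the situation a missing network card creates.
                return "address not available"
            return f"bind failed: {exc.strerror}"
        return "bound"


if __name__ == "__main__":
    # Assumption: 192.0.2.1 (TEST-NET-1) is not configured on this host.
    for addr in ("0.0.0.0", "192.0.2.1"):
        print(addr, "->", try_bind(addr))
```

A daemon configured to listen on one specific address typically treats that error as fatal and exits, which would fit with nearly everything failing to come up at once.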

So, let’s start with what I found when I attempted to log in to the server using the VPN and the private interface:

The SSH service failed to start. Thus I could not log in that way.

There is a Serial-over-LAN port via the IPMI interface, which is how I got a working shell. It did not take me long to figure out just how fucked the server was. NOTHING that should have been running was. I checked the logs, saw what the idiot from the data center had done, and figured out why I couldn’t get anything to come back up properly.

So first thought: Recompile and fix the kernel. Nope, in the process the dumbass managed to set the securelevel to 2, meaning I couldn’t reinstall the kernel, and since I couldn’t use the other method of getting a “local” console I was left with no way to reboot into single user mode and compile/install that way (which is mandatory to install the kernel at securelevel 2).
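The reason a raised securelevel is such a trap is that on a running FreeBSD system it can be raised but not lowered. Here is a toy Python model of just that rule, nothing more. The class name, method names, and the boot default of -1 are assumptions for illustration; the real value lives in the `kern.securelevel` sysctl and is normally raised by the startup scripts.

```python
class SecureLevelModel:
    """Toy model of the 'raise only' rule for the kernel securelevel."""

    # Assumption: the level the system has before startup scripts raise it.
    BOOT_DEFAULT = -1

    def __init__(self) -> None:
        self.level = self.BOOT_DEFAULT

    def request(self, new_level: int) -> bool:
        """Ask for a new level; return True if accepted, False if refused."""
        if new_level < self.level:
            # A running system refuses to lower the level, even for root.
            return False
        self.level = new_level
        return True

    def reboot_single_user(self) -> None:
        """Model a reboot into single user mode, where nothing has raised it yet."""
        self.level = self.BOOT_DEFAULT


if __name__ == "__main__":
    system = SecureLevelModel()
    print(system.request(2))   # raising is allowed
    print(system.request(0))   # lowering is refused
    system.reboot_single_user()
    print(system.level)        # back to the boot default
```

So without a console that can drive a reboot into single user mode, there is no way back down from level 2 in this model, which is the corner described above.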

Now, the good news is the restarded keyboard monkey did this right after one of the nightly backups so I was able to use the data center’s provisioning tools to reinstall the OS remotely without risking losing any data. So that is where things sat as I was pulling back into my driveway about 2:40am on Sunday morning.

I spent the next several hours reinstalling software and restoring data from backups. This, and having to recompile all the software, is what caused the slowness that continued for several hours after basic services were restored.

Oh, and the best part? They wanted to charge me for “Helping” with the security patch…

You can probably guess what I told them my opinion of that was. I actually spoke to the CEO about this and he’s assured me the situation will be dealt with. Unfortunately, past experience tells me otherwise, since this has happened multiple times before.

Oh, and why wasn’t he supposed to do what he did? I specifically pay extra to have an unmanaged server. Techs are never supposed to log in unless I specifically request it. Hell, they’re not supposed to lay a finger on the physical box without getting my authorization first.

Re: Downtime/Slowness on 10/25/2014 and 10/26/2014 lasting approximately 24 hours.

That seems like typical data center stupidity. They forget that contracts are binding both ways.

I would suggest that, since they don’t seem to honor contracts, you change the password and refuse to give it to them.

They have absolutely no business in logging onto an unmanaged server. If management is too stupid to actually tell their operators which server is unmanaged then a simple sticker in as large a font as possible and with a red background should be enough of an indication to even the dumbest that they need to keep their hands off.

This sounds like the same sort of dumbness as when I was working for an ISP and the bloody cleaner unplugged the servers so that they could plug their vacuum cleaner in.

Re: Downtime/Slowness on 10/25/2014 and 10/26/2014 lasting approximately 24 hours.

I’m with Kita on this one. Almost the whole post went over my head. Guess it’s good I just have my computer and don’t get to mess around with servers.

Re: Downtime/Slowness on 10/25/2014 and 10/26/2014 lasting approximately 24 hours.

I run servers, but I host at home. Currently you can’t get to them from outside my LAN, but I really like having them.

Re: Downtime/Slowness on 10/25/2014 and 10/26/2014 lasting approximately 24 hours.

Does self-compiling really make that much of a difference? I’ve worked on distros before (not telling which ones, as you can find my real name easily), and the optimizations are usually selected on a per-package basis, so you really shouldn’t need to resort to that, except maybe for the kernel. As for binary patches, my approach when doing a self-compiled package is to use the distro’s build system to make sure that the update system ignores any request to update that package, unless of course it’s the update that comes from me.

Re: Downtime/Slowness on 10/25/2014 and 10/26/2014 lasting approximately 24 hours.

I’m going to assume you have a Linux and not *BSD background from what you said.

Big differences in how they do things regarding the OS itself. And the hardware in this server has known issues that make running FreeBSD on it a far more sane option than any Linux distro :slight_smile:

The binary packages for third party software (and the binary updates for the OS itself, which, unlike Linux, isn’t just “kernel + a bunch of packages”) are compiled with “safe” defaults. Those defaults? They tend to result in random reboots under real workloads on the hardware in this server. It’s a known issue with currently supported versions.

The “safe” defaults for most things, especially PHP, MySQL, Apache, and Postfix? Not so good. By default they enable shit that’s not needed, and more importantly DO NOT enable stuff that is. So yeah, I have to compile them from scratch, and to ensure the updates apply correctly without downtime it also means I have to build the build dependencies as well as the build system that ensures everything is in a consistent state. I can’t use the official packages even if the ones I’m building are an exact match.

For the base OS itself, besides what I already mentioned there are several reasons I compile from sources ASAP.

One is that the GENERIC FreeBSD kernel has horrible support for some of the hardware if the modules are dynamically loaded. For those I compile them in. Same with the software firewall. If it were dynamically loaded, the server would be inaccessible until the end of the boot process instead of being accessible as soon as the daemons start. When dynamic, the firewall is one of the last things configured by the networking scripts, but the kernel module loads at the start of the boot process and defaults to denying everything when it loads. Statically compiled in, you can change the default to accept and have it not interfere with daemons starting. Seriously, you’d be amazed how many daemons won’t start if the firewall is defaulting to deny connections :slight_smile:
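The ordering problem is easier to see as a tiny simulation. This is only a sketch of the reasoning above, not of any real boot script: the step names, the function `daemons_start_blocked`, and the idea that the final ruleset lets the services through are all assumptions for illustration.

```python
from typing import List


def daemons_start_blocked(boot_order: List[str], default_policy: str) -> bool:
    """Return True if the daemons start while the firewall is denying everything.

    boot_order is a list of hypothetical step names:
      "load_firewall"  - firewall code becomes active with its default policy
      "start_daemons"  - network daemons are launched
      "apply_rules"    - the real ruleset is configured
    """
    policy = "accept"  # no firewall active yet, so nothing is filtered
    for step in boot_order:
        if step == "load_firewall":
            policy = default_policy
        elif step == "apply_rules":
            # Assumption: the configured ruleset lets the services through.
            policy = "accept"
        elif step == "start_daemons":
            return policy == "deny"
    return False


if __name__ == "__main__":
    # Firewall active early, rules configured last, as described above.
    boot = ["load_firewall", "start_daemons", "apply_rules"]
    print(daemons_start_blocked(boot, "deny"))    # True: daemons hit a deny-all wall
    print(daemons_start_blocked(boot, "accept"))  # False: default changed to accept
```

The boot order is the same in both calls; only the default policy differs, which is the knob that compiling the firewall in makes available.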

There’s also the fact that the IPMI card driver has to be compiled in to get console access at all on this thing (a flaw in the design of the board itself).

Then there’s the fact that to use two of the Mandatory Access Control modules you have to compile them in or they won’t actually do anything. I’m not stupid enough to run most of the daemons without restricting them both with standard file permissions and MAC labels.

tl;dr version: I compile shit out of necessity, not because I particularly enjoy it.

Re: Downtime/Slowness on 10/25/2014 and 10/26/2014 lasting approximately 24 hours.

Oh, and just to be clear, the issue with the IPMI card is related to the console driver. You can see the console while the server is booting, but as soon as syscons (the default console driver in FreeBSD) loads, it disables the VNC-based console, and the serial-over-LAN only works if the settings are compiled into the kernel, and THAT also requires building the OS from source to get it to work correctly…

Re: Downtime/Slowness on 10/25/2014 and 10/26/2014 lasting approximately 24 hours.

What kind of server is this, a DEC Alpha or something? :slight_smile: That has got to be about the most screwy hardware situation I’ve heard of. Not the screwiest, but certainly one of the more out-there ones. And yes, I come from a Linux background as a developer. I’ve run a few FreeBSD boxes, but mostly as routers, where you just set ’em and forget ’em on a good day. Not that I’m doing anything at the moment. :frowning: My last job was a short-term contract. But I’m going off on a tangent there.