Major Outage at Work
Nothing quite like waking up in the morning and getting a notification that everything is down, and the kitchen sink is clogged too.
So I woke up yesterday morning and found a single innocent little message on my personal email account telling me that my email server was no longer receiving emails. Great, Internet down again. The interesting thing is I got the message literally within minutes of me waking up.
First step: let’s see if I can VPN in. Usually this fails right away and I know the Internet is down, but this time it doesn’t work that way! I VPN in and get the login prompt, which means the Internet is up and the firewall is responding, but I’m unable to authenticate, which means my RADIUS server (which is installed on my domain controller) isn’t responding. That’s a virtual server, and since they are typically very reliable I immediately begin thinking the worst.
Luckily we just rolled out LogMeIn on all of the servers, so I jump over there and try to log in. All my HQ site servers are down except for the physical servers. Yikes. I log into one of them and can ping my network equipment OK, but none of the virtual servers. Immediately I’m thinking my SAN has just crashed, because what are the odds that 3 separate VM hosts failed at the same time? None!
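For anyone who likes their panic written down, the reasoning above can be sketched as a small decision function. This is only an illustration of the thought process, not a tool I actually ran; the `Observations` fields and the `triage` name are made up for the example.

```python
from dataclasses import dataclass


@dataclass
class Observations:
    """What was seen from the outside (all field names are hypothetical)."""
    vpn_prompt: bool           # VPN endpoint showed a login prompt
    vpn_auth: bool             # RADIUS-backed login succeeded
    physical_servers_up: bool  # physical servers reachable remotely
    network_gear_pings: bool   # switches/firewall answer ping
    vms_ping: bool             # at least one virtual server answers ping


def triage(o: Observations) -> str:
    """Return a rough first guess at where the fault is."""
    if not o.vpn_prompt:
        # No login prompt at all: the link or the firewall is the problem.
        return "internet or firewall down"
    if o.vpn_auth:
        return "no fault seen from the VPN path"
    # Prompt but no authentication: RADIUS (on the domain controller) is not answering.
    if o.physical_servers_up and o.network_gear_pings and not o.vms_ping:
        # Every VM gone at once while the physical boxes are fine points at
        # something the hosts share, such as storage or the path to it.
        return "all VMs down: suspect shared storage or its switch"
    return "RADIUS or domain controller not responding"


if __name__ == "__main__":
    morning = Observations(True, False, True, True, False)
    print(triage(morning))
```

The point of the last branch is the same one I made in my head: three hosts rarely die together, so look at what they have in common.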
Quick shower, brush the teeth, drive into work. When I get in I am, of course, immediately set upon by rabid users eager to tell me there are problems with their PCs, the Internet is down, there’s something wrong with the network, etc. Oh joy. After running away from them in terror I hide in the computer room and start looking around.
VM hosts are all powered on. That’s a good sign. SAN is powered on, another good sign. Check the switch connecting the VM hosts to the SAN? Nothing. No blinky lights. Unplug the switch, plug it back in, and still nothing. I know, I know: where are my redundant switches? But my IT department has been running on a severely tight budget for a long time, there just wasn’t money for that kind of thing, and it was accepted that we might have downtime because of it.
Luckily, I had a 16-port Netgear in the training room, so I stole that and hooked it up in place of the dead switch. Rebooted, and everything came back online!
Then vSphere’s High Availability kicked in and started all of my VMs up on the first host to check in, so the poor server got hammered. I try to log into vCenter and can’t; it says the connection is failing. I remote into the server and find the service stopped, so I try to start it, and that fails too with a file not found! Yikes, that sounds ugly. I try to do a repair install on vCenter, but that’s not an option, so just for the heck of it I reboot the vCenter server again. Luckily everything comes online after that. Here’s where having vMotion is nice (even though I only use it once or twice a year), and I’m able to move my VMs to spread the load across the 3 VM hosts I have.
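I did the rebalancing by hand with vMotion, but the idea of spreading the load can be sketched as a simple greedy placement: biggest VM first, always onto the least-loaded host. This is a toy illustration, not how vSphere decides anything; the `spread` function, the VM names, and the memory sizes below are all invented for the example.

```python
import heapq


def spread(vm_sizes, hosts):
    """Greedy placement: largest VM first, onto the least-loaded host.

    vm_sizes maps a VM name to a size (say, GB of RAM); hosts is a list of
    host names. Returns a dict of host name -> list of VM names.
    """
    heap = [(0, h) for h in hosts]  # (current load, host name)
    heapq.heapify(heap)
    placement = {h: [] for h in hosts}
    # Sort by size descending, then by name so the result is deterministic.
    for vm, size in sorted(vm_sizes.items(), key=lambda kv: (-kv[1], kv[0])):
        load, host = heapq.heappop(heap)
        placement[host].append(vm)
        heapq.heappush(heap, (load + size, host))
    return placement


if __name__ == "__main__":
    # Made-up sizes, purely illustrative.
    vms = {"dc": 4, "exchange": 16, "vcenter": 8, "file": 8, "web": 4}
    for host, names in spread(vms, ["host1", "host2", "host3"]).items():
        print(host, names)
```

With only a handful of VMs and 3 hosts, eyeballing it works just as well, which is exactly what I did.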
The only holdout was my Exchange 2010 server, which was failing to start the Information Store. The error message indicated a time issue, but it was synced up with the domain controller OK, so I just threw another reboot at it and it came up without issue.
Whew, what a morning!
Of course, I have the little problem that my entire core network is running off of a $50 Netgear consumer-level switch, but who’s worried, right? On the plus side, I get to purchase a Meraki MS22 switch, which runs under their awesome cloud-based management solution. I’ve been slowly switching my wireless network over to Meraki, and I love the way you can configure things globally with their setup. I’m looking forward to expanding the Meraki family of devices and seeing what the switch management is like.
Now–what else can I break today?!