The Surly Admin

Father, husband, IT Pro, cancer survivor

Major Outage at Work

Nothing quite like waking up in the morning and getting a notification that everything is down, and the kitchen sink is clogged too.

So I woke up yesterday morning and found a single innocent little message on my personal email account telling me that my email server was no longer receiving emails. Great, Internet down again. The interesting thing is that the message arrived literally within minutes of my waking up.

First step, let’s see if I can VPN in. Usually this fails right away and I know the Internet is down, but this time it doesn’t work that way! I VPN in and get the login prompt, which means the Internet is up and the firewall is responding, but I’m unable to authenticate, which means my RADIUS server (which is installed on my domain controller) isn’t responding. That’s a virtual server, and since they are typically very reliable I immediately begin thinking the worst.

Luckily we just rolled out LogMeIn on all of the servers, so I jump over there and try to log in. All my HQ site servers are down except for the physical servers. Yikes. I log into one of them and I can ping my network equipment OK, but none of the virtual servers. Immediately I’m thinking my SAN has just crashed, because what are the odds that 3 separate VM hosts failed at the same time? None!
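The reasoning so far boils down to a small decision function. This is only an illustrative sketch in Python: the field names and verdict strings are assumptions made up for the example, not anything taken from the actual environment.

```python
from dataclasses import dataclass


@dataclass
class Probes:
    """Hypothetical yes/no observations gathered by hand during triage."""
    vpn_prompt: bool     # firewall answered with a VPN login prompt
    vpn_auth: bool       # RADIUS accepted the credentials
    physical_up: bool    # physical servers reachable through remote access
    network_pings: bool  # network equipment answers ping from a physical server
    vms_ping: bool       # virtual servers answer ping


def likely_cause(p: Probes) -> str:
    """Return a rough first guess, following the reasoning in the post."""
    if not p.vpn_prompt:
        return "internet or firewall down"
    if p.vpn_auth and p.vms_ping:
        return "nothing obviously wrong"
    if p.physical_up and p.network_pings and not p.vms_ping:
        # Every host failing at once is unlikely, so suspect what they share.
        return "shared storage path (SAN or its switch)"
    if not p.vpn_auth:
        return "RADIUS / domain controller not responding"
    return "unknown, keep digging"


# The observations described above.
morning = Probes(vpn_prompt=True, vpn_auth=False, physical_up=True,
                 network_pings=True, vms_ping=False)
print(likely_cause(morning))  # prints: shared storage path (SAN or its switch)
```

The order of the checks matters: the failed RADIUS login is treated as a symptom rather than the cause whenever every virtual server is unreachable at once.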

Quick shower, brush the teeth, drive into work. When I get in I, of course, immediately get set upon by rabid users eager to tell me there are problems with their PCs, the Internet is down, there’s something wrong with the network, etc. Oh joy. After running away from them in terror I hide in the computer room and start looking around.

VM hosts are all powered on. That’s a good sign. SAN is powered on, another good sign. Check the switch connecting the VM hosts to the SAN? Nothing. No blinky lights. Unplug the switch, plug it back in, and still nothing. I know, I know, where are my redundant switches? But my IT department has been running on a severely tight budget for a long time, there just wasn’t money for that kind of thing, and it was accepted that we might have downtime because of it.

Luckily, I had a 16-port Netgear in the training room, so I stole that and hooked it up in place of the dead switch. Rebooted, and everything came back online!

Then vSphere’s High Availability kicked in and started all of my VMs up on the first host to check in, so the poor server got hammered. I try to log into vCenter and can’t; it says the connection is failing. I remote into the server and find the service stopped, so I try to start it, and that fails too with a file not found! Yikes, that sounds ugly. I try to do a repair install on vCenter, but that’s not an option, so just for the heck of it I reboot the vCenter server again. Luckily everything comes online after that. Here’s where having vMotion is nice, even though I only use it once or twice a year, and I’m able to move my VMs to spread the load across the 3 VM hosts I have.
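For anyone curious what "spread the load" means in practice, here is a rough Python sketch of the idea: place the biggest VM first, always onto the host carrying the least so far. The VM names, memory sizes, and host names are all made up for illustration. It is a generic greedy heuristic, not what vSphere does internally, and it only produces a plan; the actual moves would still be done with vMotion.

```python
import heapq


def spread_vms(vm_memory_gb: dict[str, int], hosts: list[str]) -> dict[str, list[str]]:
    """Greedy placement: biggest VM first, always onto the least-loaded host."""
    heap = [(0, host) for host in hosts]  # (memory placed so far, host name)
    heapq.heapify(heap)
    placement = {host: [] for host in hosts}
    # Sort by memory descending, then by name so ties are deterministic.
    for vm, mem in sorted(vm_memory_gb.items(), key=lambda kv: (-kv[1], kv[0])):
        load, host = heapq.heappop(heap)
        placement[host].append(vm)
        heapq.heappush(heap, (load + mem, host))
    return placement


# Made-up VM names and memory sizes (GB), purely for illustration.
vms = {"dc01": 4, "exchange": 16, "vcenter": 8,
       "file01": 8, "sql01": 12, "print01": 2}
plan = spread_vms(vms, ["host1", "host2", "host3"])
for host, names in plan.items():
    print(host, names)
```

With these invented numbers the three hosts end up holding 18, 16, and 16 GB, which is about as even as six VMs of those sizes can get.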

The only holdout was my Exchange 2010 server, which was failing to start the Information Store. The error message indicated a time issue, but the server was synced up with the domain controller OK, so I just threw another reboot at it and it came up without issue.

Whew, what a morning!

Of course, I have the little problem that my entire core network is running off of a $50 Netgear consumer-level switch, but who’s worried, right? On the plus side, I get to purchase a Meraki MS22 switch, which runs under their awesome cloud-based management solution. I’ve been slowly switching my wireless network over to it and I love the way you can configure things globally with their setup. I’m looking forward to expanding the Meraki family of devices and seeing what the switch management is like.


Now, what else can I break today?!


March 14, 2013 | Posted in Technical

3 Comments

  1. Great blog article Martin !!

    Bob

    Comment by Bob | March 14, 2013

    • Thanks Bob! Or just happy it didn’t happen to you?! 😉

      Comment by Martin9700 | March 14, 2013

  2. […] of the fallouts of the outage I had this week (Read Here) was that my backups for that night were stopped.  Since Veeams creates snapshots to do it’s […]

    Pingback by Report All Snapshots in Your VMware Environment « The Surly Admin | March 18, 2013

