Network Gremlins

I don’t know if it’s a universal I.T. thing or not, but at my place of employment we sysadmins have taken to blaming any freak accident/unexplainable computer phenomenon/Series of Unfortunate Events on “gremlins”. A person couldn’t log on five minutes ago and all of a sudden, they can? Gremlins. USB sticks now mounting when, previously, they weren’t? Gremlins. You get the picture.
Well, last Friday and today have been some of the most gremlin-filled days in recent memory, bar none. We’ve all tried to be sanguine about the whole affair and just shrug our shoulders and mutter “Gremlins!”, but that only takes one so far. Perhaps we bought a cursed Cisco box with a time-delayed Curse Activation Feature without knowing it.
I came in to work on Friday to discover that no one could receive any mail, a condition that was causing no little consternation amongst the throngs shackled to their cubes. After a careful bit of investigation by the team lead and myself, we determined What Apparently Went Wrong:

  1. We back up all of our DNS, DHCP and NIS server maps using CVS in order to keep ourselves from getting into a bad state with no easy way to back out damaging configuration changes (a rough sketch of that setup follows the list). Somehow, our master DNS configuration file was partially overwritten so that any reference to a shared key (I’ll get to that) was removed.
  2. Our DNS tables are generated (mostly) on-the-fly by our DHCP server, which relieves us of a great deal of administrative burden. However, one can’t just have DHCP servers overwriting our DNS maps willy-nilly, so we require access to a shared key that both DHCP and DNS can trust. That way, clients that are authorized in our NIS setup can request an IP from the DHCP server, have one assigned, and have the DNS server updated to match (there’s a rough sketch of the key setup after this list).
  3. The DHCP server must be restarted/reloaded in order to read new ethers addresses from the NIS tables, which we accomplish thrice-hourly with a simple cron job (also sketched after the list).
  4. Since the reference to the shared key was overwritten, the DHCP server was no longer able to force DNS updates, meaning that individual hosts began dropping from the DNS radar like flies.
  5. At around the same time that DNS began to fail, our primary mail server had a minor NIS hiccup that caused it to fail over to our secondary NIS server.
  6. All email addresses are fed through the NIS aliases map in order to tell the mail server who the intended recipient[s] are.
  7. Our secondary NIS server had recently been replaced with a newer, beefier box that was receiving all NIS map updates from the master server except aliases, for reasons not entirely clear at this time, although much finger-pointing was aimed in the direction of a faulty Makefile (see the last sketch after this list).
  8. Our mail server, unable to determine where to deliver mail, threw up its hands, spewed a whole bunch of “aliases: no such map” messages into the syslogs and contentedly queued up mail for the better part of a morning.
  9. All of which translated into: no mail for anyone until we figured this out.
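
Item 1 deserves a quick illustration, since it’s the safety net that (mostly) worked: the config trees live in CVS working copies and get committed on a schedule, so a damaged file can be diffed against, and rolled back to, a known-good revision. This is only a rough sketch; the paths and the revision number are made up.

    # snapshot the config trees (assumed here to be CVS working copies) on a
    # schedule, e.g. from a nightly cron job; paths are illustrative
    cd /var/named && cvs -q commit -m "automated snapshot"
    cd /var/yp/src && cvs -q commit -m "automated snapshot"

    # when something gets mangled, see what changed and roll the file back
    cvs diff named.conf
    cvs update -p -r 1.41 named.conf > named.conf    # 1.41 = some known-good revision (example)
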
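The shared key business in item 2 looks roughly like the sketch below. This is a minimal, hypothetical example rather than our actual config (the key name, secret, zone and addresses are all made up), but the general shape is the idea: one TSIG key declared on both the BIND side and the ISC dhcpd side, so that dhcpd can sign the dynamic updates it sends.

    // named.conf (BIND side) -- key name, secret and zone are hypothetical
    key "dhcp-update-key" {
        algorithm hmac-md5;
        secret "bm90LWEtcmVhbC1zZWNyZXQ=";        // not a real secret
    };

    zone "example.internal" {
        type master;
        file "db.example.internal";
        allow-update { key "dhcp-update-key"; };  // only signed updates accepted
    };

    # dhcpd.conf (DHCP side) -- the same key, so dhcpd can sign its updates
    ddns-update-style interim;

    key dhcp-update-key {
        algorithm hmac-md5;        # older dhcpds spell this HMAC-MD5.SIG-ALG.REG.INT
        secret bm90LWEtcmVhbC1zZWNyZXQ=;
    }

    zone example.internal. {
        primary 10.0.0.53;         # the master DNS server (made-up address)
        key dhcp-update-key;
    }

With the key statements gone from named.conf, the zone stops accepting the signed updates coming from dhcpd, which is exactly how hosts end up quietly falling out of DNS.
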
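The thrice-hourly reload in item 3 is nothing fancier than a root crontab entry along these lines (the schedule and init-script path here are illustrative, not lifted from our box):

    # reload dhcpd every 20 minutes so it picks up new ethers entries from NIS
    0,20,40 * * * * /etc/init.d/dhcpd restart > /dev/null 2>&1
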
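And for the vanishing aliases map in item 7, the quickest sanity check is to ask the master and the new slave for the map directly, then rebuild and push from the master. Something along these lines, with made-up hostnames:

    # compare the aliases map as served by the master and by the new slave
    ypcat -h nis-master mail.aliases | wc -l
    ypcat -h nis-slave2 mail.aliases | wc -l    # "no such map" here == our problem

    # on the master: rebuild the maps and push them out to the slaves
    cd /var/yp && make

If that push never happens for one particular map, /var/yp/Makefile (which decides what gets built and pushed) is the natural place to point fingers.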

This email fun was followed by a raging wave of thunderstorms that swept through the area, knocking out building power first (our compute center is UPS’d and generator-backed, so no worries there) and then knocking a transformer and some Verizon telecom equipment offline, effectively nuking our external link and a sizable portion of the surrounding area. That meant no web access to end the day, followed by some incorrectly-configured Macs sitting on admins’ desks giving us heartburn for a goodly portion of the day as well. Wheee!
While Friday was fun, I came in to work fully expecting an easy day, as HGCDs (High Gremlin Count Days) are normally few and far between. However, ’twas not to be. I arrived to find my voicemail blinking and my boss standing in my office saying “Our web is down”. After running this statement through my Management-to-IT filters, I realized he was saying that no one could get to any external websites. The team lead and I poked around a bit before realizing that there is a bug in the newest version of RedHat Enterprise (the version our web proxy just happens to run) that ignores the specified default route when run on machines with multiple NICs, such as proxy servers. The bug was triggered when our proxy, sensing a Disturbance In The DNS Force on Friday, ran dhclient and began ignoring the default route, leaving our poor proxy with no idea how to get to the content people were requesting of it. We manually added the default route and things once again moved to Status: Hunky Dory. Problem solved, at least for now.
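For reference, the fix itself was nothing exotic: put the default route back by hand, and keep a stray dhclient run from stomping on it again by pinning the routers option in dhclient.conf. Roughly as follows; the gateway address is made up, and the exact dhclient.conf path varies by release.

    # put the default route back by hand
    route add default gw 10.0.0.1

    # and in dhclient.conf: pin the routers option so dhclient can't
    # replace our default route with whatever it feels like
    supersede routers 10.0.0.1;
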
As for me, I’m avoiding ladders, black cats and mirrors for the rest of the week, just to be safe.
