As you’ll note, there was very little blog output ’round these parts two days ago (read: I didn’t post a single durn thing). This was entirely due to a very, very long 16 hour day which featured a large-scale LAN outage. The root cause was a bit of a heat problem we had about a week ago causing various and sundry hard drives to misbehave in highly passive-aggressive fashions, meaning reboots all-around were required to shake out the cobwebs. Only when we rebooted one of our biggest servers, it didn’t come back.
The root disk, which we have mirrored via Sun’s Volume Manager (formerly Solstice DiskSuite), was reporting that both halves of the root mirror “Need[ed] Maintenance”. Additionally, we were experiencing an error that read thusly:
```
Error: svc:/system/filesystem/root:default failed to mount /usr (see 'svcs -x' for details)
[ system/filesystem/root:default failed fatally (see 'svcs -x' for details) ]
Requesting System Maintenance Mode
Console login service(s) cannot run
```
Here’s the funny thing: we don’t break the `/usr` partition out separately. All core/root directories are mounted in the root partition, thus this sort of message was confusing, to say the least.
We called Sun for support, as the `metasync d0` recommended by Sun’s output did diddly squat, and attempts to boot from either side of the root mirror ended only in failure. I sat on the phone with Sun engineers for the better part of five hours, desperately searching for an answer. SunSolve, docs.sun.com and internal Sun engineering documentation revealed nothing, either to me or to the valiant support staff. Finally, while on hold yet again, I Googled the error we were receiving in frustration and came across a posting in which another user’s system was exhibiting extremely similar symptoms. It turns out they were missing a newline at the end of `/etc/vfstab`; by remounting the root partition as `rw` (instead of the `ro` that is the default for maintenance mode) and issuing an `echo >> /etc/vfstab`, they were able to get the system back up and booting. Beyond desperate, I emulated the behavior and, lo and behold!, the system booted. Several sets of metadevices needed syncing, but root came up cleanly and we were back in business.
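For the curious, the check-and-fix can be sketched as a tiny shell helper (the function is my own invention; the actual fix in maintenance mode was simply remounting `/` read-write and running `echo >> /etc/vfstab`):

```shell
# Hypothetical helper: append a newline to a file only if its last
# byte isn't already one. tail -c 1 prints the final byte; command
# substitution strips a trailing newline, so the result is empty
# exactly when the file already ends correctly.
ensure_trailing_newline() {
  f="$1"
  if [ -n "$(tail -c 1 "$f")" ]; then
    echo >> "$f"
  fi
}
```

In the actual maintenance shell you’d first need `mount -o remount,rw /`, since maintenance mode mounts root read-only.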
Needless to say, my Sun case engineer swore that he was going to document the case so that the next unfortunate soul who simply neglects to end a system-critical file with a stinkin’ newline can be quickly and efficiently told what to do.
Or: RedHat Enterprise Linux’s `ypbind` Is Functionally Brain-Dead
WARNING/WARNUNG/ADVERTENCIA/AVERTISSEMENT: Geeky rant follows. If you don’t give a hoot about UNIX and/or Linux, you may just want to give this post a pass. -ed.
First, a little background: like many shops with a core infrastructure consisting of UNIX/*NIX servers of varying ages and configurations, we have run our network directory services using the venerable NIS directory technology provided by Sun Microsystems and implemented on nearly every single POSIX-compliant operating system on the planet. It is fast, well-understood, well-tested and generally easy to use (if set up properly). Our UNIX systems and desktops hum merrily along 99.9% of the time, blissfully confident in NIS’s ability to keep them happy and informed of the goings-on on the network. Our network is architected so that our primary (“master”) NIS server is supplemented by a lower-powered backup NIS “slave” server so that, in the event of a failure on our main server, the “slave” can take over and keep our NIS clients happy.
However, our secondary server has been having heartaches recently – apparently, a patch from Sun that is supposed to prevent users from being able to overload the NIS server and cause it to

> […] prevent the ypserv(1M) NIS server process from answering NIS name service requests. A Denial of Service (DoS) may occur as clients currently bound to the NIS server may experience hangs or slow performance. Users may no longer be able to log in on affected NIS clients.

…is actually causing the server to die on its own. That’s right: we traded a potential DoS, instigated by users, for one that apparently triggers itself.
Now, this doesn’t cause an issue for Solaris clients; their NIS client software is intelligent enough to detect whether an NIS server process is running on a certain server and fail over to an alternate if said NIS server ever dies. RedHat’s (and perhaps other Linuxes’ – I don’t know because I haven’t tested other distros) NIS client isn’t this intelligent. Apparently, RH’s NIS setup uses `ping` to determine whether a server is still alive, which means that an NIS server process could die and, as long as the server hardware stayed active, Linux clients would continue to try to bind to a non-functional server, thus triggering a DoS on multiple systems. RH’s NIS client also uses `ping` to determine which NIS server to bind to; it functionally ignores the order set by DHCP servers and/or `/etc/yp.conf` and binds to whichever server provides the lowest latency.
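For reference, the server ordering being ignored lives in `/etc/yp.conf`; on a RedHat box it looks something like this (domain and hostnames here are hypothetical):

```
# /etc/yp.conf – NIS client configuration
# Preferred server listed first; RH's ypbind disregards this ordering
# and binds to whichever host answers ping fastest
domain ourdomain server nis-master.example.com
domain ourdomain server nis-slave.example.com
```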
All of this would be immaterial, but for one critical point: our primary server is connected to our network via a fiber-optic gigabit link, while our secondary server runs on a gigabit copper link. As it happens, copper networking equipment tends to have lower latencies than its fiber equivalents, which means that, you guessed it, our Linux clients were all persistently binding to the “slave” NIS server, regardless of its actual ability to serve up directory information. Thus, when the NIS processes would die on the “slave”, all of our stupid RedHat boxes would freeze, waiting for directory service from a non-functional box whose only claim to fame at the time was a functioning NIC.
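A smarter client would verify the NIS service itself rather than the host. A minimal sketch, assuming a POSIX shell and the standard `rpcinfo` utility (the function names are my own invention, not anything ypbind actually does):

```shell
# Ask the portmapper on $host whether the ypserv RPC service is
# actually answering. Unlike ping, which only proves the host
# responds to ICMP, rpcinfo -u makes a UDP null call to the named
# RPC program and fails if the daemon is dead.
nis_alive() {
  host="$1"
  rpcinfo -u "$host" ypserv >/dev/null 2>&1
}

# Walk a preference-ordered server list and pick the first one whose
# ypserv responds, instead of whichever host has the lowest latency.
pick_nis_server() {
  for host in "$@"; do
    if nis_alive "$host"; then
      echo "$host"
      return 0
    fi
  done
  return 1
}
```

With something like this, a dead `ypserv` on the low-latency “slave” would simply cause clients to move on to the master.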
Needless to say, we backed that patch out and, of course, everything’s happy again in Linux Land. Hooray for cascading failures!