Fun In The Sun (Microsystems Server)

Or: Red Hat Enterprise Linux’s `ypbind` Is Functionally Brain-Dead

WARNING/WARNUNG/ADVERTENCIA/AVERTISSEMENT: Geeky rant follows. If you don’t give a hoot about UNIX and/or Linux, you may just want to give this post a pass. -ed.
First, a little background: like many shops with a core infrastructure consisting of UNIX/*NIX servers of varying ages and configurations, we have run our network directory services using the venerable NIS directory technology provided by Sun Microsystems and implemented on nearly every single POSIX-compliant operating system on the planet. It is fast, well-understood, well-tested and generally easy to use (if set up properly). Our UNIX systems and desktops hum merrily along 99.9% of the time, blissfully confident in NIS’s ability to keep them happy and informed of the goings-on on the network. Our network is architected so that our primary (“master”) NIS server is supplemented by a lower-powered backup NIS “slave” server so that, in the event of a failure on our main server, the “slave” can take over and keep our NIS clients happy.
However, our secondary server has been having heartaches recently – apparently a patch from Sun that is supposed to keep users from overloading the NIS server and causing it to

[…]prevent the ypserv(1M) NIS server process from answering NIS name service requests. A Denial of Service (DoS) may occur as clients currently bound to the NIS server may experience hangs or slow performance. Users may no longer be able to log in on affected NIS clients.

…is actually causing the server to die on its own. That’s right: we traded a potential DoS, instigated by users, for one that apparently triggers itself.
Now, this doesn’t cause an issue for Solaris clients; their NIS client software is intelligent enough to detect whether an NIS server process is actually running on a given server and to fail over to an alternate if that process ever dies. Red Hat’s NIS client (and perhaps other Linuxes’ – I don’t know, because I haven’t tested other distros) isn’t this intelligent. Apparently, RH’s NIS setup uses `ping` to determine whether a server is still alive, which means that an NIS server process could die and, as long as the server hardware stayed up, Linux clients would keep trying to bind to a non-functional server, thus triggering a DoS on multiple systems. RH’s NIS client also uses `ping` to decide which NIS server to bind to; it functionally ignores the order set by DHCP servers and/or `/etc/yp.conf` and binds to whichever server answers with the lowest latency.
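For the code-minded, here is a rough Python sketch of the two selection strategies as I understand them. This is not the actual `ypbind` source from either vendor; the function names, host names and latency figures are all made up for illustration. The first function binds to whatever answered fastest, the second honors the configured order and asks whether the NIS service itself is answering.

```python
from typing import Callable, Iterable, Mapping, Optional


def pick_lowest_latency(latency_ms: Mapping[str, Optional[float]]) -> Optional[str]:
    """Hypothetical 'ping-style' selection: bind to the fastest reachable host.

    A latency of None means the host did not answer at all. Nothing here
    checks whether the NIS service on that host is actually running.
    """
    reachable = {host: ms for host, ms in latency_ms.items() if ms is not None}
    if not reachable:
        return None
    return min(reachable, key=lambda host: reachable[host])


def pick_first_serving(ordered_servers: Iterable[str],
                       service_answers: Callable[[str], bool]) -> Optional[str]:
    """Hypothetical 'service-aware' selection: honor the configured order and
    skip any host whose NIS service does not answer, even if the box is up.
    """
    for host in ordered_servers:
        if service_answers(host):
            return host
    return None


# Made-up scenario: both boxes are up, but the NIS process on the slave died.
latencies = {"nis-master": 0.40, "nis-slave": 0.25}
nis_process_alive = {"nis-master": True, "nis-slave": False}

by_latency = pick_lowest_latency(latencies)
by_service = pick_first_serving(["nis-master", "nis-slave"],
                                lambda host: nis_process_alive[host])
```

In this invented scenario, `by_latency` ends up pointing at the slave with the dead NIS process, while `by_service` stays on the master.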
All of this would be immaterial, but for one critical point: our primary server is connected to our network via a fiber optic gigabit link, while our secondary server runs on a gigabit copper link. Copper networking equipment tends to have lower latencies than its fiber equivalents, which means that, you guessed it, our Linux clients were all persistently binding to the “slave” NIS server, regardless of its actual ability to serve up directory information. Thus, when the NIS processes died on the “slave”, all of our stupid Red Hat boxes would freeze, waiting for directory service from a non-functional box whose only claim to fame at the time was a functioning NIC.
Needless to say, we backed that patch out and, of course, everything’s happy again in Linux Land. Hooray for cascading failures!