8.6 Coping with Disaster

When disaster strikes, it really helps to know what to do. Knowing to duck under a sturdy table or desk during an earthquake can save you from being pinned under a toppling monitor. Knowing how to turn off your gas can save your house from conflagration.

Likewise, knowing what to do in a network disaster (or even just a minor mishap) can help you keep your network running. Living out in California, as we do, we have some experience and some suggestions.

8.6.1 Short Outages (Hours)

If your network is cut off from the outside world (whether "the outside world" is the rest of the Internet or the rest of your company), your name servers may start to have trouble resolving names. For example, if your domain, corp.acme.com, is cut off from the rest of the Acme Internet, you may not have access to your parent (acme.com) name servers, or to the root name servers.

You'd think this wouldn't impact communication between hosts in your local domain, but it can. For example, if you type:

% telnet selma.corp.acme.com

on a host running an older version of the resolver, the first domain name the resolver looks up will be selma.corp.acme.com.corp.acme.com (assuming your host is using the default search list - remember this from Chapter 6). The local domain name server, if it's authoritative for corp.acme.com, can tell that's not a kosher domain name. The following lookup, however, is for selma.corp.acme.com.acme.com. This prospective domain name is no longer in the corp.acme.com domain, so the query is sent to the acme.com name servers. Or rather your local name server tries to send the query there, and keeps retransmitting until it times out.

You can avoid this problem by making sure the first domain name the resolver looks up is the right one. Instead of typing:

% telnet selma.corp.acme.com

typing:

% telnet selma

or:

% telnet selma.corp.acme.com.

(note the trailing dot) will result in a lookup of selma.corp.acme.com first.

Note that BIND 4.9 and later resolvers don't have this problem, at least not by default. 4.9 and newer resolvers check the domain name as is first, as long as the name has at least one dot in it. So, if you tried:

% telnet selma.corp.acme.com

even without the trailing dot, the first name looked up would be selma.corp.acme.com.
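The difference in lookup order can be sketched in a few lines of Python. This is a simplified model, not the resolver's actual code: the function candidate_names and its as_is_first flag are our own inventions for illustration, the one-element search list for the newer resolver is an assumption, and real resolvers differ in details (for instance, in how they treat names with no dots at all).

```python
def candidate_names(name, search_list, as_is_first):
    """Return the domain names a resolver would try, in order (simplified)."""
    if name.endswith("."):
        # A trailing dot marks the name as fully qualified: no search list.
        return [name]
    as_is = name + "."
    searched = [f"{name}.{domain}." for domain in search_list]
    if as_is_first and "." in name:
        # Newer behavior: a name containing a dot is tried as is first.
        return [as_is] + searched
    # Older behavior: work through the search list before trying the name as is.
    return searched + [as_is]


old_search = ["corp.acme.com", "acme.com"]  # older default search list
new_search = ["corp.acme.com"]              # assumed list for a newer resolver

print(candidate_names("selma.corp.acme.com", old_search, as_is_first=False))
print(candidate_names("selma.corp.acme.com", new_search, as_is_first=True))
print(candidate_names("selma.corp.acme.com.", old_search, as_is_first=False))
```

In the first call the correct name comes last, after a name that only the acme.com name servers can answer. In the other two it comes first.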

If you are stuck running a 4.8.3 BIND or older resolver, you can avoid querying off-site name servers by taking advantage of the definable search list. You can use the search directive to define a search list that doesn't include your parent zone's domain name. For example, to work around the problem corp.acme.com is having, you could temporarily set your hosts' search lists to just:

search corp.acme.com

Now, when a user types:

% telnet selma.corp.acme.com

the resolver looks up selma.corp.acme.com.corp.acme.com first (which the local name server can answer), then selma.corp.acme.com, the correct domain name. And:

% telnet selma

works fine, too.
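To see why the shorter search list helps, it's enough to check which of the names the resolver tries fall inside the local zone. Here is a minimal sketch in Python, assuming the local name server is authoritative for all of corp.acme.com; in_zone and offsite are hypothetical helpers, not part of any resolver library.

```python
def in_zone(name, zone):
    """True if name is at or below zone (case-insensitive, trailing dots ignored)."""
    name = name.rstrip(".").lower()
    zone = zone.rstrip(".").lower()
    return name == zone or name.endswith("." + zone)


LOCAL_ZONE = "corp.acme.com"

# Names an older resolver tries for "telnet selma.corp.acme.com"
with_parent = [
    "selma.corp.acme.com.corp.acme.com",
    "selma.corp.acme.com.acme.com",
    "selma.corp.acme.com",
]
# The same lookup with the search list cut down to corp.acme.com
without_parent = [
    "selma.corp.acme.com.corp.acme.com",
    "selma.corp.acme.com",
]


def offsite(names):
    """Names that the local zone's name server cannot answer on its own."""
    return [n for n in names if not in_zone(n, LOCAL_ZONE)]


print(offsite(with_parent))     # ['selma.corp.acme.com.acme.com']
print(offsite(without_parent))  # []
```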

8.6.2 Longer Outages (Days)

If you lose network connectivity for a long time, your name servers may have other problems. If they lose connectivity to the root name servers for an extended period, they'll stop resolving queries outside their authoritative data. If the slaves can't reach their master, sooner or later they'll expire the zone.
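How soon is "sooner or later"? A slave expires a zone once the expire interval in the zone's SOA record has passed since it last reached its master successfully. A quick sketch of the arithmetic, assuming an expire value of 604800 seconds (one week) and a made-up timestamp for the last successful refresh:

```python
from datetime import datetime, timedelta


def zone_expiry(last_refresh, expire_seconds):
    """When a slave gives up on a zone it can no longer refresh."""
    return last_refresh + timedelta(seconds=expire_seconds)


last_refresh = datetime(2000, 1, 3, 9, 0)  # hypothetical last successful refresh
print(zone_expiry(last_refresh, 604800))   # 2000-01-10 09:00:00
```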

In case your name service really goes haywire because of the connectivity loss, it's a good idea to keep a site-wide or workgroup /etc/hosts around. In times of dire need, you can move resolv.conf to resolv.bak, kill the local name server (if there is one), and just use /etc/hosts. It's not flashy, but it'll get you by.
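A host table is just a list of addresses, each followed by one or more names, so the fallback is easy to picture. The sketch below is only an illustration of the file format, not how any particular system reads /etc/hosts; parse_hosts is a hypothetical helper and the sample entries are illustrative.

```python
def parse_hosts(text):
    """Map each name and alias in hosts-file text to its address."""
    table = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and blank lines
        if not line:
            continue
        address, *names = line.split()
        for name in names:
            # Keep the first address listed for a name.
            table.setdefault(name.lower(), address)
    return table


SAMPLE = """\
# illustrative entries only
127.0.0.1      localhost
192.249.249.3  terminator.movie.edu terminator
192.249.249.1  wormhole.movie.edu wormhole
"""

hosts = parse_hosts(SAMPLE)
print(hosts["terminator"])          # 192.249.249.3
print(hosts["wormhole.movie.edu"])  # 192.249.249.1
```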

As for slaves, you can reconfigure a slave that can't reach its master to run as a primary master. Just edit named.conf and change the type substatement in the zone statement from slave to master, then delete the masters substatement. If more than one slave for the same zone is cut off, you can configure one as a primary master temporarily and reconfigure the others to load from the temporary primary.
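If you have many slaves to convert, the edit is mechanical enough to script. Here is a rough sketch in Python that rewrites a single zone statement held in a string. The function promote_to_master is hypothetical, the sample zone statement (file name and master address included) is only an illustration, and the regular expressions assume a simply formatted statement with no nested braces or comments.

```python
import re


def promote_to_master(zone_stmt):
    """Turn a slave zone statement into a master one (simplified, regex-based)."""
    # Drop the substatement that lists the master, along with its line.
    stmt = re.sub(
        r"^[ \t]*masters\s*\{[^}]*\}\s*;[ \t]*\n?",
        "",
        zone_stmt,
        flags=re.MULTILINE,
    )
    # Switch the zone type from slave to master.
    return re.sub(r"\btype\s+slave\s*;", "type master;", stmt)


SLAVE = '''zone "movie.edu" {
    type slave;
    file "db.movie";
    masters { 192.249.249.3; };
};
'''

print(promote_to_master(SLAVE))
```

Keep a copy of the original named.conf so you can switch back once the master is reachable again.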

Alternatively, you can just increase the expire time in all of your slaves' backup files and then signal the slaves to reload the files.
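Editing the expire field by hand in many backup files is tedious, so here is one way it might be scripted. This is a sketch under several assumptions: the SOA record is written across several lines inside parentheses, its numeric fields are plain seconds, and expire is the fourth number after the opening parenthesis. The function set_soa_expire and the sample zone text are our own, for illustration.

```python
import re


def set_soa_expire(zone_text, new_expire):
    """Replace the fourth numeric SOA field (expire) in zone-file text."""
    lines = zone_text.splitlines(keepends=True)
    in_soa = False
    seen = 0  # numeric SOA fields seen so far
    for i, line in enumerate(lines):
        data, sep, comment = line.partition(";")  # leave comments untouched
        offset = 0
        if not in_soa:
            if "SOA" not in data.split() or "(" not in data:
                continue
            in_soa = True
            offset = data.index("(") + 1  # only look after the parenthesis
        for match in re.finditer(r"\d+", data[offset:]):
            seen += 1
            if seen == 4:
                start, end = match.start() + offset, match.end() + offset
                lines[i] = data[:start] + str(new_expire) + data[end:] + sep + comment
                return "".join(lines)
        if ")" in data[offset:]:
            break  # SOA record ended before a fourth number appeared
    raise ValueError("no multi-line SOA record with an expire field found")


SAMPLE = """\
movie.edu. IN SOA terminator.movie.edu. al.robocop.movie.edu. (
                 1        ; serial
                 10800    ; refresh
                 3600     ; retry
                 604800   ; expire
                 86400 )  ; minimum TTL
"""

print(set_soa_expire(SAMPLE, 2592000))  # expire becomes 30 days
```

Any comment describing the old value is left as it was, so check the comments after running something like this.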

8.6.3 Really Long Outages (Weeks)

If an extended outage cuts you off from the Internet - say for a week or more - you may need to restore connectivity to root name servers artificially to get things working again. Every name server needs to talk to a root name server occasionally. It's a bit like therapy: the name server needs to contact the root to regain its perspective on the world.

To provide root name service during a long outage, you can set up your own root name servers, but only temporarily. Once you're reconnected to the Internet, you must shut off your temporary root servers. The most obnoxious vermin on the Internet are name servers that believe they're root name servers but don't know anything about most top-level domains. A close second is the Internet name server configured to query - and report - a false set of root name servers.

That said, and our alibis in place, here's what you have to do to configure your own root name server. First, you need to create a db.root file. The db.root file will delegate to the highest-level domain in your isolated network. For example, if movie.edu were to be isolated from the Internet, we might create a db.root file for terminator that looked like this:

. IN SOA terminator.movie.edu. al.robocop.movie.edu. (
                 1        ; Serial
                 10800    ; Refresh after 3 hours
                 3600     ; Retry after 1 hour
                 604800   ; Expire after 1 week
                 86400 )  ; Minimum TTL of 1 day

; Refresh, retry and expire really don't matter, since all
; roots are primaries.  Minimum TTL could be longer, since
; the data are likely to be stable.

  IN NS terminator.movie.edu. ; terminator is the temp. root

; Our root only knows about movie.edu and our two
; in-addr.arpa domains

movie.edu. 86400 IN NS terminator.movie.edu.
           86400 IN NS wormhole.movie.edu.

249.249.192.in-addr.arpa. 86400 IN NS terminator.movie.edu.
                          86400 IN NS wormhole.movie.edu.

253.253.192.in-addr.arpa. 86400 IN NS terminator.movie.edu.
                          86400 IN NS wormhole.movie.edu.

terminator.movie.edu. 86400 IN A 192.249.249.3
wormhole.movie.edu.   86400 IN A 192.249.249.1
                      86400 IN A 192.253.253.1
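While the outage lasts, the temporary root is the only place these delegations can be looked up, so it's worth checking that every name server named in an NS record also has an address record in the file. Here is a small sketch of that check in Python, with the records transcribed by hand as tuples rather than parsed from the file; missing_glue is a hypothetical helper.

```python
# (owner, type, data) tuples transcribed from a db.root file like the one above
RECORDS = [
    (".", "NS", "terminator.movie.edu."),
    ("movie.edu.", "NS", "terminator.movie.edu."),
    ("movie.edu.", "NS", "wormhole.movie.edu."),
    ("249.249.192.in-addr.arpa.", "NS", "terminator.movie.edu."),
    ("249.249.192.in-addr.arpa.", "NS", "wormhole.movie.edu."),
    ("253.253.192.in-addr.arpa.", "NS", "terminator.movie.edu."),
    ("253.253.192.in-addr.arpa.", "NS", "wormhole.movie.edu."),
    ("terminator.movie.edu.", "A", "192.249.249.3"),
    ("wormhole.movie.edu.", "A", "192.249.249.1"),
    ("wormhole.movie.edu.", "A", "192.253.253.1"),
]


def missing_glue(records):
    """Name servers listed in NS records that have no A record in the data."""
    have_address = {owner.lower() for owner, rtype, _ in records if rtype == "A"}
    targets = {data.lower() for _, rtype, data in records if rtype == "NS"}
    return sorted(targets - have_address)


print(missing_glue(RECORDS))  # []
```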

Then we need to add the appropriate line to terminator's named.conf file:

// Comment out hints zone
// zone "." {
//         type hint;
//         file "db.cache";
// };

zone "." {
        type master;
        file "db.root";
};

Or, for BIND 4's named.boot file:

; cache    .   db.cache  (comment out the cache directive)
primary  .   db.root

We then update all of our name servers (except the new, temporary root) with a db.cache file that includes just the temporary root (best to move the old cache file aside - we'll need it later, once connectivity is restored).

Here are the contents of the file db.cache:

.  99999999  IN  NS  terminator.movie.edu.

terminator.movie.edu.  IN  A  192.249.249.3

That will keep movie.edu name resolution going during the outage. Then, once Internet connectivity is restored, we can delete the root zone statement from named.conf on terminator, uncomment its hints zone, and restore the original cache files on all our other name servers.

