Monday, August 23, 2010

How does WHOIS work? A dirty guide

While devising my domain name suggestion tool I had to learn a lot about how WHOIS works in practice. In the interest of sharing knowledge I've written out what I learned.

WHOIS is a simple protocol to query the internet's database for ownership of domain names.

Simply connect to a whois server on port 43, enter a domain, and hit return. The server will print a couple of response packets to your socket and then disconnect you.

Example (I typed in "hello.com" and hit return):

$ telnet whois.verisign-grs.com 43
Trying 199.7.52.74...
Connected to whois.verisign-grs.com.
Escape character is '^]'.
hello.com


Whois Server Version 2.0


Domain names in the .com and .net domains can now be registered
with many different competing registrars. Go to http://www.internic.net
for detailed information.


   Domain Name: HELLO.COM
   Registrar: MARKMONITOR INC.
   Whois Server: whois.markmonitor.com
   Referral URL: http://www.markmonitor.com
   Name Server: NS1.GOOGLE.COM
   Name Server: NS2.GOOGLE.COM
   Name Server: NS3.GOOGLE.COM
   Name Server: NS4.GOOGLE.COM
   Status: clientDeleteProhibited
   Status: clientTransferProhibited
   Status: clientUpdateProhibited
   Updated Date: 30-mar-2010
   Creation Date: 30-apr-1997
   Expiration Date: 01-may-2011


>>> Last update of whois database: Sun, 22 Aug 2010 16:52:25 UTC <<<


NOTICE: The expiration date displayed in this record is the date the
registrar's sponsorship of the domain name registration in the registry[.....]
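The same exchange is easy to script. Here is a minimal Python sketch of what the telnet session above does, using only the standard library. The function names `build_query` and `whois_query` are made up for illustration, and the sketch assumes the server accepts a CRLF-terminated query and closes the connection when it is done.

```python
import socket


def build_query(domain: str) -> bytes:
    """Encode a WHOIS query: the domain name followed by CRLF."""
    return domain.strip().encode("ascii") + b"\r\n"


def whois_query(domain: str, server: str = "whois.verisign-grs.com",
                port: int = 43, timeout: float = 10.0) -> str:
    """Send one query and read until the server closes the connection."""
    chunks = []
    with socket.create_connection((server, port), timeout=timeout) as sock:
        sock.sendall(build_query(domain))
        while True:
            data = sock.recv(4096)
            if not data:  # empty read: the server has disconnected us
                break
            chunks.append(data)
    return b"".join(chunks).decode("utf-8", errors="replace")


# Example (needs network access):
# print(whois_query("hello.com"))
```

Nothing here is specific to verisign: pass a different `server` to query another registry or a registrar.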

Simple enough, right? I learned a lot of tricks about this process.

Every TLD is different

Every top-level domain (e.g. .com, .net, .me, .name) handles whois differently: each has its own central whois server and its own response format.

dot-com is handled by verisign at whois.verisign-grs.com. The folks at whois-servers.net have helpfully provided cname aliases for the central whois server for every TLD. (That is to say, com.whois-servers.net will resolve to verisign's whois server for .com, and me.whois-servers.net will resolve to the whois server for the .me TLD, etc.)
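Picking the right server then comes down to string manipulation. A small sketch, assuming the whois-servers.net aliases described above are still being served; `whois_server_for` is a hypothetical helper name, and it only builds the hostname without checking that it resolves.

```python
def whois_server_for(domain: str, alias_zone: str = "whois-servers.net") -> str:
    """Build the alias hostname for the TLD of `domain`.

    This only constructs the name; whether the alias exists for a
    given TLD is an assumption that a DNS lookup would have to confirm.
    """
    tld = domain.rstrip(".").rsplit(".", 1)[-1].lower()
    return f"{tld}.{alias_zone}"


print(whois_server_for("hello.com"))   # com.whois-servers.net
print(whois_server_for("example.me"))  # me.whois-servers.net
```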

There are several mirrors of whois.verisign-grs.com, including whois.internic.net, whois.crsnic.net, and whois.nsiregistry.net, which are presumably legacy names left over from organizations that verisign took over at some point. These names all resolve to addresses of the pattern 199.7.*.74, and I've found other verisign-operated whois servers responding in this address range that do not appear to be documented anywhere.

I discovered that verisign's whois server will always respond in the following way:
  • The first packet you receive always contains the header "Whois Server Version 2.0\n\n

    Domain names in the .com and .net domains can now be registered with many different competing registrars. Go to http://www.internic.net for detailed information.\n"
  • The second packet contains the actual whois result, if any, followed by a long legal disclaimer
So I discovered that the most efficient way to make the request is to accept one packet, discard it, then read a little bit into the next packet to get the whois result, then disconnect before reading the ~2kb of legalese that comes with every request. A little rude, perhaps, but you can be reasonably certain that the legal notice is never going to change, and reading it is just a waste of resources.
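One way to sketch the early-disconnect trick in Python is below. TCP does not guarantee that one recv() call returns exactly one packet, so this version keys on the content rather than on a packet count: it reads until a chosen marker line has arrived in full, then stops. The helper name `read_until_marker` and the default marker are assumptions for illustration (the marker is taken from the sample response above), and a real client would also need a marker for the "not found" reply.

```python
from typing import Callable, Iterable, Optional


def _cut_after_marker(buf: bytes, markers: Iterable[bytes]) -> Optional[bytes]:
    """Return buf up to the end of the first complete marker line, if any."""
    for marker in markers:
        start = buf.find(marker)
        if start != -1:
            end = buf.find(b"\n", start)
            if end != -1:  # the whole marker line has arrived
                return buf[:end + 1]
    return None


def read_until_marker(recv: Callable[[int], bytes],
                      markers: Iterable[bytes] = (b"Expiration Date:",),
                      limit: int = 8192) -> bytes:
    """Read from `recv` until a marker line is complete, EOF, or `limit` bytes."""
    markers = tuple(markers)
    buf = b""
    while len(buf) < limit:
        data = recv(512)
        if not data:  # server closed the connection
            break
        buf += data
        cut = _cut_after_marker(buf, markers)
        if cut is not None:
            return cut  # stop here; the caller closes the socket
    return buf
```

With a real connection you would pass `sock.recv` as the `recv` argument and close the socket as soon as the function returns, which skips the legal notice.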

.com and .net require a two-stage lookup

dot-com and dot-net (both handled by verisign) are a little different from most other whois systems. The central whois server will tell you whether a domain is registered, when it expires, and which registrar it was registered with, but it does not contain any information about who actually owns the domain. The whois record contains a row giving the whois server of the registrar with which the domain was registered, and you must then make a second whois request against the registrar's own whois server to get the information on the actual owner. Registrar whois servers follow the same request protocol but only contain information about the domains controlled by that registrar.
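The referral row can be pulled out with a regular expression. The sketch below assumes the row looks like the "Whois Server:" line in the sample response earlier in this post; `registrar_whois_server` is a made-up helper name, and other registries may label the row differently.

```python
import re
from typing import Optional

# Matches a line such as "   Whois Server: whois.markmonitor.com".
# The colon keeps it from matching the "Whois Server Version 2.0" banner.
_REFERRAL = re.compile(r"^\s*Whois Server:\s*(\S+)\s*$",
                       re.IGNORECASE | re.MULTILINE)


def registrar_whois_server(registry_response: str) -> Optional[str]:
    """Return the registrar's whois host named in a registry response, or None."""
    match = _REFERRAL.search(registry_response)
    return match.group(1).lower() if match else None
```

The second stage is then the same query as before, sent to the returned host instead of the central server.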

I've never been able to get banned from verisign's whois servers (though supposedly it is possible), and I've always found them to be extremely fast (though I can tell some of the servers you get round-robin'ed to are faster than others). The various registrar whois servers are a different story: they vary widely in reliability and some have paranoid banning policies. Oh, and the response formats that different registrars use may not be quite the same.

This two-stage design is understandable: given that about 90 million .com domain names are already registered, it keeps the amount of data verisign has to maintain and serve out more manageable. By the way, you can apply for an account to download the entire zone file from verisign for .com, and analogously for many other TLDs. dot-com's zone file is about 1 GB.

DNS is your friend

If you just want to tell whether a name is registered, it's almost always a better deal to do a DNS lookup against it. Certain record types like SOA or NS might have better abstract properties, but A records are probably fastest, since they are more likely to be cached by a lower-tier DNS server. If you find any DNS record for a domain, it has to be registered, and the vast majority of owned domains will have an A record set. So you can quickly discard domains that have DNS records as already registered.
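A rough version of that check in Python follows. It leans on the system resolver through `socket.gethostbyname`, so it also consults the local hosts file and only sees IPv4 addresses, and a failed lookup means "unknown", not "unregistered". The function name `has_address_record` and the injectable `resolve` parameter are illustrative choices.

```python
import socket
from typing import Callable


def has_address_record(domain: str,
                       resolve: Callable[[str], str] = socket.gethostbyname) -> bool:
    """True if the name resolves to an address; False if the lookup fails.

    False does NOT prove the domain is free: a registered domain may
    simply have no A record, so fall back to whois in that case.
    """
    try:
        resolve(domain)
    except OSError:  # socket.gaierror and socket.herror are subclasses
        return False
    return True


# Example (needs network access):
# print(has_address_record("hello.com"))
```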

Finally

Whois is a weird, wacky protocol. As with many other domaining issues, working with it in practice means knowing the whims and quirks of verisign's chosen behavior. At one point in 1999 the protocol was changed and all whois clients out there broke. This might happen again.

I haven't actually tried getting access to the zone file, but all of the above suggests a hybrid approach to checking a domain, trying the cheapest method first:

  1. Check the zone file first
  2. Do a DNS lookup second
  3. Look it up in the central whois
  4. If found in the central whois, check the registrar whois
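The ordering above could be wired together roughly like this. It is only a sketch: each check is a hypothetical callable that returns True (registered), False (not registered), or None (can't tell), and the first definite answer wins. The registrar lookup in step 4 is left out because it fetches owner details and does not change the registered/unregistered verdict.

```python
from typing import Callable, Iterable, Optional, Tuple

# A check returns True (registered), False (not registered) or None (unknown).
Check = Callable[[str], Optional[bool]]


def is_registered(domain: str,
                  checks: Iterable[Tuple[str, Check]]
                  ) -> Tuple[Optional[bool], Optional[str]]:
    """Run checks from cheapest to most expensive; stop at the first verdict.

    Returns (verdict, name_of_deciding_check), or (None, None) if no
    check could decide.
    """
    for name, check in checks:
        verdict = check(domain)
        if verdict is not None:
            return verdict, name
    return None, None


# Placeholder checks, standing in for real zone file, DNS and whois lookups.
demo_checks = [
    ("zone file", lambda d: None),  # pretend the zone file is unavailable
    ("dns", lambda d: True),        # pretend an A record was found
    ("whois", lambda d: False),     # never reached in this demo
]
print(is_registered("hello.com", demo_checks))  # (True, 'dns')
```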
So that's all I've learned about whois. Enjoy!