August 2010

Monday, August 23, 2010

How does WHOIS work? A dirty guide

While devising my domain name suggestion tool I had to learn a lot about how WHOIS works in practice. In the interest of sharing knowledge I've written out what I learned.

WHOIS is a simple protocol to query the internet's database for ownership of domain names.

Simply connect to a whois server on port 43, enter a domain, and hit return. The server will print a couple response packets to your socket and the disconnect you.

Example (I typed in "hello.com" and hit return):

$ telnet whois.verisign-grs.com 43
Trying 199.7.52.74...
Connected to whois.verisign-grs.com.
Escape character is '^]'.
hello.com

Whois Server Version 2.0

Domain names in the .com and .net domains can now be registered
with many different competing registrars. Go to http://www.internic.net
for detailed information.

   Domain Name: HELLO.COM
   Registrar: MARKMONITOR INC.
   Whois Server: whois.markmonitor.com
   Referral URL: http://www.markmonitor.com
   Name Server: NS1.GOOGLE.COM
   Name Server: NS2.GOOGLE.COM
   Name Server: NS3.GOOGLE.COM
   Name Server: NS4.GOOGLE.COM
   Status: clientDeleteProhibited
   Status: clientTransferProhibited
   Status: clientUpdateProhibited
   Updated Date: 30-mar-2010
   Creation Date: 30-apr-1997
   Expiration Date: 01-may-2011

>>> Last update of whois database: Sun, 22 Aug 2010 16:52:25 UTC <<<

NOTICE: The expiration date displayed in this record is the date the
registrar's sponsorship of the domain name registration in the registry[.....]

Simple enough, right? I learned a lot of tricks about this process.

Every TLD is different

Every top-level-domain (e.g. .com, .net., .me, .name) handles whois differently: they have different central whois servers and different response formats.

dot-com is handled by verisign at whois.verisign-grs.com. The folks at whois-servers.net have helpfully provided cname aliases for the central whois server for every TLD. (That is to say, com.whois-servers.net will resolve to verisign's whois server for .com, and me.whois-servers.net will resolve to the whois server for the .me TLD, etc.)

There are several mirrors of verisign-grs.com, including whois.internic.net, whois.crsnic.net, and whois.nsiregistry.net, which must be obsolete addresses of competitors that verisign took over at some point. These names all resolve to addresses of the pattern 199.7.*.74 and I've found other whois servers responding in this address range operated by verisign that do not appear to be documented anywhere.

I discovered that verisign's whois server will always respond in the following way:

The first packet you receive contains always contains the header "Whois Server Version 2.0\n\n

Domain names in the .com and .net domains can now be registered with many different competing registrars. Go to http://www.internic.net for detailed information.\n"
The second packet contains the actual whois result, if any, followed by a long legal disclaimer

So I discovered that the most optimal way to make the request is to accept one packet, discard it, then read a little bit into the next packet to get the whois result, then disconnect before reading the ~2kb of legalese with every request. A little rude, perhaps, but you can be reasonably certain that that legal notice is never going change and it's just a waste of resources.

.com and .net require a two-stage lookup

dot-com and dot-net (both handled by verisign) are a little different than most other whois systems. The central whois server will tell you if a domain is registered, when it expires, and at what registrar it was registered at, but does not contain any information about who actually owns any domain. The whois record will contain a row that will tell you the whois server of the registrar with which the domain was registered, and you must then make a second whois request against the registrar's own whois server to get the information on the actual owner. Registrar whois servers follow the same request protocol but only contain any information about the domains controlled by that registrar.

I've never been able to get banned from verisign's whois servers (though supposedly it is possible), and I've always found it to be extremely fast (though I can tell some of the servers you get round-robin'ed to are faster than others). The various registrar whois servers are a different story: they vary widely in reliability and some have paranoid banning policies. Oh, and the response format different registrars use may not be the quite the same.

This two-stage design is understandable: given 90 million .com domain names are already registered, this makes the amount of data verisign has to maintain and serve out more manageable. By the way, you can apply for an account to download the entire zone file from verisign for .com and analogously for many other TLDs. dot-com's zone file is about 1 GB.

DNS is your friend

If you just want to tell whether a name is registered it's almost always a better deal to just do a DNS lookup against it. Certain record types like SOA or NS might have better abstract properties but A records are probably fastest considering it's more likely to be cached by a lower-tier DNS server. If you find any DNS record for a domain, it would have to be registered, and the vast majority of owned domains will have an A record set. So you can quickly discard domains that have DNS records as being registered.

Finally

Whois is a weird, wacky protocol. Like many other domaining issues, in practice it means knowledge of the whims and quirks of verisign's chosen behavior. At one point in 1999 the protocol was changed and all whois clients out there broke. This might happen again.

I haven't actually tried getting access to the zone file, but all of the above suggests a hybrid approach when checking a domain in order of expense:

Check the zone file first
Do a DNS lookup second
Look it up in the central whois
If found in the central whois, check the registrar whois

So that's all I've learned about whois. Enjoy!

Wednesday, August 18, 2010

Why semmyfun?

Because words like semantic are fun but semmy is more fun than semantic

UPDATE: also it's homophonous with semi-fun, which is semi-fun

Domains are inefficiently priced

The market for domains is an unusual one: domains have unusual characteristics as a priced good.

It's been bemoaned countless times in countless places the practice of domain squatting: seemingly every good domain is taken, and often by some domain squatter who isn't using the domain for any admirable purpose. I made a tool recently to find good available domains.

First off, these complaints are lies. Everyone who has ever complained about domain parking has a few domains they bought but are just sitting on without using them.

If you've ever thought about this issue for more than a minute, it's occurred to you that

"To combat squatting, they should raise the price of domains! If it cost $100 to own a domain 95% of squatting would go away!"

only to realize seconds later

"But then I would have to pay that much for my domains, including a few that I want to sit on and not use. I still want to pay $7!"

and then you put your head into your hands and moan about how unfair it all is.

Everyone is basically a hypocrite: domainers suck, but I still want to pay $7 for a domain and not necessarily do anything with it once I own it. It's a tragedy of the commons of intentions.

Domains are inefficiently priced.

There isn't really a good pricing mechanism for domain names.

there is virtually a limitless supply of domains, and more cannot be made.
once a domain is used, no one else can use it
every domain is unique
some domains are very high value, most are not
most domains are of some idiosyncratic value to a few rare individuals spread out across the earth

The quantity supplied of unused domains is practically infinite. The quantity supplied of each domain is 1.

As a result the classic functions of prices no longer, well, function:

No signaling mechanism. If someone buys a domain, it does not send a signal to the market that it should produce more domain names.
No transmission of preferences. Since every domain is unique and can only be used once, past domain market activity does not indicate what a price of any given domain should be in order to be efficient.

Since most domains without type-in traffic are of some unpredictable value to a small number of individuals, the best way to price these domains is with an auction mechanism. The efficient price for a short dictionary word should cost thousands of times more than an 8 syllable jokey domain, and the way to determine a price that at least serves a rationing function (apportioning domains to those who want it most) is to use an auction pricing mechanism.

The most important barrier to auction pricing mechanism is time value fluctuation. Everyone who wants a domain probably does not want it at the same time. Bob in Johannesburg wants a given domain in 2002, Hillary in Montevideo realizes she wants the same name in 2007. If they both had wanted the domain at the same time, we could auction it off between them.

So why doesn't Hillary just try to buy the domain off Bob, assuming he purchased it? Effectively domain buyers can just conduct one-off auctions between themselves and the current domain owner.

Well, why not? Buying used domains should be efficient, right?

Transaction costs. Buying a used domain means tracking down and trying to negotiate with some random weirdo literally half a world away. This is so time consuming and frustrating most will not bother attempting it. It's also given that a large percentage of domain owners make it difficult to contact them, fearing spam.
Endowment effects. Psychologically, when people get something that isn't even unique and special they demand unrealistically high prices to part with it once they own it. Simple studies have subjects choose a price for a coffee cup they don't own and then are given the same coffee cup to own and asked at what price they would be willing to part with it -- the second price they choose is massively higher.
Inequity aversion. Someone selling a domain doesn't want there to be some massive hidden value in a domain that they aren't pricing correctly; they may be reluctant to sell without a "fair" price but have absolutely no way of knowing what a fair price should be and may refuse to sell for this reason alone. For the same reason we may refuse to buy a domain at a good price since it seems so unfair that the current owner only paid $7 for it.

Endowment effects should be exacerbated if the current owner has invested emotionally or put time into owning the domain, but should be less of a problem for more sophisticated sellers.

Large-scale domain sellers should not be effected by to these last set of inefficiencies. Big time domain parkers make it blatantly obvious how to contact them to buy the domain, have no emotional investment in any particular domain they own and should know from experience what is a fair price. Smaller domain owners are the problem with used domains.

We all hate domain parkers because we want an inefficient price for ourselves but efficient prices for everyone else. Dot-com names should cost $1000 and there should be another TLD that costs $1/year for all of our hobby projects.

Sunday, August 1, 2010

Can you do it? A simple web app, start to finish in one week

It's useful in life to set challenging, yet achievable goals. After some recent setbacks I decided to recover my ego with this simple task: given a good, simple idea for a useful web app, go from idea to launch in one week.

I'm happy to say that I completed this goal. The result is: http://www.dotcomroulette.com/

Was this challenging? Of course. I intended to still do all the normal things I need to do during the week, like going to the gym and going to grocery store, and use up extra time that I would have spent puttering around doing nothing of use. I was largely able to accomplish this, but I did stay up late a couple nights. I counted two Wednesdays as one day and found bugs over the next few days, but sometimes you have to make success fit you rather that you fit success.

Several times I was set back my an unexpected obstacle by at least a day. The worst of these was WHOIS. My site is powered by Google App Engine. GAE is ideal in many ways but sandboxes you into a limited environment where the only network calls you can make are over HTTP. I had expected that someone had written some HTTP whois API that I could call to make queries. Instead I had difficulty finding any, and those that did exist were not reasonably priced for my needs since I had to make hundreds of whois queries for each name suggestion (so $0.01 per query is not going to be profitable for me, to say the least).

I ended up writing a threaded HTTP-WHOIS proxy server that I run somewhere else. This is a significant design limitation, since if it goes down the entire site is unusable. Given that both systems are uncorrelated this means of course a multiplicative increase in the failure odds (p(fail) = 1 - (p₁(fail) * p₂(fail)). It's also a single ugly bottleneck for all those clouded GAE instances out there, sort of defeating the point.

The frontend design also took at least 40% of the time. Design is not one of my strengths and this component of creating the site was a source of anxiety for me. Would I be able to make something passable in a short period of time? I ended up coming up with some that I liked, but it wasn't something that I felt 100% confident about in advance of doing it.

UI-wise I found that I had to rule out doing anything ambitious that I didn't more or less know how to do already in the interest of releasing quickly.

The algorithm was easy. I had this down in a couple hours. Including writing the entire bayes and markov engines from scratch. Goes to show how well I know this stuff now. I later went back and optimized it, for instance, by replacing lists of numbers with python's optimized array.array data structure.

I also cut several features that I felt 1) complicated the layout, forcing me to make decisions about the layout that I wasn't certain about making and would have fretted over endlessly, and 2) didn't add that much value at launch time. For instance, I wanted to have a sidebar that would pull in retweets from the twitter api. Sometimes it's kind of hard to think of key words when you're staring at the input box and I wanted a way for people to get ideas using examples that others had shared. Additionally people could get a little feedback reward from doing a retweet, which gives me the obvious benefit of getting exposure to that person's twitter followers. But I couldn't make up my mind as to where I would put that element so that it wouldn't be distracting for people looking at the page for the first time and I couldn't even test the feature because twitter apparently couldn't even index the .appspot.com name for the site. In the interest of getting a release out the door, I cut the feature.

What did I learn?

I learned that I can make a genuinely useful web app in a week.

What if I could do this every week?