2010

Saturday, October 9, 2010

My config secrets

Warning: unix geeky post ahead. You have been warned.

There are a couple of config options I always set up on a fresh account. These settings tend to blow people's minds (slightly) when they see them. So I'll share them here now for posterity.

There are two bits of config, in two different user config files, that I always use. They're insanely useful, but to my knowledge no one else has ever discovered them. Get ready.

mysql client config

If you're like me, or are like my former self, you may have a bunch of mysql clients open to a bunch of different mysql servers and databases.

Every single one of these windows looks like:

mysql>

If you're like me it's awfully hard to remember which client is connected to which server. I often would log out of a client only to log back in again because I didn't know where I was connected.

I'm about to solve this problem for you forever with a little config change.

It's little known that you can create a user config for the mysql client and the other mysql util clients. Create a file called .my.cnf in your user directory. The mysql client uses the same config parser as the server so you have to create a section called [mysql] in order for it to pick up the settings.

Add the following line:

prompt="\u@\h/\d> "

When you log in using the mysql client again, it will look something like this:

myuser@10.3.2.99/reports>

\u is user, \h is host, \d is database, of course. You can find all of the options near the end of this page of the manual.

The second bit of config I put in the user mysql config is to set a pager. Sometimes if you execute a large query it will flood your terminal, exploding your buffer and maybe making your terminal unresponsive, frightening your children, etc.

The way you can fix this is to set a pager. You can set it to less but then it will page all of your queries, even the very short ones, and this quickly gets annoying. The secret is the -FX flags to less, which will tell it to exit and print everything to stdout if the buffer is less than one window high, which behaves beautifully. Add pager="less -FX" to your config.

Your complete ~/.my.cnf should look like:

[mysql]
prompt="\u@\h/\d> "
pager="less -FX"

readline

There's this library that bash, mysql and a million other command line shells use called readline. The python shell can use it but it requires some other jiggering that I don't feel like looking up right now. Anyway what's convenient about readline is that you can configure it in one place and all of your different shell programs will obey the same config.

This config should be placed at ~/.inputrc .

First, add the line:

set completion-ignore-case on

This will enable enhanced auto-complete, so if you type cd docu and press tab it will autocomplete Documents even though it is in a different case. It's kind of silly that bash should be case-sensitive on a case-insensitive file system.

Next, the more interesting part:

"\C-b": backward-word
"\C-f": forward-word

These settings will make it so that control-B will move the cursor back a whole word and control-F will move it forward a whole word. By default these key combinations just move it backward or forward a single character, perhaps vestigial from an era where many keyboards didn't have arrow keys. Being able to move about by word is much more useful, especially if you're working with a long command or sql statement. There's lots of other configuration you can give readline, like deleting words or adding surrounding quotes, but this is all I usually do. You should already know that control-A and control-E will move you to the beginning and end of the line in almost all readline-enabled shell programs.

So those are my config secrets. Insanely useful, yet seemingly no one is aware they are even possible. I hope they embiggen your unix computing experience.

Thursday, September 16, 2010

Get that 4 letter domain you've always wanted: include one digit

In our last experiment we attempted to estimate the proportion of pronounceable 5, 6 and 7-letter domains that are already registered.

But what about 4 letter domains?

I wrote a series of random domain generators to test different character distributions of 4 letter domains. (For all experiments n=500 and p=~0.0, chi-square test; all domains are dot-com.)

First, I assumed that pronounceability would not be a factor, and generated 500 domains consisting of 4 random letters. The results were what I expected:

random a-z : 100%

All 500 random domains that I generated were registered. I let it go for a half-hour or so and generated thousands of random domains and was not able to find one that was unregistered.

So, all 4 letter .com's are registered then?

Of course not. The secret? Digits!

Domains can also contain characters in the range 0-9. So, I tested a second domain generator that would produce random 4 character domains consisting of one digit in any position and 3 other characters that could be digits or letters chosen randomly from the set [a-z, 0-9].
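A generator along these lines can be sketched in a few lines of Python. This is a minimal sketch, not the script used for the experiment: the function name and the way the guaranteed digit is placed are my own assumptions. Note that placing one digit and filling the rest at random slightly favors names with several digits, since those can be produced in more than one way.

```python
import random
import string

LETTERS = string.ascii_lowercase
DIGITS = string.digits
ALNUM = LETTERS + DIGITS


def random_domain_with_digit(length=4, rng=random):
    """Return a random dot-com name containing at least one digit.

    One position (chosen at random) is forced to be a digit; every other
    position is drawn uniformly from a-z and 0-9. Hypothetical helper,
    written only to illustrate the description above.
    """
    digit_pos = rng.randrange(length)
    chars = [
        rng.choice(DIGITS) if i == digit_pos else rng.choice(ALNUM)
        for i in range(length)
    ]
    return "".join(chars) + ".com"
```

Calling random_domain_with_digit() 500 times gives a sample of the same size as the one in the experiment; each name would then still need a registration check.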

The result:

random a-z, 0-9 with at least one digit: 22.4%

Only 22.4% were registered! So, if you want a 4 letter domain, use a digit.

This got me thinking. Any of the domains I generated could have from 1-4 digits. What if I controlled the number of digits?

1 digit, 3 letters: 16.2%
2 digits, 2 letters: 24.6%
3 digits, 1 letter: 30.6%
4 digits: 100%

The results surprised me. Apparently the optimal number of digits to include is 1, and the more digits you have the more likely it is to be registered. In fact, it appears that someone has registered every dot-com combination of 4 digits.

The story is not over yet, though: what if you include a hyphen? I tried several experiments with 3 letters or digits plus a hyphen to find out.

3 random letters a-z + hyphen: 60.6%
3 random characters a-z, 0-9 + hyphen: 17.4%
3 random digits 0-9 + hyphen: 48%

Including a hyphen does not beat the 1 digit + 3 letter domain space but comes close. Interestingly, domains in the digits + hyphen set were dramatically more likely to be registered than the set of digits + characters + hyphen.

So, contrary to what you might believe, it turns out there are plenty of available 4 letter domains.

There are 456,976 possible 4 letter combinations of the letters a-z (26^4), and there are 703,040 possible combinations of 3 letters a-z plus one digit 0-9 ((10 * 26^3) * 4). Assuming the 16.2% proportion is safe to extrapolate on, there should be 589,147 unregistered one digit + 3 letter domains, more than the total number of possible 4 letter a-z domains. Popular wisdom suggests that attaining a 4 letter dot-com is nearly impossible. These results suggest that is not the case, if you're willing to include a digit.
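The arithmetic in that paragraph is easy to check. A small sketch (the variable names are mine; the 16.2% figure is the sampled proportion from above, kept as an integer per-mille value to avoid floating point rounding):

```python
# Size of each name space
four_letters = 26 ** 4                      # a-z in all four positions
one_digit_three_letters = 10 * 26 ** 3 * 4  # one digit, in any of 4 positions

# 16.2% registered means 83.8% unregistered; use integers to stay exact
registered_per_mille = 162
unregistered = one_digit_three_letters * (1000 - registered_per_mille) // 1000

print(four_letters)             # 456976
print(one_digit_three_letters)  # 703040
print(unregistered)             # 589147
```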

Finally, I attempted one last experiment: sample the complete space of possible domains, with each character chosen at random from 36 options (letters and digits) for the first and last positions and from 37 options (letters, digits and the hyphen) for the middle positions, since a hyphen cannot lead or end a domain. Results:

any legal combination of letters, digits and hyphens: 43.0%

Out of the 1,774,224 possible 4 letter dot-com domains (36^2 * 37^2), fewer than half are actually registered.
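Sampling that full space might look something like the sketch below. The names are hypothetical and this is only an illustration of the 36/37 rule, not the original generator.

```python
import random
import string

EDGE_CHARS = string.ascii_lowercase + string.digits  # 36 choices, no hyphen
INNER_CHARS = EDGE_CHARS + "-"                        # 37 choices


def count_labels(length):
    """Number of legal names of this length (assumes length >= 2)."""
    return len(EDGE_CHARS) ** 2 * len(INNER_CHARS) ** (length - 2)


def random_label(length=4, rng=random):
    """Draw one name uniformly: no hyphen in the first or last position."""
    inner = "".join(rng.choice(INNER_CHARS) for _ in range(length - 2))
    return rng.choice(EDGE_CHARS) + inner + rng.choice(EDGE_CHARS)
```

count_labels(4) reproduces the 1,774,224 figure, and because every position is drawn independently and uniformly, each legal name is equally likely to be sampled.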

update: Well, this post has generated significant interest. Check out my domain name generator, and here are some useful registrar coupon codes: for godaddy use code FALL99, and namecheap use BACK2REALITY. Let me know if they work.

Tuesday, September 14, 2010

What is the probability that a pronounceable domain is registered?

I saw someone griping on Hacker News that all the short, pronounceable domains are taken.

I realized that I knew how to quantify this frustration. One, how many pronounceable domains exist for a given number of characters? Two, given a finite answer to One, what is the probability that a domain in this space is registered?

I recently made a tool to find pronounceable domains, partially using Markov techniques. I started with a published frequency table of letter trigrams in English and simply brute-forced my way to find every possible overlapping combination of trigrams that would fit in a given length of characters. I assumed that a domain had to start with a valid starting trigram but could end at the length limit with any trigram. This process produces a set of some real but mostly nonsense words that should be pronounceable to English speakers. All results are limited to dot-com domains.
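To make the brute-force step concrete, here is a rough sketch of enumerating every word whose overlapping trigrams all appear in a table. The tiny trigram set below is made up for illustration; the real run used a published English frequency table, and the function name is my own.

```python
def pronounceable_words(trigrams, starts, length):
    """List every word of `length` letters that begins with a trigram in
    `starts` and whose overlapping trigrams are all in `trigrams`.
    Assumes length >= 3."""
    # Map each two-letter prefix to the letters that may follow it.
    followers = {}
    for tri in trigrams:
        followers.setdefault(tri[:2], set()).add(tri[2])

    words = []
    stack = list(starts)
    while stack:
        word = stack.pop()
        if len(word) == length:
            words.append(word)
            continue
        # Extend by one letter so the last three letters form a known trigram.
        for letter in followers.get(word[-2:], ()):
            stack.append(word + letter)
    return sorted(words)


# Toy data, not the real frequency table.
toy_trigrams = {"ant", "nto", "ton", "top", "nte", "ter"}
toy_starts = {"ant"}
print(pronounceable_words(toy_trigrams, toy_starts, 5))
# ['anter', 'anton', 'antop']
```

The size of the output grows quickly with length, which matches the jump from hundreds of thousands of names at 5 letters to millions at 7.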

The results:

For length = 5, there are 265,722 total possible pronounceable domains
For length = 6, there are 1,702,669 total possible pronounceable domains
For length = 7, there are 10,843,465 total possible pronounceable domains

Next I randomly sampled from these spaces to figure out the percentage of the possible dot-com domains for each length that are already registered. The results:

For length = 5, of 481 sampled from 265,722 possible names, 92.52% were registered
For length = 6, of 941 sampled from 1,702,669 possible names, 45.48% were registered
For length = 7, of 1906 sampled from 10,843,465 possible names, 12.8% were registered

So 92.52% of pronounceable 5 letter dot-com domains are registered, along with 45.48% of 6 letter domains and only 12.8% of 7 letter domains. (By the way, I've never been able to find an alphabetical 4 letter domain that is not already registered.) Eyeballing the results, it did appear that the registered domains tended to have a better "ring" to them than those that were not, and seven-letter nonsense words often had no ring to them at all, but automatically assessing domain quality was not attempted in this analysis.
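For a sense of how far these samples can be trusted, the sampled share can be scaled up to the whole space and given a rough standard error. This is a back-of-the-envelope sketch using the usual normal approximation for a proportion, which is my addition and not part of the original analysis.

```python
import math


def estimate_available(total, sampled, registered_share):
    """Scale a sampled registered share up to the whole name space.

    Returns (estimated available names, standard error of the share).
    Uses the simple sqrt(p * (1 - p) / n) approximation.
    """
    p = registered_share
    std_err = math.sqrt(p * (1 - p) / sampled)
    available = round(total * (1 - p))
    return available, std_err


for length, total, sampled, share in [
    (5, 265722, 481, 0.9252),
    (6, 1702669, 941, 0.4548),
    (7, 10843465, 1906, 0.128),
]:
    available, std_err = estimate_available(total, sampled, share)
    print(length, available, round(std_err, 4))
```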

This should put a number on our frustrations, and maybe a year from now I can run the same script and see how the numbers have changed.

The area of unregistered pronounceable 5 letter domains, representing 19,876 names, is barely visible. This is plotted according to the absolute number of domains on a linear scale. The next chart shows the percentage available per length:

To give you an idea of what the output of this process looks like, I reproduced some output for each character length below.

len = 5	len = 6	len = 7
ANTON.com registered	ANDENO.com available	ANDEAST.com available
ANTOP.com registered	ANDEMA.com registered	ANTORRI.com available
ANCLE.com registered	ANDITI.com available	ANTOWEA.com available
ANCRA.com registered	ANDARN.com available	ANTARRE.com available
ANCOM.com registered	ANTERR.com available	ANTABIT.com available
ANGLA.com registered	ANTESS.com available	ANTHARI.com available
ANGRO.com registered	ANCESI.com available	ANTHIRE.com registered
ANSED.com registered	ANCERO.com registered	ANTUTHE.com available
ANSEC.com registered	ANCLEM.com available	ANCEDEP.com available
ANNER.com registered	ANGLEG.com registered	ANCHATH.com available
ANORT.com registered	ANSTON.com registered	ANCHITA.com available
ANORD.com registered	ANSENI.com available	ANCHREL.com available
ANORR.com available	ANICER.com registered	ANCLEAK.com available
ANOME.com registered	ANNICE.com registered	ANCRIME.com available
ANONA.com registered	ANNOWI.com available	ANCROAT.com available
ANONO.com registered	ANORNE.com available	ANGERRO.com available
ANEDI.com registered	ANONES.com available	ANGEDIT.com available
ANEST.com registered	ANECON.com registered	ANGICIE.com available
ANEEN.com registered	ANETAC.com registered	ANGRANO.com available
ANECA.com registered	ANELOC.com available	ANSITUT.com available
ANETI.com registered	ANELDE.com available	ANSIMPE.com available

Monday, August 23, 2010

How does WHOIS work? A dirty guide

While devising my domain name suggestion tool I had to learn a lot about how WHOIS works in practice. In the interest of sharing knowledge I've written out what I learned.

WHOIS is a simple protocol to query the internet's database for ownership of domain names.

Simply connect to a whois server on port 43, enter a domain, and hit return. The server will print a couple of response packets to your socket and then disconnect you.

Example (I typed in "hello.com" and hit return):

$ telnet whois.verisign-grs.com 43
Trying 199.7.52.74...
Connected to whois.verisign-grs.com.
Escape character is '^]'.
hello.com

Whois Server Version 2.0

Domain names in the .com and .net domains can now be registered
with many different competing registrars. Go to http://www.internic.net
for detailed information.

   Domain Name: HELLO.COM
   Registrar: MARKMONITOR INC.
   Whois Server: whois.markmonitor.com
   Referral URL: http://www.markmonitor.com
   Name Server: NS1.GOOGLE.COM
   Name Server: NS2.GOOGLE.COM
   Name Server: NS3.GOOGLE.COM
   Name Server: NS4.GOOGLE.COM
   Status: clientDeleteProhibited
   Status: clientTransferProhibited
   Status: clientUpdateProhibited
   Updated Date: 30-mar-2010
   Creation Date: 30-apr-1997
   Expiration Date: 01-may-2011

>>> Last update of whois database: Sun, 22 Aug 2010 16:52:25 UTC <<<

NOTICE: The expiration date displayed in this record is the date the
registrar's sponsorship of the domain name registration in the registry[.....]

Simple enough, right? I learned a lot of tricks about this process.
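The same exchange as the telnet session can be done from a script. Here is a minimal sketch using only the standard socket module; the function names are mine, and the server and port are the ones shown in the session above.

```python
import socket


def build_query(domain):
    """A whois query is just the name followed by a line ending."""
    return domain.strip().lower().encode("ascii") + b"\r\n"


def whois(domain, server="whois.verisign-grs.com", port=43, timeout=10):
    """Send one query and read until the server closes the connection."""
    with socket.create_connection((server, port), timeout=timeout) as sock:
        sock.sendall(build_query(domain))
        chunks = []
        while True:
            data = sock.recv(4096)
            if not data:  # empty read: the server has disconnected us
                break
            chunks.append(data)
    return b"".join(chunks).decode("utf-8", errors="replace")
```

Something like print(whois("hello.com")) should then print the same kind of response as the session above, network permitting.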

Every TLD is different

Every top-level domain (e.g. .com, .net, .me, .name) handles whois differently: they have different central whois servers and different response formats.

dot-com is handled by verisign at whois.verisign-grs.com. The folks at whois-servers.net have helpfully provided cname aliases for the central whois server for every TLD. (That is to say, com.whois-servers.net will resolve to verisign's whois server for .com, and me.whois-servers.net will resolve to the whois server for the .me TLD, etc.)
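Given that naming scheme, picking the server for a domain is a one-liner. A small sketch (the function name is hypothetical, and it assumes the aliases follow the tld.whois-servers.net pattern described above):

```python
def central_whois_server(domain):
    """Return the whois-servers.net alias for the domain's TLD."""
    tld = domain.rstrip(".").rsplit(".", 1)[-1].lower()
    return tld + ".whois-servers.net"


print(central_whois_server("hello.com"))   # com.whois-servers.net
print(central_whois_server("example.me"))  # me.whois-servers.net
```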

There are several mirrors of verisign-grs.com, including whois.internic.net, whois.crsnic.net, and whois.nsiregistry.net, which must be obsolete addresses of competitors that verisign took over at some point. These names all resolve to addresses of the pattern 199.7.*.74 and I've found other whois servers responding in this address range operated by verisign that do not appear to be documented anywhere.

I discovered that verisign's whois server will always respond in the following way:

The first packet you receive always contains the header "Whois Server Version 2.0\n\nDomain names in the .com and .net domains can now be registered with many different competing registrars. Go to http://www.internic.net for detailed information.\n"
The second packet contains the actual whois result, if any, followed by a long legal disclaimer

So I discovered that the most efficient way to make the request is to accept one packet, discard it, then read a little bit into the next packet to get the whois result, then disconnect before reading the ~2kb of legalese that comes with every request. A little rude, perhaps, but you can be reasonably certain that the legal notice is never going to change, and reading it is just a waste of resources.
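One way to sketch the "stop reading early" trick: rather than counting packets (TCP does not promise that reads line up with the server's writes), this version reads until it sees the ">>> Last update" line from the sample response and then stops. The function names and the choice of marker are my assumptions, not a description of the original code.

```python
def read_record(recv, marker=b">>> Last update", bufsize=1024):
    """Read from `recv` until `marker` appears or the stream ends.

    `recv` is any callable like socket.recv; passing it in keeps the
    sketch testable without a network. Returns the bytes before the marker.
    """
    buf = b""
    while marker not in buf:
        data = recv(bufsize)
        if not data:
            break
        buf += data
    return buf.split(marker, 1)[0]


def extract_record(text):
    """Drop the fixed header: keep everything from 'Domain Name:' onward."""
    start = text.find("Domain Name:")
    return "" if start == -1 else text[start:].strip()
```

With a real connection you would pass sock.recv as the recv argument and close the socket as soon as read_record returns, leaving the legal notice unread.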

.com and .net require a two-stage lookup

dot-com and dot-net (both handled by verisign) are a little different than most other whois systems. The central whois server will tell you if a domain is registered, when it expires, and with which registrar it was registered, but does not contain any information about who actually owns any domain. The whois record will contain a row that tells you the whois server of the registrar with which the domain was registered, and you must then make a second whois request against the registrar's own whois server to get the information on the actual owner. Registrar whois servers follow the same request protocol but only contain information about the domains controlled by that registrar.
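Finding the second-stage server means pulling the "Whois Server:" row out of the central response. A minimal sketch, with a hypothetical function name and using the field label from the sample response earlier in this post:

```python
def registrar_whois_server(record):
    """Return the value of the 'Whois Server:' row, or None if absent."""
    for line in record.splitlines():
        key, sep, value = line.strip().partition(":")
        # The header line "Whois Server Version 2.0" has no colon, so it
        # is skipped here.
        if sep and key.lower() == "whois server":
            return value.strip() or None
    return None


sample = """Whois Server Version 2.0

   Domain Name: HELLO.COM
   Registrar: MARKMONITOR INC.
   Whois Server: whois.markmonitor.com
   Referral URL: http://www.markmonitor.com
"""
print(registrar_whois_server(sample))  # whois.markmonitor.com
```

The second request is then the same one-line query, sent to that host on port 43.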

I've never been able to get banned from verisign's whois servers (though supposedly it is possible), and I've always found them to be extremely fast (though I can tell some of the servers you get round-robin'ed to are faster than others). The various registrar whois servers are a different story: they vary widely in reliability and some have paranoid banning policies. Oh, and the response format different registrars use may not be quite the same.

This two-stage design is understandable: given 90 million .com domain names are already registered, this makes the amount of data verisign has to maintain and serve out more manageable. By the way, you can apply for an account to download the entire zone file from verisign for .com and analogously for many other TLDs. dot-com's zone file is about 1 GB.

DNS is your friend

If you just want to tell whether a name is registered it's almost always a better deal to just do a DNS lookup against it. Certain record types like SOA or NS might have better abstract properties but A records are probably fastest, considering they're more likely to be cached by a lower-tier DNS server. If you find any DNS record for a domain, it would have to be registered, and the vast majority of owned domains will have an A record set. So you can quickly discard domains that have DNS records as being registered.
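A quick sketch of that DNS shortcut using the standard library resolver. The function name is mine, and the resolver is passed in as a parameter purely so the logic can be exercised without a network. A False result only means "no answer", not "unregistered", so those names still need a whois check.

```python
import socket


def has_a_record(domain, resolve=socket.gethostbyname):
    """True if the name resolves to an address, False if the lookup fails."""
    try:
        resolve(domain)
    except OSError:  # socket.gaierror is a subclass of OSError
        return False
    return True
```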

Finally

Whois is a weird, wacky protocol. Like many other domaining issues, in practice it means knowledge of the whims and quirks of verisign's chosen behavior. At one point in 1999 the protocol was changed and all whois clients out there broke. This might happen again.

I haven't actually tried getting access to the zone file, but all of the above suggests a hybrid approach when checking a domain in order of expense:

Check the zone file first
Do a DNS lookup second
Look it up in the central whois
If found in the central whois, check the registrar whois
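The ordering above amounts to running the cheapest check first and stopping at the first one that gives a definite answer. Here is a sketch of that control flow only; the check functions are stand-ins (hypothetical), each returning True for registered, False for not registered, or None for "can't tell". The registrar lookup in the last step is for owner details, so it sits outside this loop.

```python
def is_registered(domain, checks):
    """Run (name, check) pairs in order; stop at the first definite answer.

    Returns (verdict, name of the check that decided), or (None, None)
    if every check was inconclusive.
    """
    for name, check in checks:
        verdict = check(domain)
        if verdict is not None:
            return verdict, name
    return None, None


# Stand-in checks for illustration only.
checks = [
    ("zone file", lambda d: None),
    ("dns", lambda d: True if d == "taken.com" else None),
    ("central whois", lambda d: d == "taken.com"),
]
print(is_registered("taken.com", checks))  # (True, 'dns')
print(is_registered("free.com", checks))   # (False, 'central whois')
```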

So that's all I've learned about whois. Enjoy!

Wednesday, August 18, 2010

Why semmyfun?

Because words like semantic are fun but semmy is more fun than semantic

UPDATE: also it's homophonous with semi-fun, which is semi-fun

Domains are inefficiently priced

The market for domains is an unusual one: domains have unusual characteristics as a priced good.

The practice of domain squatting has been bemoaned countless times in countless places: seemingly every good domain is taken, and often by some domain squatter who isn't using the domain for any admirable purpose. I made a tool recently to find good available domains.

First off, these complaints are lies. Everyone who has ever complained about domain parking has a few domains they bought but are just sitting on without using them.

If you've ever thought about this issue for more than a minute, it's occurred to you that

"To combat squatting, they should raise the price of domains! If it cost $100 to own a domain 95% of squatting would go away!"

only to realize seconds later

"But then I would have to pay that much for my domains, including a few that I want to sit on and not use. I still want to pay $7!"

and then you put your head into your hands and moan about how unfair it all is.

Everyone is basically a hypocrite: domainers suck, but I still want to pay $7 for a domain and not necessarily do anything with it once I own it. It's a tragedy of the commons of intentions.

Domains are inefficiently priced.

There isn't really a good pricing mechanism for domain names.

there is virtually a limitless supply of domains, and more cannot be made.
once a domain is used, no one else can use it
every domain is unique
some domains are very high value, most are not
most domains are of some idiosyncratic value to a few rare individuals spread out across the earth

The quantity supplied of unused domains is practically infinite. The quantity supplied of each domain is 1.

As a result the classic functions of prices no longer, well, function:

No signaling mechanism. If someone buys a domain, it does not send a signal to the market that it should produce more domain names.
No transmission of preferences. Since every domain is unique and can only be used once, past domain market activity does not indicate what a price of any given domain should be in order to be efficient.

Since most domains without type-in traffic are of some unpredictable value to a small number of individuals, the best way to price these domains is with an auction mechanism. The efficient price for a short dictionary word should be thousands of times higher than that of an 8 syllable jokey domain, and the way to determine a price that at least serves a rationing function (apportioning domains to those who want them most) is to use an auction pricing mechanism.

The most important barrier to an auction pricing mechanism is time value fluctuation. Everyone who wants a domain probably does not want it at the same time. Bob in Johannesburg wants a given domain in 2002, Hillary in Montevideo realizes she wants the same name in 2007. If they both had wanted the domain at the same time, we could auction it off between them.

So why doesn't Hillary just try to buy the domain off Bob, assuming he purchased it? Effectively domain buyers can just conduct one-off auctions between themselves and the current domain owner.

Well, why not? Buying used domains should be efficient, right?

Transaction costs. Buying a used domain means tracking down and trying to negotiate with some random weirdo literally half a world away. This is so time consuming and frustrating most will not bother attempting it. It's also a given that a large percentage of domain owners make it difficult to contact them, fearing spam.
Endowment effects. Psychologically, once people own something, even something that isn't unique or special, they demand unrealistically high prices to part with it. In simple studies, subjects first choose a price for a coffee cup they don't own, then are given the same coffee cup and asked at what price they would be willing to part with it -- the second price they choose is massively higher.
Inequity aversion. Someone selling a domain doesn't want there to be some massive hidden value in a domain that they aren't pricing correctly; they may be reluctant to sell without a "fair" price but have absolutely no way of knowing what a fair price should be and may refuse to sell for this reason alone. For the same reason we may refuse to buy a domain at a good price since it seems so unfair that the current owner only paid $7 for it.

Endowment effects should be exacerbated if the current owner has invested emotionally or put time into owning the domain, but should be less of a problem for more sophisticated sellers.

Large-scale domain sellers should not be affected by this last set of inefficiencies. Big time domain parkers make it blatantly obvious how to contact them to buy the domain, have no emotional investment in any particular domain they own and should know from experience what is a fair price. Smaller domain owners are the problem with used domains.

We all hate domain parkers because we want an inefficient price for ourselves but efficient prices for everyone else. Dot-com names should cost $1000 and there should be another TLD that costs $1/year for all of our hobby projects.

Sunday, August 1, 2010

Can you do it? A simple web app, start to finish in one week

It's useful in life to set challenging, yet achievable goals. After some recent setbacks I decided to recover my ego with this simple task: given a good, simple idea for a useful web app, go from idea to launch in one week.

I'm happy to say that I completed this goal. The result is: http://www.dotcomroulette.com/

Was this challenging? Of course. I intended to still do all the normal things I need to do during the week, like going to the gym and going to the grocery store, and use up extra time that I would otherwise have spent puttering around doing nothing of use. I was largely able to accomplish this, but I did stay up late a couple nights. I counted two Wednesdays as one day and found bugs over the next few days, but sometimes you have to make success fit you rather than you fit success.

Several times I was set back by at least a day by an unexpected obstacle. The worst of these was WHOIS. My site is powered by Google App Engine. GAE is ideal in many ways but sandboxes you into a limited environment where the only network calls you can make are over HTTP. I had expected that someone had written some HTTP whois API that I could call to make queries. Instead I had difficulty finding any, and those that did exist were not reasonably priced for my needs since I had to make hundreds of whois queries for each name suggestion (so $0.01 per query is not going to be profitable for me, to say the least).

I ended up writing a threaded HTTP-WHOIS proxy server that I run somewhere else. This is a significant design limitation, since if it goes down the entire site is unusable. Given that the two systems fail independently, the site is only up when both are up, so the failure odds compound: p(fail) = 1 - (1 - p₁(fail)) * (1 - p₂(fail)). It's also a single ugly bottleneck for all those clouded GAE instances out there, sort of defeating the point.
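To put numbers on that: when a site needs two independently failing systems to both be up, the chance of an outage is one minus the chance that both are up. A tiny sketch, where the 1% and 2% failure rates are made-up figures for illustration:

```python
def combined_failure(*failure_probs):
    """Chance that at least one of several independent components is down."""
    all_up = 1.0
    for p in failure_probs:
        all_up *= 1.0 - p
    return 1.0 - all_up


# Hypothetical rates: app platform down 1% of the time, proxy down 2%.
print(round(combined_failure(0.01, 0.02), 4))  # 0.0298
```

For small probabilities the result is close to the sum of the two failure rates, so every extra dependency adds roughly its own downtime to the total.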

The frontend design also took at least 40% of the time. Design is not one of my strengths and this component of creating the site was a source of anxiety for me. Would I be able to make something passable in a short period of time? I ended up coming up with something that I liked, but it wasn't something that I felt 100% confident about in advance of doing it.

UI-wise I found that I had to rule out doing anything ambitious that I didn't more or less know how to do already in the interest of releasing quickly.

The algorithm was easy. I had this down in a couple hours. Including writing the entire bayes and markov engines from scratch. Goes to show how well I know this stuff now. I later went back and optimized it, for instance, by replacing lists of numbers with python's optimized array.array data structure.
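The array swap is a small change in practice. A sketch of the idea, with made-up counts and a type code chosen for illustration (the original code's data and types are not shown here):

```python
from array import array

# A plain list stores a pointer to a separate Python object per element.
counts_list = [3, 1, 4, 1, 5]

# array.array packs the values as raw machine integers ("l" = signed long)
# and still supports indexing, slicing and iteration like a list.
counts = array("l", counts_list)
counts[0] += 1
total = sum(counts)
```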

I also cut several features that I felt 1) complicated the layout, forcing me to make decisions about the layout that I wasn't certain about making and would have fretted over endlessly, and 2) didn't add that much value at launch time. For instance, I wanted to have a sidebar that would pull in retweets from the twitter api. Sometimes it's kind of hard to think of key words when you're staring at the input box and I wanted a way for people to get ideas using examples that others had shared. Additionally people could get a little feedback reward from doing a retweet, which gives me the obvious benefit of getting exposure to that person's twitter followers. But I couldn't make up my mind as to where I would put that element so that it wouldn't be distracting for people looking at the page for the first time and I couldn't even test the feature because twitter apparently couldn't even index the .appspot.com name for the site. In the interest of getting a release out the door, I cut the feature.

What did I learn?

I learned that I can make a genuinely useful web app in a week.

What if I could do this every week?