Thursday, September 16, 2010

Get that 4 letter domain you've always wanted: include one digit

In our last experiment we attempted to estimate the proportion of pronounceable 5, 6 and 7-letter domains that are already registered.

But what about 4 letter domains?

I wrote a series of random domain generators to test different character distributions of 4 letter domains. (For all experiments n=500 and p=~0.0, chi-square test; all domains are dot-com.)

First, I assumed that pronounceability would not be a factor, and generated 500 domains consisting of 4 random letters. The results were what I expected:

  • random a-z : 100%

All 500 random domains that I generated were registered. I let it go for a half-hour or so and generated thousands of random domains and was not able to find one that was unregistered.

So, all 4 letter .com's are registered then?

Of course not. The secret? Digits!

Domains can also contain characters in the range 0-9. So, I tested a second domain generator that would produce random 4 character domains consisting of one digit in any position and 3 other characters that could be digits or letters chosen randomly from the set [a-z, 0-9]. 
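The two generators described so far can be sketched in a few lines of Python. This is a minimal sketch rather than the original code: the function names are made up for illustration, and the registration check itself (for example a WHOIS or DNS lookup) is left out because the post does not say how it was done.

```python
import random
import string

LETTERS = string.ascii_lowercase      # a-z
DIGITS = string.digits                # 0-9
ALNUM = LETTERS + DIGITS              # the 36-character set [a-z, 0-9]


def random_letters(rng, length=4):
    """Return `length` random letters a-z (hypothetical helper)."""
    return "".join(rng.choice(LETTERS) for _ in range(length))


def random_with_digit(rng, length=4):
    """Return a name with a digit forced into one random position.

    The other positions are drawn from [a-z, 0-9], so a name can end
    up with anywhere from one digit to `length` digits.
    """
    chars = [rng.choice(ALNUM) for _ in range(length)]
    chars[rng.randrange(length)] = rng.choice(DIGITS)
    return "".join(chars)


rng = random.Random()
sample = [random_with_digit(rng) + ".com" for _ in range(5)]
print(sample)
```

Forcing the digit after filling all four positions keeps the code short, but it means names with several digits are drawn more often than they would be under a uniform draw from all names containing at least one digit.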

The result:

  • random a-z, 0-9 with at least one digit: 22.4%

Only 22.4% were registered! So, if you want a 4 letter domain, use a digit.

This got me thinking. Any of the domains I generated could have from 1-4 digits. What if I controlled the number of digits?

  • 1 digit, 3 letters: 16.2%
  • 2 digits, 2 letters: 24.6%
  • 3 digits, 1 letter: 30.6%
  • 4 digits: 100%

The results surprised me. Apparently the optimal number of digits to include is 1, and the more digits a domain has, the more likely it is to be registered. In fact, it appears that someone has registered every dot-com combination of 4 digits.
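Fixing the number of digits takes only a small change to the generator. The sketch below assumes the digit positions are chosen uniformly at random, which the post does not spell out; `random_with_k_digits` is a hypothetical name.

```python
import random
import string

LETTERS = string.ascii_lowercase
DIGITS = string.digits


def random_with_k_digits(rng, k, length=4):
    """Return a name with exactly `k` digits and `length - k` letters."""
    # Pick which positions hold digits; every other position gets a letter.
    digit_positions = set(rng.sample(range(length), k))
    return "".join(
        rng.choice(DIGITS) if i in digit_positions else rng.choice(LETTERS)
        for i in range(length)
    )


rng = random.Random()
for k in range(1, 5):
    print(k, random_with_k_digits(rng, k) + ".com")
```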

The story is not over yet, though: what if you include a hyphen? I tried several experiments with 3 letters or digits plus a hyphen to find out.

  • 3 random letters a-z + hyphen: 60.6%
  • 3 random characters a-z, 0-9 + hyphen: 17.4%
  • 3 random digits 0-9 + hyphen: 48%

Including a hyphen does not beat the 1 digit + 3 letter domain space (16.2%) but comes close at 17.4%. Interestingly, domains in the digits + hyphen set were dramatically more likely to be registered than those in the mixed letters and digits + hyphen set.
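To put a number on "dramatically", here is a sketch of a chi-square test comparing two of the hyphen experiments. It assumes n=500 for each experiment, as stated at the top of the post, so 48% becomes 240 registered and 17.4% becomes 87. The post does not say which comparisons were actually tested or how, so treat this as an illustration: a plain Pearson test on a 2x2 table, with no continuity correction.

```python
import math


def chi_square_2x2(registered_a, n_a, registered_b, n_b):
    """Pearson chi-square statistic and p-value for two proportions."""
    a, b = registered_a, n_a - registered_a   # group A: registered, free
    c, d = registered_b, n_b - registered_b   # group B: registered, free
    total = n_a + n_b
    chi2 = total * (a * d - b * c) ** 2 / (
        (a + b) * (c + d) * (a + c) * (b + d)
    )
    # With 1 degree of freedom, P(X > chi2) = erfc(sqrt(chi2 / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value


# digits + hyphen (240 of 500) vs. letters/digits + hyphen (87 of 500)
chi2, p_value = chi_square_2x2(240, 500, 87, 500)
print(f"chi2 = {chi2:.2f}, p = {p_value:.2e}")
```

The statistic comes out a little above 106, and the p-value is far too small to be a sampling accident, which fits the "p=~0.0" noted at the start of the post.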

So, contrary to what you might believe, it turns out there are plenty of available 4 letter domains.

There are 456,976 possible 4 letter combinations of the letters a-z (26^4), and there are 703,040 possible combinations of 3 letters a-z plus one digit 0-9 ((10 * 26^3) * 4). Assuming the 16.2% proportion is safe to extrapolate on, there should be 589,147 unregistered one digit + 3 letter domains, more than the total number of possible 4 letter a-z domains. Popular wisdom suggests that attaining a 4 letter dot-com is nearly impossible. These results suggest that is not the case, if you're willing to include a digit.
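The arithmetic above is easy to check. The sketch below reproduces the counts and adds a rough 95% margin of error on the 16.2% figure, using the normal approximation for a sample of 500; the margin is my addition and not something the original experiment reported.

```python
import math

letters_only = 26 ** 4                    # all 4 letter a-z names
one_digit_three_letters = 10 * 26 ** 3 * 4  # digit value * letters * digit position

p_registered = 0.162   # observed proportion registered
n = 500                # sample size per experiment

# Truncated to a whole number of domains, matching the figure in the text.
estimated_unregistered = int(one_digit_three_letters * (1 - p_registered))

# Normal-approximation margin of error on the sampled proportion.
margin = 1.96 * math.sqrt(p_registered * (1 - p_registered) / n)

print(f"{letters_only:,} four-letter names")
print(f"{one_digit_three_letters:,} one digit + 3 letter names")
print(f"{estimated_unregistered:,} estimated unregistered")
print(f"registered proportion: {p_registered:.3f} +/- {margin:.3f}")
```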

Finally, I attempted one last experiment: sample the complete space of possible 4 character domains, with each character chosen at random from 36 options (letters and digits) or 37 options (letters, digits and hyphen). The first and last characters use the 36-character set because hyphens cannot lead or end a domain. Results:

  • any legal combination of letters, digits and hyphens: 43.0%

Out of the 1,774,224 possible 4 character dot-com domains (36^2 * 37^2), fewer than half are actually registered.
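A sampler for this full space could look like the following. It is a sketch under the assumption that each position is drawn independently and uniformly, with the hyphen allowed only in the two middle positions; the names `EDGE`, `MIDDLE` and `random_legal_name` are illustrative.

```python
import random
import string

EDGE = string.ascii_lowercase + string.digits   # 36 options: first and last
MIDDLE = EDGE + "-"                              # 37 options: middle two


def random_legal_name(rng):
    """Return a random 4 character name with no leading or trailing hyphen."""
    return (
        rng.choice(EDGE)
        + rng.choice(MIDDLE)
        + rng.choice(MIDDLE)
        + rng.choice(EDGE)
    )


TOTAL = len(EDGE) ** 2 * len(MIDDLE) ** 2
print(f"{TOTAL:,} possible names")

rng = random.Random()
print([random_legal_name(rng) + ".com" for _ in range(5)])
```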

Update: Well, this post has generated significant interest. Check out my domain name generator, and here are some useful registrar coupon codes: for GoDaddy use code FALL99, and for Namecheap use BACK2REALITY. Let me know if they work.

Tuesday, September 14, 2010

What is the probability that a pronounceable domain is registered?

I saw someone griping on Hacker News that all the short, pronounceable domains are taken.

I realized that I knew how to quantify this frustration. One, how many pronounceable domains exist for a given number of characters? Two, given a finite answer to One, what is the probability that a domain in this space is registered?

I recently made a tool to find pronounceable domains, partially using Markov techniques. I started with a published frequency table of letter trigrams in English and simply brute-forced my way to find every possible overlapping combination of trigrams that would fit in a given length of characters. I assumed that a domain had to start with a valid starting trigram but could end at the length limit with any trigram. This process produces a set of some real but mostly nonsense words that should be pronounceable to English speakers. All results are limited to dot-com domains.
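The brute-force enumeration can be sketched as follows. This is not the original tool: it assumes that "overlapping" means each new trigram shares its first two letters with the last two letters of the word so far, and the tiny trigram table is invented purely for illustration (the real one came from a published English frequency table).

```python
from collections import defaultdict


def enumerate_names(trigrams, start_trigrams, length):
    """List every string of `length` chars built from overlapping trigrams.

    The word must begin with one of `start_trigrams`; after that, each
    added letter must complete a trigram with the previous two letters.
    """
    # Map each 2-letter prefix to the letters that may follow it.
    next_letters = defaultdict(set)
    for t in trigrams:
        next_letters[t[:2]].add(t[2])

    results = []
    stack = list(start_trigrams)
    while stack:
        word = stack.pop()
        if len(word) == length:
            results.append(word)
            continue
        for letter in next_letters.get(word[-2:], ()):
            stack.append(word + letter)
    return sorted(results)


# Hypothetical toy tables, not real frequency data.
TOY_TRIGRAMS = {"the", "her", "ere", "hen", "ent", "res"}
TOY_STARTS = {"the"}

print(enumerate_names(TOY_TRIGRAMS, TOY_STARTS, 5))  # ['thent', 'there']
```

With a realistic trigram table the result sets grow quickly with length, which matches the roughly sixfold jump per extra character reported in the results.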

The results:

  • For length = 5, there are 265,722 total possible pronounceable domains
  • For length = 6, there are 1,702,669 total possible pronounceable domains
  • For length = 7, there are 10,843,465 total possible pronounceable domains

Next I randomly sampled from these spaces to figure out the percentage of the possible dot-com domains for each length that are already registered. The results:

  • For length = 5, of 481 sampled from 265,722 possible names, 92.52% were registered
  • For length = 6, of 941 sampled from 1,702,669 possible names, 45.48% were registered
  • For length = 7, of 1906 sampled from 10,843,465 possible names, 12.8% were registered

So 92.52% of pronounceable 5 letter dot-com domains are registered, compared with 45.48% of 6 letter domains and only 12.8% of 7 letter domains. (By the way, I've never been able to find an alphabetical 4 letter domain that is not already registered.) Eyeballing the results, the registered domains did tend to have a better "ring" to them than those that were not, and seven-letter nonsense words often had no ring to them at all, but automatically assessing domain quality was not attempted in this analysis.
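Scaling the sampled proportions up to each full space gives a rough count of names still available. The sketch below uses the sample sizes and percentages from the lists above; the 95% margin of error (normal approximation) is my addition and was not part of the original analysis.

```python
import math

# length: (size of pronounceable space, names sampled, proportion registered)
SAMPLES = {
    5: (265_722, 481, 0.9252),
    6: (1_702_669, 941, 0.4548),
    7: (10_843_465, 1906, 0.128),
}


def summarize(space, n, p_registered):
    """Return (estimated unregistered names, 95% margin on the proportion)."""
    unregistered = round(space * (1 - p_registered))
    margin = 1.96 * math.sqrt(p_registered * (1 - p_registered) / n)
    return unregistered, margin


for length, (space, n, p) in SAMPLES.items():
    unregistered, margin = summarize(space, n, p)
    print(
        f"len = {length}: about {unregistered:,} unregistered "
        f"(registered {p:.2%} +/- {margin:.2%})"
    )
```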

This should put a number on our frustrations, and maybe a year from now I can run the same script and see how the numbers have changed.

The area of unregistered pronounceable 5 letter domains, representing 19,876 names, is barely visible. This is plotted according to the absolute number of domains on a linear scale. The next chart shows the percentage available per length:

To give you an idea of what the output of this process looks like, I reproduced some output for each character length below.

[Table: sample output for len = 5, len = 6 and len = 7, with each generated name marked registered or available.]