Friday, February 10, 2006

Some thoughts on robots.txt

The Robots Exclusion Protocol has been around since June 1994, but there is no official standards body or RFC for the protocol. That leaves others free to tinker with it and add their own bells and whistles.

Numerous limitations of robots.txt have been noted (see Martijn Koster’s article). A few things it lacks: a way to specify how frequently a crawler may make requests, a way to state the ideal times for automated requests, and separate permissions for visiting vs. indexing vs. caching a page (that is, making it available in search engine caches).

According to Matt Cutts, Google supports the “Allow:” directive and wildcards (*), neither of which is part of the standard. The Google Sitemap team even developed a tool that can be used to check a robots.txt file for compliance with their non-standard “standard”. Matt went on to comment that Google does not support a time delay between requests because some webmasters use values that would only allow Google to crawl 15-20 URLs in a day. Yahoo and MSN do support a delay, using a “Crawl-delay: XX” directive.
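To make these extensions a bit more concrete, here is a rough sketch of how a crawler might read “Allow:” and “Crawl-delay:” lines, using the robots.txt parser from Python's standard library (urllib.robotparser, Python 3.6 or later for crawl_delay). The bot name ExampleBot, the example.com URLs, the paths, and the helper load_rules are all made up for illustration. This is not how Google, Yahoo, or MSN actually process the file. Also note that, as far as I know, this parser only does plain prefix matching, so Google-style wildcards inside a path are not handled.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt using two of the non-standard extensions
# mentioned above: "Allow:" and "Crawl-delay:".
ROBOTS_TXT = """\
User-agent: ExampleBot
Allow: /private/press/
Disallow: /private/
Crawl-delay: 10
"""


def load_rules(text):
    """Parse robots.txt content from a string (helper name is an assumption)."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


rules = load_rules(ROBOTS_TXT)

# This parser applies the first rule whose path prefix matches,
# so the Allow line wins for the press pages.
print(rules.can_fetch("ExampleBot", "http://example.com/private/press/a.html"))  # True
print(rules.can_fetch("ExampleBot", "http://example.com/private/secret.html"))   # False

# Seconds the site asks this bot to wait between requests.
print(rules.crawl_delay("ExampleBot"))  # 10
```

Since this parser stops at the first matching rule, the order of the Allow and Disallow lines matters here. Other crawlers may resolve the conflict differently, which is one more consequence of there being no official standard.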

Well, I'm out of thoughts. :) Stay tuned…