Wednesday, January 04, 2006

Pulling out of the search engine index

It looks like Craigslist has decided to block the crawling of its web site.

Webmasterworld.com also banned search engines from crawling its pages back in November 2005. It took only two days for all of its pages to fall out of Google’s and MSN’s indexes. Interestingly, not only does the site ban all bots, it also uses its robots.txt file to post blog entries.
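For readers who have not seen one, a blanket crawler ban in robots.txt usually comes down to two lines. The sketch below is a generic example, not the actual file from either site, and the example.com URL and the is_allowed helper are made up for illustration. It uses Python's standard urllib.robotparser to check what a well-behaved bot would conclude from such a file.

```python
from urllib.robotparser import RobotFileParser

# A generic "ban everything" robots.txt (an assumption for illustration,
# not copied from Craigslist or Webmasterworld).
BAN_ALL = """\
User-agent: *
Disallow: /
"""

# For contrast: an empty Disallow line permits everything.
ALLOW_ALL = """\
User-agent: *
Disallow:
"""


def is_allowed(robots_txt, user_agent, url):
    """Return True if a polite crawler named user_agent may fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    page = "http://www.example.com/some/page.html"  # hypothetical URL
    print(is_allowed(BAN_ALL, "Googlebot", page))
    print(is_allowed(ALLOW_ALL, "Googlebot", page))
```

Note that robots.txt is only a request: it keeps out crawlers that choose to honor it, which the major search engines do.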

Popular sites like these devote huge amounts of bandwidth to crawler traffic. This is why we are researching more efficient ways for search engines to index content from web servers using OAI-PMH (the Open Archives Initiative Protocol for Metadata Harvesting), a protocol that is widely used in the digital library world. Michael L. Nelson (Old Dominion University) and others are building an Apache module called mod_oai, which allows a search engine to ask a web server "what new URLs do you have?" and "what content has changed since yesterday?" Widespread adoption of mod_oai could greatly reduce crawler traffic on the Web and allow sites with bandwidth limitations to jump back into the search engine game.
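To give a rough idea of what those two questions might look like on the wire, here is a small sketch that builds OAI-PMH request URLs. The verbs (ListIdentifiers, ListRecords) and the from and metadataPrefix arguments come from the OAI-PMH specification, and oai_dc is the metadata format every OAI-PMH repository must support. The base URL, the helper names, and the mapping of each question to a verb are assumptions for illustration; mod_oai's real endpoint and supported formats may differ.

```python
from urllib.parse import urlencode

# Hypothetical base URL for a server running mod_oai (an assumption).
BASE_URL = "http://www.example.com/modoai"


def oai_request(base_url, verb, args):
    """Build an OAI-PMH request URL for the given verb and arguments."""
    query = {"verb": verb}
    query.update(args)
    return base_url + "?" + urlencode(query)


def new_urls_since(base_url, date):
    """'What new URLs do you have?' asked as a ListIdentifiers request."""
    return oai_request(
        base_url,
        "ListIdentifiers",
        {"metadataPrefix": "oai_dc", "from": date},
    )


def changed_content_since(base_url, date):
    """'What content has changed since yesterday?' asked as ListRecords."""
    return oai_request(
        base_url,
        "ListRecords",
        {"metadataPrefix": "oai_dc", "from": date},
    )


if __name__ == "__main__":
    print(new_urls_since(BASE_URL, "2006-01-03"))
    print(changed_content_since(BASE_URL, "2006-01-03"))
```

The point of the design is that the crawler asks one question and gets back only what changed, instead of re-fetching every page to find out.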