Wednesday, April 22, 2009

Nutch, Sitemaps, and Google's findings

My search engine class is winding down, but our final project is to implement a Sitemap Protocol parser for Nutch, a popular open-source search engine. I mentioned a while back that Nutch is not for wimps... my students would certainly vouch for the steep learning curve involved in making code modifications. I've even had to scale back how much work my students do because of the complexity of the changes required. I'm going to do the difficult part of integrating their code with the innards of Nutch sometime in the next few weeks.
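To give a flavor of what the parsing side of the project looks like, here's a minimal sketch of pulling URLs out of a sitemap file with Java's standard DOM parser. The class and method names are my own for illustration; they are not Nutch's actual plugin API, and a real parser would also handle sitemap index files, gzip, and the optional fields.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;

// Illustrative sketch only -- not Nutch's real API.
public class SitemapParser {

    // Extract every <loc> URL from a sitemap XML stream.
    public static List<String> extractUrls(InputStream in) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder().parse(in);
        NodeList locs = doc.getElementsByTagName("loc");
        List<String> urls = new ArrayList<>();
        for (int i = 0; i < locs.getLength(); i++) {
            urls.add(locs.item(i).getTextContent().trim());
        }
        return urls;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<?xml version=\"1.0\" encoding=\"UTF-8\"?>"
            + "<urlset xmlns=\"http://www.sitemaps.org/schemas/sitemap/0.9\">"
            + "<url><loc>http://example.com/</loc>"
            + "<lastmod>2009-04-01</lastmod></url>"
            + "</urlset>";
        List<String> urls = extractUrls(
            new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        System.out.println(urls);  // prints [http://example.com/]
    }
}
```

Even this toy version hints at why the real integration is hard: Nutch's fetcher and URL database have to be taught where these URLs came from, which is the part I'll be wiring up myself.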

The reason I mention our Sitemap project is that WWW 2009 is meeting in Madrid this week, and a paper entitled Sitemaps: Above and Beyond the Crawl of Duty is being presented today by Uri Schonfeld (UCLA) and Narayanan Shivakumar (Google). This is the first paper to report on widespread usage of Sitemaps on the Web, drawing on Google's crawl history.

Schonfeld & Shivakumar report that Sitemaps were used by approximately 35 million websites in late 2008, exposing several billion URLs. 58% of the URLs included a last modification date, 7% included a change frequency, and 61% included a priority. 76.8% of Sitemaps used XML formatting, and only 3.4% used plain text. Interestingly, 17.5% of Sitemaps were formatted incorrectly.
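For readers who haven't seen one, the XML flavor of the protocol is quite simple. This example uses the standard fields from sitemaps.org (the URL is a placeholder):

```xml
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>http://www.example.com/</loc>
    <lastmod>2009-04-01</lastmod>    <!-- last modification date: in 58% of URLs -->
    <changefreq>daily</changefreq>   <!-- change frequency: in only 7% -->
    <priority>0.8</priority>         <!-- priority: in 61% -->
  </url>
</urlset>
```

The plain-text alternative is just one URL per line, which perhaps explains why so few sites bother with it when XML carries the extra metadata.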

The figure below shows how many URLs Google discovered via Sitemaps (red) vs. regular crawling (green) for cnn.com. Notice that on most days, more URLs were discovered via Sitemaps than via crawling.

Another interesting figure (below) plots when each URL was discovered via Sitemaps vs. regular web crawling for cnn.com. In most cases URLs were discovered at about the same time, but a number of them (dots below the line) were discovered via Sitemaps much earlier than by web crawling.


CNN's website is not typical, though: Schonfeld & Shivakumar report that in a dataset of more than 5 billion URLs, 78% were discovered first via Sitemaps, compared to 22% first via web crawling.

The paper also describes an algorithm search engines can use to prioritize URLs discovered via both web crawling and Sitemaps. I've covered the highlights, but I recommend you read the paper if you're interested in the finer details.