The reason I mention our Sitemap project is that WWW 2009 is meeting in Madrid this week, and a paper entitled Sitemaps: Above and Beyond the Crawl of Duty is being presented today by Uri Schonfeld (UCLA) and Narayanan Shivakumar (Google). This is the first paper to report on widespread usage of Sitemaps on the Web, based on Google's crawling history.
Schonfeld & Shivakumar report that Sitemaps were used by approximately 35 million websites in late 2008, exposing several billion URLs. 58% of the URLs included a last modification date, 7% included a change frequency, and 61% included a priority. About 76.8% of Sitemaps used XML formatting, and only 3.4% used plain text. Interestingly, 17.5% of Sitemaps were formatted incorrectly.
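For readers unfamiliar with the format, the three optional fields those percentages refer to are the `<lastmod>`, `<changefreq>`, and `<priority>` elements of the Sitemap protocol. Here is a minimal sketch that parses a tiny, hypothetical Sitemap (the example.com URL and values are made up for illustration) using Python's standard library:

```python
import xml.etree.ElementTree as ET

# Hypothetical minimal Sitemap showing the three optional fields
# the paper's statistics refer to: lastmod, changefreq, priority.
SITEMAP = """<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>http://www.example.com/</loc>
    <lastmod>2008-11-01</lastmod>
    <changefreq>daily</changefreq>
    <priority>0.8</priority>
  </url>
</urlset>
"""

# The Sitemap protocol puts all elements in this namespace.
NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

def parse_sitemap(xml_text):
    """Return a list of (loc, lastmod, changefreq, priority) tuples;
    optional fields come back as None when absent."""
    root = ET.fromstring(xml_text)
    entries = []
    for url in root.findall("sm:url", NS):
        entries.append(tuple(
            url.findtext("sm:" + tag, default=None, namespaces=NS)
            for tag in ("loc", "lastmod", "changefreq", "priority")
        ))
    return entries

print(parse_sitemap(SITEMAP))
```

Since all three fields are optional, a crawler has to handle their absence, which is exactly why the paper reports how often each one actually appears in the wild.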
The figure below shows how many URLs Google discovered via Sitemaps (red) vs. regular crawling (green) for cnn.com. Notice that on any given day, more URLs were typically discovered via Sitemaps than via crawling.

Another interesting figure (below) shows when a URL was discovered via Sitemaps vs. regular web crawling for cnn.com. In most cases URLs were discovered at about the same time, but a number of them (dots below the line) were discovered much earlier via Sitemaps than via web crawling.

CNN's website is not atypical. Schonfeld & Shivakumar report that, across a dataset of more than 5 billion URLs, 78% were discovered first via Sitemaps, compared to 22% first via web crawling.
The paper also describes an algorithm that search engines can use to prioritize URLs discovered via both web crawling and Sitemaps. I've covered the highlights, but I recommend reading the paper if you're interested in the finer details.