Monday, October 08, 2007

Crawling behavior of Google, MSN, and Yahoo

Thanks, Marko, for sending me a link to this study analyzing the crawling behavior of the big three search engines. The experiment involved setting up a synthetic collection of web pages arranged as a binary search tree. They monitored the crawl log and performed search engine queries for a period of one year (2005-4-13 to 2006-4-13).
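To make the setup concrete, here is a minimal sketch of how a collection of pages could be linked as a binary tree. It is not the study's code: the `page_name` and `build_tree` helpers, the file naming scheme, and the heap-style numbering (page i links to pages 2i and 2i+1) are all my own assumptions for illustration.

```python
def page_name(index: int) -> str:
    """Hypothetical naming scheme: page 1 is the root of the tree."""
    return f"page{index}.html"


def build_tree(depth: int) -> dict[str, list[str]]:
    """Map each page to the pages it links to in a complete binary tree.

    Pages are numbered heap-style, so page i links to pages 2i and 2i+1
    whenever those pages exist.
    """
    total = 2 ** depth - 1  # number of pages in a tree with `depth` levels
    links: dict[str, list[str]] = {}
    for i in range(1, total + 1):
        children = [c for c in (2 * i, 2 * i + 1) if c <= total]
        links[page_name(i)] = [page_name(c) for c in children]
    return links


tree = build_tree(3)
```

With a layout like this, a crawler can only discover a page by following links down from the root, so the crawl log shows how deep each crawler is willing to go.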

The image below is a visualization of Yahoo's crawling behavior (Yahoo was the most active of the three crawlers). You can see an animation of the tree growing each day here.


We performed a similar experiment at ODU back in 2005, except we removed web pages from our collections every day to see how long they would stay cached by the search engines. And Joan, a fellow Ph.D. student at ODU who worked on the previous experiment, has been performing another experiment over the last several months with a very deep and wide collection of pages. (You can find links to the pages at the bottom-right corner of this page labeled Joan1-4.)

What I found really interesting about this experiment was the methodology used to create the web pages. They altered the pages by appending the crawl log of the pages and allowing anyone (including spam bots) to make comments that appeared on the pages. This kept the page contents unique and changing to entice more crawls.
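Here is a rough sketch of what that kind of page generation might look like. I don't know how the authors actually implemented it, so the `render_page` function, the HTML layout, and the log entry format are all hypothetical. The idea is just that each time the page is served, its own crawl history and any visitor comments are appended to the body.

```python
import html


def render_page(title: str, body: str, crawl_log: list[str], comments: list[str]) -> str:
    """Build a page whose content changes as crawls and comments accumulate.

    `crawl_log` holds one text entry per recorded crawler visit and
    `comments` holds visitor-submitted text. Both are escaped before being
    embedded, since anyone (including spam bots) can post a comment.
    """
    log_items = "".join(f"<li>{html.escape(entry)}</li>" for entry in crawl_log)
    comment_items = "".join(f"<li>{html.escape(text)}</li>" for text in comments)
    return (
        f"<html><head><title>{html.escape(title)}</title></head><body>"
        f"<p>{html.escape(body)}</p>"
        f"<h2>Crawl log</h2><ul>{log_items}</ul>"
        f"<h2>Comments</h2><ul>{comment_items}</ul>"
        "</body></html>"
    )


# Hypothetical data: the bot name and log format are made up for this example.
page = render_page(
    "Node 1",
    "Root of the tree",
    ["2005-04-13 ExampleBot"],
    ["<b>buy now</b>"],
)
```

Because every crawl adds a line to the log, the page is never byte-for-byte the same on two consecutive visits, which is the property that would encourage a crawler to come back.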

I really hope the authors of the study will submit their findings for publication in a peer-reviewed conference or journal.