Monday, March 27, 2006

Warrick is gaining traction

On Friday I submitted a paper to HYPERTEXT ’06 about our web-repository crawler Warrick. The paper focuses on the difficulties of crawling web repositories like the Internet Archive and Google, and it presents an evaluation of three crawling policies. I really like the conference’s flexible 8-12 page ACM format. A few weeks ago when I was submitting a paper to SIGIR ’06, it was a real pain to get everything to fit on only 8 pages.

On that note, I have updated Warrick with the most recent changes:
  • Warrick now uses the Google API for accessing cached pages.
  • Warrick issues lister queries (queries that use the “site:” parameter) to Google by scraping the result pages.
  • Yahoo API libraries were updated due to a March 2006 change.
  • Several minor bugs were corrected.
The biggest reason for integrating the Google API was that Warrick kept getting blacklisted by Google after 150 or so queries. Michael suggested I write up my experiences in a technical report. Blacklisting like this is certainly going to influence researchers from now on who want to test Google for just about anything.
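
For readers who have not seen a lister query, here is a rough sketch of how one might be built. This is not Warrick's actual code. The function name `lister_query` and the `start` paging parameter are my own assumptions for illustration, and the snippet only builds the URL without sending any request.

```python
from urllib.parse import urlencode

def lister_query(site, start=0):
    """Build a search URL for a "site:" query (hypothetical helper).

    site  -- host name whose indexed pages we want listed
    start -- offset of the first result (assumed paging parameter)
    """
    # The "site:" operator restricts results to pages from one host.
    params = {"q": f"site:{site}", "start": start}
    # urlencode percent-encodes the colon and joins the parameters.
    return "https://www.google.com/search?" + urlencode(params)

if __name__ == "__main__":
    print(lister_query("www.example.com"))
```

Every page of results costs one query, so a tool that pages through a large site this way uses up a budget of 150 or so queries quickly.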

I also received several emails from the Internet Archive last week about Warrick. Apparently the guys who do backups for people with missing websites are excited about the tool, and IA will start pointing users to it:
If you are tech-savvy and know how to use command-line utilities, you can also refer to the Warrick tool here: and be sure to email the makers as they track who is using the tool. For this tool, a third party has put it together and we cannot guarantee the results. If you have questions about this tool, please refer your questions to the makers themselves.
One of the IA employees told me she has performed at least 200 recoveries for individuals in the past year. That’s a lot of people using “lazy preservation” and sure does support the need for research in this area.