Friday, June 23, 2006

Heritrix - An archival quality crawler

This week I’ve been experimenting with Heritrix, the Internet Archive’s web crawler. It has some functionality that Wget doesn’t provide, including:
  • limiting the size of each file downloaded
  • allowing a crawl to be paused and the frontier to be examined and modified
  • following links in CSS and Flash
  • crawling multiple sites at the same time without invoking multiple instances of the crawler
  • storing crawls in the ARC file format
Since Heritrix was built with Java and was pre-configured to run on a Linux system, I didn’t have to expend much effort to get it to run on Solaris. I untarred the distribution file, set a couple of environment variables, started the web server interface, and boom it was working.
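The setup steps above looked roughly like this (the version number, install path, and admin credentials below are placeholders, not the ones I actually used):

```shell
# Hypothetical sketch of a Heritrix 1.x install; archive name,
# paths, and credentials are placeholders.
tar xzf heritrix-1.6.0.tar.gz
cd heritrix-1.6.0

# The two environment variables the launcher expects
export HERITRIX_HOME=$PWD
export JAVA_HOME=/usr/jdk    # wherever the JDK lives on your system

# Start the web-based admin console (port 8080 by default),
# setting the login and password for the web interface
$HERITRIX_HOME/bin/heritrix --admin=admin:letmein
```

From there, crawls are configured and launched through the web console rather than the command line.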

The interface is not exactly intuitive, and a near-complete reading of the manual is required to put together a decent crawl. Of course, if you want to use sophisticated open-source software, you usually have to put in significant effort to get it working right. Thankfully, several of the developers (Michael Stack, Igor Ranitovic, and Gordon Mohr) have been very helpful in answering my newbie questions on the Heritrix mailing list.

While learning about Heritrix, I put together a page about it on Wikipedia. Hopefully the entry will drum up more general interest in Heritrix as well. I was really surprised no one had created the page before.