Questio Verum: Heritrix - An archival quality crawler

Friday, June 23, 2006

Heritrix - An archival quality crawler

This week I’ve been experimenting with Heritrix, the Internet Archive’s web crawler. It has some functionality that Wget doesn’t provide including:

limiting the size of each file downloaded
allowing a crawl to be paused and the frontier to be examined and modified
following links in CSS and Flash
crawling multiple sites at the same time without invoking multiple instances of the crawler
storing crawls in an Arc file

Since Heritrix was built with Java and was pre-configured to run on a Linux system, I didn’t have to expend much effort to get it to run on Solaris. I untarred the distribution file, set a couple of environment variables, started the web server interface, and boom it was working.

The interface is not exactly intuitive, and a near complete reading of the entire manual is required to put together a decent crawl. Of course if you want to use sophisticated open-source software, you usually have to put in some significant effort to get it to work right. Thankfully, several of the developers (Michael Stack, Igor Ranitovic, and Gordon Mohr) have been very helpful in answering some of my newbie questions on the Heritrix list serve.

In learning about Heritrix, I’ve put together a page on Wikipedia. Hopefully the entry will drum up more general interest in Heritrix as well. I was really surprised no one had created the page before.

Questio Verum

Friday, June 23, 2006

Heritrix - An archival quality crawler

No comments:

Post a Comment