- limiting the size of each file downloaded
- allowing a crawl to be paused and the frontier to be examined and modified
- following links in CSS and Flash
- crawling multiple sites at the same time without invoking multiple instances of the crawler
- storing crawls in ARC files
The interface is not exactly intuitive, and you need to read nearly the entire manual to put together a decent crawl. Of course, if you want to use sophisticated open-source software, you usually have to put in significant effort to get it working right. Thankfully, several of the developers (Michael Stack, Igor Ranitovic, and Gordon Mohr) have been very helpful in answering some of my newbie questions on the Heritrix mailing list.
While learning about Heritrix, I’ve put together a page about it on Wikipedia. Hopefully the entry will drum up more general interest in Heritrix as well. I was really surprised no one had created the page before.