- The website’s root URL should not redirect the crawler to a URL on a different host. If it does, the new URL should replace the old website URL.
- The website’s root URL should not appear to be a splash page for a website on a different host or indicate that the website has moved to a different host. If it does, the new URL should replace the old website URL.
- The website should not have all of its contents blocked by robots.txt. If some directories are blocked, that’s ok.
- The website’s root URL should not have a noindex/nofollow meta tag, which would prevent a crawler from grabbing anything else on the website.
- The website should not have any more than 10K resources.
The restrictions seem very straightforward, but in practice they are very time-consuming to enforce. Requirement 2 requires me to manually visit the site. Did I mention not all the sites are in English? That makes it even more difficult. Requirement 3 means I have to manually examine the robots.txt, requirement 4 requires manual examination of the root page, and requirement 5 means I have to waste several days crawling a site before I know if it is too large or not.
I guess I could build a tool for requirements 3 and 4 (a rough sketch of what that might look like is below), but I’m not in the mood. Anyway, I ended up making about 50 replacements (at least) and starting my crawls over again. Now I finally have 300 websites that meet my requirements.
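For anyone who is in the mood, here is roughly what such a tool might look like. This is just a sketch in Python using the standard library’s urllib; the example URL is a placeholder, and the meta tag check is a crude regex heuristic rather than a real HTML parser:

```python
import re
import urllib.request
import urllib.robotparser

def check_site(root_url, agent="*"):
    """Check requirement 3 (robots.txt) and requirement 4 (robots meta tag)."""
    # Requirement 3: if robots.txt disallows even the root page, the whole
    # site is effectively blocked ("Disallow: /" is the common case).
    rp = urllib.robotparser.RobotFileParser()
    rp.set_url(root_url.rstrip("/") + "/robots.txt")
    rp.read()
    blocked = not rp.can_fetch(agent, root_url)

    # Requirement 4: look for <meta name="robots" content="...noindex...">.
    # A crude heuristic: it misses tags whose content attribute comes before
    # the name attribute, and it only examines the first 64 KB of the page.
    html = urllib.request.urlopen(root_url).read(65536).decode("utf-8", "replace")
    noindex = re.search(
        r'<meta[^>]+name=["\']robots["\'][^>]*content=["\'][^"\']*no(?:index|follow)',
        html, re.IGNORECASE) is not None

    return blocked, noindex

print(check_site("http://www.example.com/"))  # placeholder URL
```

Run over the whole list of candidate sites, something like this would at least flag the obvious offenders for manual review.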
In the past I’ve used Wget to do my crawling, but I’ve decided to use Heritrix since it has quite a few useful features missing from Wget. But Heritrix isn’t perfect. I made a suggestion that Heritrix show the number of URLs from each host remaining in the frontier when examining a crawl report:
Right now it is very difficult to tell whether a host has been completely crawled. I would love to work on this myself, but I just don’t have the time. Maybe I'll get a student to work on this next time I'm teaching. ;)
The other difficulty with Heritrix is extracting what you have crawled. I will need to write a script that builds an index into the ARC files so I can quickly extract the data for a given website. Since all the crawl data is merged into a series of ARC files, it is also really difficult to throw away the data for a website you aren’t interested in; I could write a script for that too, but at this point it’s not worth my time.
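If I ever do write it, the index script might look something like this. A rough sketch, assuming uncompressed version-1 ARC files (each record starts with a one-line header of URL, IP, date, MIME type, and length, followed by that many bytes of content); gzipped ARCs would need each record decompressed first, and the file name below is a placeholder:

```python
import os
from urllib.parse import urlparse

def arc_records(path):
    """Yield (url, offset, length) for each record in an uncompressed ARC file."""
    with open(path, "rb") as f:
        while True:
            offset = f.tell()
            header = f.readline()
            if not header:
                break  # end of file
            fields = header.decode("utf-8", "replace").split()
            if len(fields) < 5:
                continue  # blank separator line between records
            url, length = fields[0], int(fields[-1])
            f.seek(length, os.SEEK_CUR)  # skip over the record body
            yield url, offset, length

# Group record locations by host so one website's data can be pulled out
# (or thrown away) without scanning every ARC file again.
index = {}
for arc in ["crawl-00000.arc"]:  # placeholder file name
    for url, offset, length in arc_records(arc):
        index.setdefault(urlparse(url).netloc, []).append((arc, offset, length))
```

With the index in hand, extracting a website is just a matter of seeking to each of its record offsets and reading the bytes.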
Anyway, crawling is a very error-prone and difficult process, but I can’t wait to teach about it when I return to Harding! (I'm really getting the itch to get back in the classroom.)