Wednesday, August 09, 2006

Crawling the Web is very, very hard…

I’ve spent the past couple of weeks trying to randomly select 300 websites. There were only a couple of restrictions I placed on the selection:
  1. The website’s root URL should not redirect the crawler to a URL that is on a different host. If it does, the new URL should replace the old website URL.
  2. The website’s root URL should not appear to be a splash page for a website on a different host or indicate that the website has moved to a different host. If it does, the new URL should replace the old website URL.
  3. The website should not have all of its contents blocked by robots.txt. If some directories are blocked, that’s ok.
  4. The website’s root URL should not have a noindex/nofollow meta tag which would prevent a crawler from grabbing anything else on the website.
  5. The website should not have any more than 10K resources.
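Requirements 1 and 2 mostly come down to checking whether you end up on a different host than you started on. A minimal sketch of that comparison, assuming you've already followed the redirect chain and have the final URL in hand (the example.com/example.org URLs are made up for illustration):

```python
from urllib.parse import urlparse

def same_host(original_url, final_url):
    """Return True if the final URL (after any redirects) is on the same host."""
    return urlparse(original_url).netloc.lower() == urlparse(final_url).netloc.lower()

print(same_host("http://example.com/", "http://example.com/index.html"))  # → True
print(same_host("http://example.com/", "http://www.example.org/"))        # → False
```

This only flags the obvious cases — the "splash page pointing elsewhere" half of requirement 2 still needs human eyes.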

The restrictions seem very straightforward, but in practice they are very time-consuming to enforce. Requirement 2 requires me to manually visit the site. Did I mention not all the sites are in English? That makes it even more difficult. Requirement 3 means I have to manually examine the robots.txt. Requirement 4 requires manual examination of the root page, and requirement 5 means I have to waste several days crawling a site before I know whether it is too large.

I guess I could build a tool for requirements 3 and 4, but I’m not in the mood.
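If I ever do get in the mood, such a tool wouldn't need much: Python's standard library can parse robots.txt and the root page's meta tags. A rough sketch — the fetching step is omitted, so you'd feed it the downloaded robots.txt text and root-page HTML yourself:

```python
from urllib.robotparser import RobotFileParser
from html.parser import HTMLParser

def blocks_everything(robots_txt, user_agent="*"):
    """Requirement 3: True if robots.txt denies the site root to this agent."""
    rp = RobotFileParser()
    rp.parse(robots_txt.splitlines())
    return not rp.can_fetch(user_agent, "/")

class MetaRobotsParser(HTMLParser):
    """Collects the directives from any <meta name="robots"> tag on the page."""
    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag == "meta" and a.get("name", "").lower() == "robots":
            self.directives.update(
                d.strip().lower() for d in a.get("content", "").split(","))

def has_noindex_nofollow(html):
    """Requirement 4: True if the page carries both noindex and nofollow."""
    parser = MetaRobotsParser()
    parser.feed(html)
    return "noindex" in parser.directives and "nofollow" in parser.directives
```

For example, `blocks_everything("User-agent: *\nDisallow: /")` is True, while a robots.txt that only blocks some directories passes.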

Anyway, I ended up making about 50 replacements (at least) and starting my crawls over again. Now I finally have 300 websites that meet my requirements.

In the past I’ve used Wget to do my crawling, but I’ve decided to use Heritrix since it has quite a few useful features missing from Wget. But Heritrix isn’t perfect. I made a suggestion that Heritrix show the number of URLs from each host remaining in the frontier when examining a crawl report:

Right now it is very difficult to tell if a host has been completely crawled or not. I would love to work on this myself, but I just don’t have the time right now. Maybe I'll get a student to work on this next time I'm teaching. ;)
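The report I have in mind is really just a per-host tally. Assuming the frontier's queued URLs could be dumped one per line, the counting itself is trivial (the frontier list here is made up for illustration):

```python
from collections import Counter
from urllib.parse import urlparse

def urls_per_host(urls):
    """Count how many queued URLs remain for each host."""
    return Counter(urlparse(u).netloc for u in urls)

frontier = [
    "http://example.com/a",
    "http://example.com/b",
    "http://www.example.org/",
]
# Prints each host with its remaining-URL count, busiest host first.
for host, count in urls_per_host(frontier).most_common():
    print(host, count)
```

A host with zero URLs left in the frontier has been completely crawled — which is exactly what the crawl report currently won't tell you.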

The other difficulty with Heritrix is in extracting what you have crawled. I will need to write a script that will build an index into the ARC files so I can quickly extract data for a website. Since all the crawl data is merged into a series of ARC files, it is really difficult to throw away crawl data for a website you aren’t interested in. I could write a script to do it, but at this point it’s not worth my time.
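Here's roughly what that index-building script would look like. This is only a sketch under some assumptions: an uncompressed ARC 1.0 file whose records each begin with a one-line header ending in the record length, and no special handling of the leading filedesc:// record that real ARC files carry. The two-record byte stream at the bottom is synthetic, just to exercise the indexer:

```python
def index_arc(data):
    """Build {url: (content_offset, content_length)} from raw ARC-style bytes.

    Assumes each record starts with a one-line header of the form
    'URL IP-address archive-date content-type archive-length', followed by
    archive-length bytes of content and a separating newline.
    """
    index = {}
    pos = 0
    while pos < len(data):
        header_end = data.index(b"\n", pos)
        fields = data[pos:header_end].split(b" ")
        url, length = fields[0].decode(), int(fields[-1])
        content_start = header_end + 1
        index[url] = (content_start, length)
        pos = content_start + length + 1  # skip content and separator newline
    return index

# Tiny synthetic ARC-like stream (two records) just to exercise the indexer.
body_a, body_b = b"hello", b"world!!"
arc = (b"http://example.com/a 1.2.3.4 20060809 text/html %d\n" % len(body_a)
       + body_a + b"\n"
       + b"http://example.com/b 1.2.3.4 20060809 text/html %d\n" % len(body_b)
       + body_b + b"\n")
idx = index_arc(arc)
offset, length = idx["http://example.com/b"]
print(arc[offset:offset + length])  # → b'world!!'
```

With an index like this on disk, pulling one website's pages back out of a pile of ARC files becomes a seek-and-read instead of a full scan.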

Anyway, crawling is a very error-prone and difficult process, but I can’t wait to teach about it when I return to Harding! (I'm really getting the itch to get back in the classroom.)