- The website’s root URL should not redirect the crawler to a URL that is on a different host. If it does, the new URL should replace the old website URL.
- The website’s root URL should not appear to be a splash page for a website on a different host or indicate that the website has moved to a different host. If it does, the new URL should replace the old website URL.
- The website should not have all of its contents blocked by robots.txt. If some directories are blocked, that’s ok.
- The website’s root URL should not have a noindex/nofollow meta tag which would prevent a crawler from grabbing anything else on the website.
- The website should not have more than 10K resources.
The restrictions seem very straightforward, but in practice they are very time-consuming to enforce. Requirement 2 requires me to manually visit the site. Did I mention not all of the sites are in English? That makes it even more difficult. Requirement 3 means I have to manually examine the robots.txt. Requirement 4 requires manual examination of the root page, and requirement 5 means I have to spend several days crawling a site before I know whether it is too large.
I guess I could build a tool for requirements 3 and 4, but I’m not in the mood.
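For what it’s worth, here’s a minimal sketch of what such a tool might look like in Python, using only the standard library. The function names are mine, and a real version would need timeouts and error handling:

```python
from html.parser import HTMLParser
from urllib.parse import urlsplit
import urllib.request
import urllib.robotparser

class RobotsMetaParser(HTMLParser):
    """Collects the content of any <meta name="robots"> tags on a page."""
    def __init__(self):
        super().__init__()
        self.directives = []

    def handle_starttag(self, tag, attrs):
        if tag == "meta":
            attrs = dict(attrs)
            if attrs.get("name", "").lower() == "robots":
                self.directives.append(attrs.get("content", "").lower())

def redirects_off_host(root_url):
    """Requirement 1: does the root URL redirect to a different host?"""
    with urllib.request.urlopen(root_url) as resp:  # urlopen follows redirects
        return urlsplit(resp.geturl()).hostname != urlsplit(root_url).hostname

def robots_blocks_everything(root_url):
    """Requirement 3: does robots.txt disallow the whole site?"""
    rp = urllib.robotparser.RobotFileParser(root_url.rstrip("/") + "/robots.txt")
    rp.read()
    return not rp.can_fetch("*", root_url)

def root_has_noindex_nofollow(root_url):
    """Requirement 4: does the root page carry a noindex/nofollow meta tag?"""
    with urllib.request.urlopen(root_url) as resp:
        page = resp.read().decode("utf-8", errors="replace")
    parser = RobotsMetaParser()
    parser.feed(page)
    return any("noindex" in d or "nofollow" in d for d in parser.directives)

if __name__ == "__main__":
    import sys
    for url in sys.argv[1:]:
        print(url,
              "redirects:", redirects_off_host(url),
              "blocked:", robots_blocks_everything(url),
              "noindex:", root_has_noindex_nofollow(url))
```

Requirements 2 and 5 would still need a human (or a finished crawl), but this would at least knock out the mechanical checks.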
Anyway, I ended up making at least 50 replacements and starting my crawls over again. Now I finally have 300 websites that meet my requirements.

In the past I’ve used Wget to do my crawling, but I’ve decided to use Heritrix since it has quite a few useful features missing from Wget. But Heritrix isn’t perfect. I made a suggestion that Heritrix show the number of URLs from each host remaining in the frontier when examining a crawl report:
http://sourceforge.net/tracker/index.php?func=detail&aid=1533116&group_id=73833&atid=539102
The other difficulty with Heritrix is in extracting what you have crawled. I will need to write a script that builds an index into the ARC files so I can quickly extract the data for a given website. Since all the crawl data is merged into a series of ARC files, it is really difficult to throw away the crawl data for a website you aren’t interested in. I could write a script to do that too, but at this point it’s not worth my time.
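If I ever get around to it, the index script might look something like this rough sketch. It assumes uncompressed ARC v1 files with the standard five-field record header; real Heritrix output is usually per-record gzipped (.arc.gz), so a production version would have to walk gzip members instead:

```python
import os
import sys

def index_arc(path, index):
    """Record (file, offset, length) for every record in one uncompressed ARC file."""
    with open(path, "rb") as f:
        while True:
            offset = f.tell()
            header = f.readline()
            if not header:
                break  # end of file
            fields = header.decode("ascii", errors="replace").split()
            if len(fields) < 5:
                continue  # blank separator line between records
            # ARC v1 header line: URL IP-address archive-date content-type length
            url, length = fields[0], int(fields[-1])
            index.setdefault(url, []).append((path, offset, length))
            f.seek(length, os.SEEK_CUR)  # skip over the record body

def extract(entry):
    """Pull a single record body back out of an ARC file via an index entry."""
    path, offset, length = entry
    with open(path, "rb") as f:
        f.seek(offset)
        f.readline()  # skip the header line
        return f.read(length)

if __name__ == "__main__":
    index = {}
    for arc_file in sys.argv[1:]:
        index_arc(arc_file, index)
    print(len(index), "distinct URLs indexed")
```

With an index like that, discarding a site would just mean copying every record whose URL isn’t on that host into a new ARC file.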
Frank, here's a question for you that's sort of crawler-related.
I created a site for a client (www.spiritmedia.com) that has a new description in its meta tags. Google correctly picks up the new meta tags; Yahoo and MSN are still showing the old metadata after, like, 6 months.
Do the other search engines not crawl/parse the meta tags? I could see it in Yahoo's case because it's more of a directory. However, I'm at a loss on MSN.
To be honest, I don't do a lot of SEO work - I try to make sites that have a good content-to-code ratio, and let the cards fall where they may.