Friday, October 06, 2006

Become.com's web crawler

Today a member of the Heritrix mailing list pointed everyone to an article on Sun’s website that discusses Become.com’s web crawler. The article was published in August 2005, so it’s a little out of date. I couldn’t find any more recent information on the crawler, but it is apparently proprietary, so the source code will likely never see the light of day.

Become.com actually developed two crawlers in 2004: one written entirely in Java and the other mostly in Java with some C++. The article states that the crawlers "may be the most sophisticated, massively scaled Java technology application in existence."

The article doesn’t mention Heritrix, a crawler that is also written entirely in Java. Although Heritrix doesn’t currently have a distributed architecture, it could still be deployed in a distributed environment. It would be really interesting to see Heritrix and Become.com’s crawler compete at the National Java Crawling Championships.