Wednesday, May 20, 2009

Java Sitemap Parser

I've just released the Java Sitemap Parser on SourceForge.net. The software can read Sitemaps in XML, Atom, RSS, and plain-text formats. As far as I can tell, this is the first open-source Sitemap-parsing software available on the Web.
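For readers who haven't looked inside a Sitemap before, here is a minimal sketch of what reading two of those formats involves. It uses only the JDK's standard XML classes and is not the Java Sitemap Parser's own API; the class and method names (SitemapSketch, parseXml, parseText) are made up for illustration, and it assumes a well-formed Sitemap that is already in memory as a string.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;

// Hypothetical illustration only; not part of the Java Sitemap Parser.
final class SitemapSketch {

    // XML format: collect the text of every <loc> element.
    static List<String> parseXml(String xml) throws Exception {
        DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Document doc = builder.parse(
                new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        NodeList locs = doc.getElementsByTagName("loc");
        List<String> urls = new ArrayList<>();
        for (int i = 0; i < locs.getLength(); i++) {
            urls.add(locs.item(i).getTextContent().trim());
        }
        return urls;
    }

    // Text format: one URL per line; blank lines are skipped.
    static List<String> parseText(String text) {
        List<String> urls = new ArrayList<>();
        for (String line : text.split("\\R")) {
            String trimmed = line.trim();
            if (!trimmed.isEmpty()) {
                urls.add(trimmed);
            }
        }
        return urls;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<urlset xmlns=\"http://www.sitemaps.org/schemas/sitemap/0.9\">"
                + "<url><loc>http://example.com/</loc></url></urlset>";
        System.out.println(parseXml(xml));
        System.out.println(parseText("http://example.com/a\nhttp://example.com/b\n"));
    }
}
```

A real parser also has to cope with Sitemap index files, gzip compression, malformed input, and the Atom and RSS variants, none of which this sketch attempts.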

The Java Sitemap Parser was the final project for my Search Engine Development class. A few weeks ago I wrote about the project and about how prevalent Sitemaps are becoming. Originally we wanted to add Sitemap support to Nutch, but developing the parser alone proved to be quite a task. By releasing it as an independent project, I'm hoping that Nutch, Heritrix, and other open-source crawlers will integrate it into their systems.