Friday, January 06, 2006

Reconstructing Websites with Warrick

What happens when your hard drive crashes, the backups you meant to make are nowhere to be found, and your website has disappeared from the Web? Or what happens when your web hosting company has a fire, and all their backups of your website go up in flames? When such a calamity occurs, an obvious place to look for a backup of your website is the Internet Archive. Unfortunately, the Internet Archive doesn’t have the resources to archive every website out there. A less obvious place to look is in the caches that search engines like Google, MSN, and Yahoo make available.


My research focuses on recovering lost websites, and my research group has recently created a tool called Warrick, which can reconstruct a website by pulling missing resources from the Internet Archive, Google, Yahoo, and MSN. We have published some of our results using Warrick in a technical report that you can view at arXiv.org.
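
To give a rough feel for the idea, here is a minimal Python sketch of pulling each missing page from whichever repository happens to hold a copy. This is not Warrick's code: the `reconstruct` function, the "first repository with a copy wins" rule, and the in-memory dictionaries standing in for the Internet Archive and a search engine cache are all assumptions made purely for illustration.

```python
from typing import Callable, Dict, Iterable, Optional, Tuple

# A "repository" is modeled as any callable that maps a URL to stored
# content, or returns None when it holds no copy (hypothetical interface).
Repository = Callable[[str], Optional[bytes]]


def reconstruct(
    urls: Iterable[str], repositories: Dict[str, Repository]
) -> Dict[str, Tuple[str, bytes]]:
    """Return {url: (repository_name, content)} for each URL found somewhere."""
    recovered: Dict[str, Tuple[str, bytes]] = {}
    for url in urls:
        # Repositories are tried in the order they were listed.
        for name, lookup in repositories.items():
            content = lookup(url)
            if content is not None:
                # Assumption: the first repository with a copy wins.
                recovered[url] = (name, content)
                break
    return recovered


# Toy in-memory stand-ins for real web repositories (made-up data).
archive = {"http://example.com/": b"<html>home</html>"}
search_cache = {
    "http://example.com/": b"<html>home (cached)</html>",
    "http://example.com/about.html": b"<html>about</html>",
}

repos = {"archive": archive.get, "search_cache": search_cache.get}
result = reconstruct(
    [
        "http://example.com/",
        "http://example.com/about.html",
        "http://example.com/missing.html",
    ],
    repos,
)

for url, (source, _) in result.items():
    print(url, "recovered from", source)
```

Pages that no repository holds are simply left out of the result, which mirrors the fact that a reconstructed site may be incomplete.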

Warrick is currently undergoing some modifications as we get ready to perform a new batch of website reconstructions. Hopefully I’ll have a stable version of Warrick available for download soon.

Update on 3/20/07:

Warrick has been available for quite some time here, and our initial experiments were formally published in Lazy Preservation: Reconstructing Websites by Crawling the Crawlers (WIDM 2006).