Thursday, January 18, 2007

Store your data in a search engine cache

I have to say it… Google is the best thing since indoor plumbing. This morning I was wondering if anyone has been writing about my program Warrick, so I did a quick search for “warrick mccown”, just to see what would pop up. On the second page of results, I found a link to a paper that was published in October 2006: Using Free Web Storage for Data Backup.

The paper was written by some researchers from Stony Brook University who have developed two backup systems: CrawlBackup for storing files in a search engine’s cache, and MailBackup for storing files in the mailboxes of Internet email providers. Their work is remarkably similar to ours, and it almost makes me wonder if our place is bugged.

This paper is the first to actually cite Warrick. The paper also cites an interesting blog posting from Dec 2005: How the Google Cache can save Your A$$. OK, not the best title in the world, but it’s the only pseudo-article I've found where someone has documented using the Google cache to recover a lost website. In this case the guy accidentally deleted 30 articles from his website and used Google’s cache to recover them. It was just a few months earlier that I had finished work on Warrick, which could have automated the process for him (at least he only had to recover 30 pages!). He also used the Internet Archive to recover a client’s website a few years ago.

So I’m really glad to have found these related resources. What’s unfortunate is that finding related work is often much more difficult than running a simple Google search (or even a Google Scholar search). Google may produce a few gold nuggets, but it also produces a lot of false positives: why is the third result a production chart for Shaun Alexander (I really don't need to be reminded of the Cowboys' loss in the playoffs)? The word Warrick appears only once on that page, and it’s in a drop-down list box! And lest we forget, Google does not have the entire Web indexed. If I really wanted to be diligent I'd also use MSN, Yahoo, or a metasearcher like Dogpile.

Anyway, I’m still hoping someday for a Google SuperScholar system that takes all my papers, notes, etc. and figures out what is most related on the Web and in every digital library in existence and sends me weekly updates with precise summaries of why the information found is relevant. Maybe it should be called Google ScholarHeaven.
:)