Wednesday, July 19, 2006

That’s not a cache... that’s an archive!

This morning I stumbled across a March 2006 blog posting by Danny Sullivan entitled “25 Things I Hate About Google”. Sullivan’s opinions carry a lot of weight in the search engine world, and so I started to sweat when I saw number 9 on his list:

9. Stop caching pages: I was all for opt-out with cached pages until a court gave you far more right to reprint anything than anyone could have expected. Now you've got to make it opt-in. You helped create the caching mess by just assuming it was legal to reprint web pages online without asking, using opt-out as your cover. Now you've had that backed up legally, but that doesn't make it less evil.

Sullivan doesn’t agree with the January 2006 Nevada federal court ruling that Google’s cached pages do not constitute copyright infringement, a ruling that effectively approved the opt-out policy search engines implement through the noarchive meta tag. Sullivan and others make some good points in the forum discussing the ruling, pointing out where it may be flawed.
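For anyone who hasn’t seen the opt-out in practice: a page asks not to be cached by including a tag such as `<meta name="robots" content="noarchive">` in its head. Below is a minimal Python sketch of how a crawler might check for that directive. The class and function names (`RobotsMetaParser`, `opts_out_of_caching`) are my own hypothetical choices, and the sketch only looks at the generic `robots` meta tag; it does not describe how any particular search engine actually does this.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collect directives from <meta name="robots" content="..."> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        # Attribute values can be None (e.g. <meta name>), so default to "".
        attr_map = {name.lower(): (value or "") for name, value in attrs}
        if attr_map.get("name", "").strip().lower() != "robots":
            return
        # The content attribute is a comma-separated list of directives.
        for directive in attr_map.get("content", "").split(","):
            directive = directive.strip().lower()
            if directive:
                self.directives.add(directive)


def opts_out_of_caching(html):
    """Return True if the page carries a robots 'noarchive' directive."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return "noarchive" in parser.directives


if __name__ == "__main__":
    page = '<html><head><meta name="robots" content="noarchive"></head></html>'
    print(opts_out_of_caching(page))
```

Under an opt-out policy, a page without the tag is cached by default; an opt-in policy would flip that default and require an explicit signal before anything is cached.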

One of the arguments opponents of the ruling make is that a search engine cache is hardly a cache in the traditional sense, because pages remain cached long after they are changed or deleted from a web server. One of the posts by mcanerin gives an example of a web page that had been cached for almost 2 years (the example is no longer accessible). In mcanerin’s words: “That's not a cache, it's an archive.”

The fact that the cache is more like an archive is exactly what makes it beneficial to most Web users, and that’s why I think the court’s judgment was fair. Search engine caches are a huge public good. Caching is not evil. Yes, there may be a few scenarios where caching does not work to everyone’s benefit, but in most cases the good far outweighs the bad. As long as search engines provide a mechanism to keep crawled content from being cached and to remove cached content immediately if needed, there is no compelling reason to force search engines to use an opt-in policy. (Yes, I know it can be a real pain to manually remove entries from many search engines, but how often does anyone really need to do that?)

My research on digital preservation of websites relies heavily on the widespread use of search engine caching, and if caching turns from opt-out to opt-in, I am going to be in serious trouble, and so are users of Warrick. I’ll be keeping my eye on this…