Monday, December 11, 2006

Link rot in CACM

I was really surprised this afternoon to see an article in the Communications of the ACM that cited a cached URL from the MSN search engine in place of a missing web page:
The link to the U.S. Secret Service “Operation 4-1-9” report at www.secretservice.gov/alert419.htm appeared to be broken when this column was written, but cached copies remain available (for example, cc.msnscache.com/cache.aspx?q=3910458378891&lang=en-US).
Communications of the ACM, Volume 49, Number 12 (2006), Page 18.
The editors of CACM may not be aware of this, but search engines do not keep cached copies of pages for long. In fact, they will often purge from their caches any web page that returns a 404 when crawled. (You can read my paper on an experiment which illustrates this.) A cached page from a search engine should never be cited in academic writing. Instead of citing just one broken URL, CACM has now cited two.

If you are interested in learning more about link rot and how to combat it, check out this Wikipedia article that I contribute to.

Speaking of link rot, Baden Hughes of the University of Melbourne has recently published a study entitled “Link? Rot. URI Citation Durability in 10 Years of AusWeb Proceedings.” (Not sure why he used a question mark in his title.) He used many of the methodologies that I used when examining link rot in D-Lib Magazine last year. It turns out that AusWeb URLs have a much shorter half-life (6 years) than D-Lib article URLs (10 years). This is probably because authors of D-Lib articles are more aware of link rot than authors in other professions.
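
To get a feel for what a 6-year versus a 10-year half-life means, here is a rough sketch. It assumes links die off by simple exponential decay, which is the usual reading of "half-life" but not necessarily the exact model either study fitted. The function name and the 5-year age are made up for illustration.

```python
def surviving_fraction(age_years: float, half_life_years: float) -> float:
    """Fraction of cited URLs expected to still resolve after age_years.

    Assumes simple exponential decay: half of the links are gone after
    one half-life, three quarters after two half-lives, and so on.
    """
    return 0.5 ** (age_years / half_life_years)


# Half-lives quoted above; the 5-year article age is an arbitrary example.
for label, half_life in [("AusWeb", 6.0), ("D-Lib", 10.0)]:
    fraction = surviving_fraction(5.0, half_life)
    print(f"{label}: about {fraction:.0%} of links expected to work after 5 years")
```

Under that assumption, the gap between the two venues widens as articles age, since the shorter half-life compounds faster.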

I’m curious whether any other online magazine or journal can beat D-Lib’s 10-year half-life. I suspect JMIR articles could, since many of them use WebCite.