Questio Verum: April 2006

Saturday, April 22, 2006

iclnet.org is back up!

I just noticed that ICLnet is back up. This is the first website I recovered with Warrick whose owners used the recovered resources to bring it back on-line. The WWW 2006 conference website www2006.org was the first truly "lost" website that I reconstructed, but I don't know if they actually used the recovered files to bring the website back on-line. I noted a few weeks ago that another website had been recovered using Warrick, though I didn't assist with that recovery. There may still be others who have used Warrick without my knowledge to recover their sites. The Warrick download log shows quite a few downloads over the past few months.

WebCite

This week I was reading about link rot (Wikipedia) when I stumbled across a tool called WebCite for creating archives of web pages that are cited in an academic work. From their website:

WebCite® is an archiving system for webreferences (cited webpages and websites), which can be used by authors, editors, and publishers of scholarly papers and books, to ensure that cited webmaterial will remain available to readers in the future.

The canonical reference for WebCite appears to be a 2005 article in the Journal of Medical Internet Research (JMIR) entitled “Going, Going, Still There: Using the WebCite Service to Permanently Archive Cited Web Pages” by Eysenbach and Trudel. I also found a 2003 poster about the service, so it appears to have been around for a while.

WebCite is a great idea for combating link rot, although other archiving services like Spurl.net and Hanzo:web could also be used. The advantage of WebCite is that it also provides "impact statistics" on cited web pages.

I searched for “WebCite” in Google Scholar to see if it had been widely adopted, since I had never seen it used before. The only articles I could find that used the system were from JMIR, which I assume has a policy requiring WebCite for all of its articles. Here’s an example of a WebCite URL:

http://www.webcitation.org/426

I may use WebCite the next time I write an article. The only thing I’m concerned about is the long-term survival of WebCite. For several days I was unable to access their website. If their service is not entirely stable, it makes me wonder how long they’ll be around.

Thursday, April 13, 2006

Candidacy Exam is over

Last Friday (Apr 7) I completed my Candidacy Exam (proposal defense) without any difficulties. It was nice having my wife in attendance along with a few other friends/PhD students. Now that I’m ABD, there’s nothing keeping me from graduating except that measly dissertation. A few days before the exam I had my hopes dashed when one of my papers was rejected from a conference. I won’t go into another rant but will keep shopping the paper around until it finds a good home. Gotta keep positive: maybe Google will unexpectedly stumble across my work like they did with Ori Allon and offer me millions for Warrick. ;) In the meantime, Becky and I are heading to the OBX for some R&R.

Bill Arms, who served on my committee, gave a really great talk after my exam about the Cornell Web Library. He published an article about it in D-Lib Magazine (same issue as our paper on crawler activity) and has a more technical paper about it accepted to JCDL 06. The library is based on the collections from the Internet Archive, and it will let researchers perform Web research much more easily than is possible today. We may be able to use the library to perform some work with Warrick since it contains a number of lost websites.

URL Canonicalization

The term URL canonicalization (also frequently called URL normalization) refers to the process of transforming a URL into a standard form so that it is easier to tell whether two syntactically different URLs are equivalent. For example, the URL

http://www.Harding.edu/USER/dsteil/www/abc/../index.htm

could be normalized to produce the canonical URL:

http://www.harding.edu/user/dsteil/www/
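The steps applied in this example can be sketched in Python. This is only a minimal illustration, not how Warrick or any search engine actually does it: the function names, the DEFAULT_PAGES list, and the lowercase_path flag are all hypothetical. Lowercasing the path and dropping a default file name are assumptions that hold for some web servers but not others, so they can merge URLs that are really different.

```python
from urllib.parse import urlsplit, urlunsplit

# Hypothetical list of "default" file names; real servers may use others.
DEFAULT_PAGES = {"index.htm", "index.html", "default.htm"}


def remove_dot_segments(path):
    """Resolve "." and ".." segments in an absolute URL path."""
    segments = path.split("/")
    out = []
    for seg in segments:
        if seg == "..":
            # Never pop the empty segment that stands for the leading "/".
            if len(out) > 1:
                out.pop()
        elif seg != ".":
            out.append(seg)
    # A path ending in "." or ".." refers to a directory: keep a trailing "/".
    if segments[-1] in (".", ".."):
        out.append("")
    return "/".join(out)


def canonicalize(url, lowercase_path=True):
    """Return a canonical form of an absolute http(s) URL (sketch only)."""
    parts = urlsplit(url)
    scheme = parts.scheme.lower()
    # Host names are case-insensitive. Assumes no user:password@ prefix.
    netloc = parts.netloc.lower()
    path = remove_dot_segments(parts.path or "/")

    # Assumption: a default page is the same resource as its directory.
    head, _, tail = path.rpartition("/")
    if tail.lower() in DEFAULT_PAGES:
        path = head + "/"

    # Assumption: the server ignores case in paths (not true everywhere).
    if lowercase_path:
        path = path.lower()

    # Keep the query string; drop the fragment, which is never sent to a server.
    return urlunsplit((scheme, netloc, path, parts.query, ""))


print(canonicalize("http://www.Harding.edu/USER/dsteil/www/abc/../index.htm"))
# prints: http://www.harding.edu/user/dsteil/www/
```

Two URLs can then be treated as the same resource when their canonical forms are equal. Of the steps shown, only lowercasing the scheme and host and removing dot-segments are among the normalizations described in RFC 3986; the rest are heuristics.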

Search engines typically use different URL canonicalization policies, which makes it difficult for Warrick to tell if URL x from MSN is the same as URL y from Google. I’ve noted some peculiarities in my blog here, here and here. Matt Cutts at Google also discussed some of their canonicalization policies back in Jan 2006.

I have not found much work in the literature about URL canonicalization/normalization. RFC 3986 describes some standard normalization procedures. Pant et al. (2004) have a section about it in their chapter Crawling the Web from the book Web Dynamics. The first paper I’ve seen that deals with the issue head-on is by Sang Ho Lee et al. (2005), "On URL normalization".

I also checked Wikipedia and didn’t find anything about URL canonicalization. I decided to create a page about it and added a reference to it from the web crawler page. That was the first page I ever created on Wikipedia. Proverbs 25:2 – “It is the glory of God to conceal a thing; but the glory of kings is to search out a matter.” I’m no king, but I think God actually delights in our effort to learn about the great world He has created, and I appreciate Wikipedia providing a unique resource for us to consolidate and share our learning.

Sunday, April 02, 2006

Warrick reconstructs JaysRomanHistory.com

On Mar 31, I received an email from an individual who had successfully used Warrick to reconstruct JaysRomanHistory.com. The cool thing was that he was able to reconstruct the site without getting any help from me. I have reconstructed a couple of websites on behalf of others, but this is the third site (not the first, as I originally wrote) I am aware of where someone ran Warrick on their own to reconstruct a website.

A couple of quotes from their site:

Welcome! This website has been put back on the Internet by friends of Jay King, the original author, who died unexpectedly in 2005. We didn't want Jay's excellent reference site to be lost forever because it no longer had a home on the Internet at SJSU.

and

This site has been selected as a valuable educational Internet resource for Discovery Channel School.

Update on 4/30/06:

This week I received an email from Carter R., a webmaster who had used Warrick back in Jan 2006 to reconstruct two of his sites when the hard drive of his personally-maintained web server crashed:

http://dckickball.org/
http://cubanlinks.org/

He writes about using Warrick in his blog entries:

http://cubanlinks.org/blog/articles/2006/01/17/im-back-sort-of
http://cubanlinks.org/blog/articles/2006/01/20/getting-there

From Carter's blog:

One bright spot has been the recovery of my content via a tool called Warrick that uses various caching services and APIs from Google, Yahoo, the Internet Archive and others to reconstruct lost websites. So far, I’ve recovered posts for Cubanlinks going all the way back to its first post in 2002...

... I’ll describe the rebuilding process in more detail as I go along. The main point that I want to get across is this: BACK UP YOUR DATA!. The shock of losing a year’s worth of blood and sweat (regarding the code that powered DCKickball) still has yet to fully sink in. Don’t pull a Carter.

Although Warrick wasn't able to recover everything from Carter's websites, he seemed pretty thankful for what he was able to get back:

It’s unclear how many posts never got recovered with Warrick in the first place. Eyeballing it, I’d say I have at least 80% of my posts. And you know what? I’ll take that.

These sites are definitely the first to be reconstructed with Warrick without my help.

Saturday, April 01, 2006

Wikipedia, the study aid

In six days I’ll be defending my Ph.D. proposal. In our department, it’s called a Candidacy Exam, and it requires me to not only show that I am fully knowledgeable in my research area but also very knowledgeable in every area of computer science. The committee gets to ask me any question they want and expects a well-informed response.

In preparation for the exam, I have come to realize just how invaluable Wikipedia is as a study tool. I’ve also become somewhat addicted to updating resources in my field of study. I recently updated entries on digital libraries, OAI-PMH, and digital preservation. I also found a comprehensive section on the Churches of Christ; I’m a member of this church and learned some things I never even knew about it!

A recent study published in Nature showed that the accuracy of information found in Wikipedia is nearly equal to information found in Britannica. Britannica responded to the findings with many criticisms, pointing out that many articles in Wikipedia are poorly written and give too much attention to controversial scientific theories. I have personally found many Wikipedia articles to be very readable and somewhat complete, at least in the areas of computer science. Considering the accuracy level is not too far off from Britannica, I consider it invaluable for any student needing a crash course in the field.