Thursday, April 13, 2006

URL Canonicalization

The term URL canonicalization (also frequently called URL normalization) refers to the process that is performed on a URL to make it easier to tell if two syntactically different URLs are the same. For example, the URL

http://www.Harding.edu/USER/dsteil/www/abc/../index.htm

could be normalized to produce the canonical URL:

http://www.harding.edu/user/dsteil/www/

Search engines typically use different URL canonicalization policies which makes it difficult for Warrick to tell if URL x from MSN is the same as URL y from Google. I’ve noted some peculiarities in my blog here, here and here. Matt Cutts at Google also discussed some of their canonicalization policies back in Jan 2006.

I have not found much work in the literature about URL canonicalization/normalization. RFC 3986 has some standard normalization procedures that should be done. Pant et al. (2004) has a section about it in their chapter Crawling the Web from the book Web Dynamics. The first paper I’ve seen that deals with the issue head-on is by Sang Ho Lee et al. (2005) "On URL normalization".

I also checked Wikipedia and didn’t find anything about URL canonicalization. I decided to create a page about it and added a reference to it from the web crawler page. That was the first page I ever created on Wikipedia. Proverbs 25:2 – “It is the glory of God to conceal a thing; but the glory of kings is to search out a matter.” I’m no king, but I think God actually delights in our effort to learn about the great world He has created, and I appreciate Wikipedia providing a unique resource for us to consolidate and share our learning.