Tuesday, January 10, 2006

Case Insensitive Crawling

What should a Web crawler do when it is crawling a website that is housed on a Windows web server and it comes across the following URLs:



A very smart crawler would recognize that these URLs point to the same resource by comparing the ETags or even better, by examining the “Server:” HTTP header in the response. If the “Server:” contains

Server: Microsoft-IIS/5.1


Server: Apache/2.0.45 (Win32)

then it would know that the URLs refer to the same resource since Windows uses a case insensitive filesystem. If *nix is the web server’s OS then it can be safely assumed that the URLs are case sensitive. Other operating systems like Darwin (the open source UNIX-based foundation of Mac OS X) may use HFS+ which is a case insensitive filesystem. A danger of using such a filesystem with Apache’s directory protection is discussed here. Mac OS X could also use UFS which is case sensitive, and therefore a crawler cannot make a general rule for this OS to ignore case sensitive URLs.

Finding URLs in Google, MSN, and Yahoo from case insensitive websites is problematic. Consider the following URL:


Searching for this URL in Google (using the “info:” parameter) will return back a “not found” page. But if the URL


is searched for, it will be found. Yahoo performs similarly. It will find the all-lowercase version of the URL but not the mixed-case version.

MSN takes the most flexible approach. All “url:” queries are performed in a case insensitive manner. They appear to take this strategy regardless of the website’s OS since a search for




are both found by MSN even though the first URL is not valid (Unix filesystem). The disadvantage of this approach is what happens when bar.html and BAR.html are 2 different files on the web server. Would MSN only index one of the files?

The Internet Archive, like Google and Yahoo, is pinicky about case. The following URL is found:


but this is not:


Update 3/30/06:

If you found this information interesting, you might want to check out my paper Evaluation of Crawling Policies for a Web-Repository Crawler which discusses these issues.