Tuesday, January 10, 2006

Case Insensitive Crawling

Suppose a web crawler is crawling a website hosted on a Windows web server and comes across the following two URLs. What should it do?

http://foo.org/bar.html
http://foo.org/BAR.html

A very smart crawler would recognize that these URLs point to the same resource by comparing the ETags or, even better, by examining the “Server:” HTTP header in the response. If the response contains

Server: Microsoft-IIS/5.1

or

Server: Apache/2.0.45 (Win32)

then it would know that the URLs refer to the same resource, since Windows uses a case insensitive filesystem. If the web server’s OS is *nix, then it can be safely assumed that the URLs are case sensitive. Other operating systems are harder to classify. Darwin (the open source UNIX-based foundation of Mac OS X) may use HFS+, which is case insensitive by default. A danger of using such a filesystem with Apache’s directory protection is discussed here. Mac OS X could also use UFS, which is case sensitive, so a crawler cannot adopt a general rule of treating URLs on this OS as case insensitive.
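The “Server:” check described above can be sketched in a few lines of Python. This is only a rough illustration, not a tested crawler component: the function names and the list of Windows markers are my own assumptions, and the marker list is certainly incomplete. Anything that does not look like Windows is left untouched, since (as with Mac OS X) the header alone does not settle the question.

```python
from urllib.parse import urlsplit, urlunsplit

# Assumption: illustrative substrings that suggest a Windows host.
# This list is hypothetical and not exhaustive.
WINDOWS_MARKERS = ("microsoft-iis", "(win32)", "(win64)")


def looks_like_windows_server(server_header):
    """Return True if the Server header suggests a Windows host.

    False means "no evidence of Windows", not "definitely case sensitive".
    """
    if not server_header:
        return False
    value = server_header.lower()
    return any(marker in value for marker in WINDOWS_MARKERS)


def canonical_url(url, server_header):
    """Lowercase the path only when the server looks like Windows."""
    parts = urlsplit(url)
    path = parts.path
    if looks_like_windows_server(server_header):
        path = path.lower()
    # The scheme and host name are case insensitive on every server.
    return urlunsplit(
        (parts.scheme.lower(), parts.netloc.lower(), path,
         parts.query, parts.fragment)
    )


print(canonical_url("http://foo.org/BAR.html", "Microsoft-IIS/5.1"))
print(canonical_url("http://foo.org/BAR.html", "Apache/2.0.45 (Unix)"))
```

With a canonical form like this, a crawler could store bar.html and BAR.html under one key on a Windows server while keeping them separate everywhere else.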

Looking up URLs from case insensitive websites in Google, MSN, and Yahoo is problematic. Consider the following URL:

http://www.harding.edu/user/fmccown/www/comp250/syllabus250.html

Searching for this URL in Google (using the “info:” parameter) returns a “not found” page. But if the URL

http://www.harding.edu/USER/fmccown/WWW/comp250/syllabus250.html

is searched for, it will be found. Yahoo is also case sensitive: it will find the all-lowercase version of the URL but not the mixed-case version.

MSN takes the most flexible approach. All “url:” queries are performed in a case insensitive manner. MSN appears to apply this strategy regardless of the website’s OS, since searches for

url:http://www.cs.odu.edu/~FMCCOwn/

and

url:http://www.cs.odu.edu/~fmccown/

both return results in MSN, even though the first URL is not valid (the site is hosted on a Unix filesystem). The disadvantage of this approach shows up when bar.html and BAR.html are two different files on the web server. Would MSN index only one of the files?
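To make the disadvantage concrete, here is a small Python sketch of a case insensitive URL index. I do not know how MSN actually stores URLs, so this is purely hypothetical; the function names are made up for illustration. It shows how folding case merges distinct URLs into one key, and how a crawler could at least detect when that happens.

```python
from collections import defaultdict


def group_by_folded_url(urls):
    """Group URLs under their lowercased form (hypothetical index layout)."""
    groups = defaultdict(set)
    for url in urls:
        groups[url.lower()].add(url)
    return groups


def case_collisions(urls):
    """Return only the keys under which more than one distinct URL landed."""
    return {
        key: sorted(variants)
        for key, variants in group_by_folded_url(urls).items()
        if len(variants) > 1
    }


crawled = [
    "http://foo.org/bar.html",
    "http://foo.org/BAR.html",
    "http://foo.org/index.html",
]
print(case_collisions(crawled))
```

On a case sensitive server, each collision reported here could be two different files, so an index keyed this way would have to keep every variant rather than just one.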

The Internet Archive, like Google and Yahoo, is picky about case. The following URL is found:

http://web.archive.org/web/*/www.harding.edu/USER/fmccown/WWW/comp250/syllabus250.html


but this is not:

http://web.archive.org/web/*/www.harding.edu/user/fmccown/www/comp250/syllabus250.html


Update 3/30/06:

If you found this information interesting, you might want to check out my paper “Evaluation of Crawling Policies for a Web-Repository Crawler,” which discusses these issues.
