What should a Web crawler do when it is crawling a website hosted on a Windows web server and it comes across the following two URLs?
http://foo.org/bar.html
http://foo.org/BAR.html
A very smart crawler would recognize that these URLs point to the same resource by comparing their ETags or, even better, by examining the “Server:” HTTP header in the response. If the “Server:” header contains
Server: Microsoft-IIS/5.1
or
Server: Apache/2.0.45 (Win32)
then it would know that the URLs refer to the same resource, since Windows uses a case insensitive filesystem. If the web server’s OS is *nix, it can be safely assumed that the URLs are case sensitive. Other operating systems are less clear-cut. Darwin (the open source UNIX-based foundation of Mac OS X) may use HFS+, which is a case insensitive filesystem. A danger of using such a filesystem with Apache’s directory protection is discussed here. Mac OS X could also use UFS, which is case sensitive, so a crawler cannot adopt a general rule of ignoring case in URLs for this OS.
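The check described above could be sketched roughly as follows. This is only a minimal illustration, not a tested crawler component: the function names and the list of header substrings are my own assumptions, and the hint list is hand-picked from the two example headers rather than exhaustive. Anything that does not look like Windows (including Mac OS X) is treated as case sensitive, since no general rule is possible there.

```python
from urllib.parse import urlsplit

# Assumption: a hand-picked, non-exhaustive list of substrings that suggest
# the web server is running on Windows (and so on a case insensitive filesystem).
WINDOWS_HINTS = ("microsoft-iis", "(win32)", "(win64)")


def likely_case_insensitive(server_header):
    """Guess from the Server: header whether URL paths ignore case."""
    value = (server_header or "").lower()
    return any(hint in value for hint in WINDOWS_HINTS)


def probably_same_resource(url_a, url_b, server_header, etag_a=None, etag_b=None):
    """Heuristically decide whether two URLs name the same resource."""
    a, b = urlsplit(url_a), urlsplit(url_b)
    # Scheme and host name never depend on case, so compare them lowercased.
    if (a.scheme.lower(), a.netloc.lower()) != (b.scheme.lower(), b.netloc.lower()):
        return False
    # Identical path and query: trivially the same resource.
    if (a.path, a.query) == (b.path, b.query):
        return True
    # Paths that differ by more than case are different resources.
    if a.path.lower() != b.path.lower() or a.query != b.query:
        return False
    # The paths differ only by case. Matching ETags are taken as evidence
    # that both responses came from the same file.
    if etag_a and etag_b and etag_a == etag_b:
        return True
    # Otherwise fall back to the Server: header heuristic.
    return likely_case_insensitive(server_header)


if __name__ == "__main__":
    print(probably_same_resource("http://foo.org/bar.html",
                                 "http://foo.org/BAR.html",
                                 "Microsoft-IIS/5.1"))
```

A crawler would still need to fetch at least one of the URLs to obtain the headers, so this only helps it avoid storing duplicates, not avoid the request.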
Finding URLs from case insensitive websites in Google, MSN, and Yahoo is problematic. Consider the following URL:
http://www.harding.edu/user/fmccown/www/comp250/syllabus250.html
Searching for this URL in Google (using the “info:” parameter) returns a “not found” page. But if the URL
http://www.harding.edu/USER/fmccown/WWW/comp250/syllabus250.html
is searched for, it will be found. Yahoo is likewise sensitive to case, but it will find the all-lowercase version of the URL and not the mixed-case version.
MSN takes the most flexible approach. All “url:” queries are performed in a case insensitive manner. MSN appears to take this strategy regardless of the website’s OS, since searches for
url:http://www.cs.odu.edu/~FMCCOwn/
and
url:http://www.cs.odu.edu/~fmccown/
both return a result from MSN even though the first URL is not valid (the server uses a Unix filesystem). The disadvantage of this approach shows up when bar.html and BAR.html are two different files on the web server. Would MSN index only one of the files?
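To make the bar.html/BAR.html concern concrete, here is a small sketch of what a case insensitive index does to a list of crawled URLs. It is not how MSN actually works (I have no knowledge of that). The function name `case_collisions` is hypothetical, and lowercasing the whole URL is a simplifying assumption.

```python
from collections import defaultdict


def case_collisions(urls):
    """Group URLs by their lowercased form and return the groups that collide.

    Assumption: the index key is simply the whole URL lowercased.
    """
    groups = defaultdict(set)
    for url in urls:
        groups[url.lower()].add(url)
    # Keep only keys that more than one distinct URL maps to.
    return {key: sorted(variants)
            for key, variants in groups.items()
            if len(variants) > 1}


if __name__ == "__main__":
    crawled = [
        "http://foo.org/bar.html",
        "http://foo.org/BAR.html",
        "http://foo.org/index.html",
    ]
    for key, variants in case_collisions(crawled).items():
        print(key, "<-", variants)
```

On a case sensitive server, each group reported here could hold genuinely different files, and an index keyed this way has room for only one of them.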
The Internet Archive, like Google and Yahoo, is picky about case. The following URL is found:
http://web.archive.org/web/*/www.harding.edu/USER/fmccown/WWW/comp250/syllabus250.html
but this is not:
http://web.archive.org/web/*/www.harding.edu/user/fmccown/www/comp250/syllabus250.html
Update 3/30/06:
If you found this information interesting, you might want to check out my paper “Evaluation of Crawling Policies for a Web-Repository Crawler,” which discusses these issues.