Monday, January 09, 2006

Missing index.html

When Google, MSN, and Yahoo index your website, a question arises: are these two URLs equivalent?

http://foo.org/dir/

and

http://foo.org/dir/index.html

For example, consider the URL http://www.cs.odu.edu/~fmccown/. The web server is configured to return the index.html file when this URL is requested, so the following URL accesses the same resource: http://www.cs.odu.edu/~fmccown/index.html

A crawler could discover that these resources are the same by doing a comparison of the content or simply comparing the ETag (if present).
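Here is a rough sketch of how that comparison might look. The Response class and the same_resource function are hypothetical names made up for illustration; a real crawler would fill in the headers and body from its own HTTP client. Matching ETags are treated only as a shortcut, and the fallback is a hash of the body.

```python
import hashlib
from dataclasses import dataclass, field


@dataclass
class Response:
    """Hypothetical stand-in for a fetched HTTP response."""
    body: bytes
    headers: dict = field(default_factory=dict)


def same_resource(a: Response, b: Response) -> bool:
    """Heuristic check that two responses carry the same resource."""
    etag_a = a.headers.get("ETag")
    etag_b = b.headers.get("ETag")
    # Cheap path: both responses have an ETag and the values match.
    if etag_a and etag_b and etag_a == etag_b:
        return True
    # Otherwise fall back to comparing a digest of the content.
    return hashlib.sha256(a.body).digest() == hashlib.sha256(b.body).digest()
```

Note that pages with dynamic content (a timestamp, a rotating ad) would fail the hash comparison even when they come from the same file, so this is a starting point rather than a complete test.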

The problem is that not all requests ending in ‘/’ are actually requesting the index.html file. The web server could be configured to return default.htm (IIS’s favorite), index.php, or any other file.

For example, the URL http://www.cs.odu.edu/~mln/phd/ is actually returning index.cgi. The URL http://www.cs.odu.edu/~mln/phd/index.html returns a 404 since there is no index.html file in the phd directory.

Google and Yahoo both say this URL is indexed when queried with

info:http://www.cs.odu.edu/~mln/phd/
http://search.yahoo.com/search?p=url%3Ahttp%3A%2F%2Fwww.cs.odu.edu%2F%7Emln%2Fphd%2F&ei=UTF-8&fr=FP-tab-web-t&fl=0&x=wrt/

and

info:http://www.cs.odu.edu/~mln/phd/index.html
http://search.yahoo.com/search?p=url%3Ahttp%3A%2F%2Fwww.cs.odu.edu%2F%7Emln%2Fphd%2Findex.html&ei=UTF-8&fr=FP-tab-web-t&fl=0&x=wrt

No link on the website points to index.html, so it seems that both search engines are treating these URLs as one and the same by default.
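One way to get the behavior observed here is to strip a trailing index.html before looking a URL up. The sketch below is only a guess that is consistent with these query results, not a description of what Google or Yahoo actually do; collapse_index and its index_names parameter are hypothetical.

```python
from urllib.parse import urlsplit, urlunsplit


def collapse_index(url, index_names=("index.html",)):
    """Map .../dir/index.html to .../dir/ and leave other URLs unchanged."""
    parts = urlsplit(url)
    # Split the path into everything before the last '/' and the last segment.
    head, sep, last = parts.path.rpartition("/")
    if sep and last in index_names:
        parts = parts._replace(path=head + "/")
    return urlunsplit(parts)
```

With this rule, http://www.cs.odu.edu/~mln/phd/index.html and http://www.cs.odu.edu/~mln/phd/ collapse to the same key, which would explain why the index.html query reports a hit even though that URL returns a 404.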

MSN, on the other hand, treats these two URLs as separate resources. MSN indicates that the URL is indexed when queried with

http://search.msn.com/results.aspx?q=url%3Ahttp%3A%2F%2Fwww.cs.odu.edu%2F%7Emln%2Fphd%2F&FORM=QBRE/

but not with

http://search.msn.com/results.aspx?q=url%3Ahttp%3A%2F%2Fwww.cs.odu.edu%2F%7Emln%2Fphd%2Findex.html&FORM=QBRE

In this case, MSN is technically correct not to report the index.html URL as being indexed. The problem with MSN’s approach is that by treating the ‘/’ and index.html URLs separately, it must store two copies of a page whenever both URLs return the same content. The following queries actually return two different results:

http://search.msn.com/results.aspx?q=url%3Ahttp%3A%2F%2Fwww.cs.odu.edu%2F%7Efmccown%2F&FORM=QBNO/
http://search.msn.com/results.aspx?q=url%3Ahttp%3A%2F%2Fwww.cs.odu.edu%2F%7Efmccown%2Findex.html&FORM=QBRE

The only way to tell is to look at the cached URL MSN returns for each query. You’ll see the cached URLs refer to two different cached resources. Google and Yahoo return the same cached page regardless of which URL is queried.

Another problem with MSN's indexing strategy is that if index.html is left off a URL being queried, MSN may report that the URL is not indexed. For example, this query finds the URL:

url:http://www.cs.odu.edu/~cs772/index.html

but this one does not:

url:http://www.cs.odu.edu/~cs772/
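If you are checking whether MSN has a page indexed, a simple workaround is to try both forms of the URL. The sketch below only builds the list of URLs to try; query_variants is a hypothetical helper, and it assumes the default file is index.html, which (as noted above) is not true for every server.

```python
from urllib.parse import urlsplit, urlunsplit


def query_variants(url, index_name="index.html"):
    """Return the URL plus its '/' or index.html counterpart, if it has one."""
    parts = urlsplit(url)
    path = parts.path
    if path.endswith("/"):
        # Directory form: also try it with the index file appended.
        other = path + index_name
    elif path.endswith("/" + index_name):
        # Index form: also try the bare directory.
        other = path[: -len(index_name)]
    else:
        return [url]
    return [url, urlunsplit(parts._replace(path=other))]
```

Each URL in the returned list would then be submitted as its own url: query.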