Monday, March 27, 2006

Warrick is gaining traction

On Friday I submitted a paper to HYPERTEXT ’06 about our web-repository crawler Warrick. The paper focuses on the difficulties of crawling web repositories like the Internet Archive and Google, and it presents an evaluation of three crawling policies. I really like the conference’s flexible 8-12 page ACM format. A few weeks ago when I was submitting a paper to SIGIR ’06, it was a real pain to get everything to fit on only 8 pages.

On that note, I have updated Warrick with the most recent changes:
  • Warrick now uses the Google API for accessing cached pages.
  • Warrick issues lister queries (queries using the “site:” parameter) to Google using page scraping.
  • Yahoo API libraries were updated due to a March 2006 change.
  • Several minor bugs were corrected.
The biggest reason for integrating the Google API was that Warrick kept getting blacklisted by Google after 150 or so queries. Michael suggested I write up my experiences in a technical report. It is certainly something that will affect researchers from now on who want to test Google for just about anything.
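Warrick's actual throttling code isn't shown here, but the lesson generalizes: keep a per-day query budget comfortably under the point where blacklisting kicks in. A minimal sketch (the class name and cap are my own inventions, not Warrick's):

```python
import time

class QueryBudget:
    """Simple daily query budget: stop issuing requests once a per-day
    cap is reached. Illustrative only; the real Warrick logic and
    Google's actual limits may differ."""

    def __init__(self, max_per_day, clock=time.time):
        self.max_per_day = max_per_day
        self.clock = clock     # injectable for testing
        self.day = None
        self.count = 0

    def allow(self):
        today = int(self.clock() // 86400)
        if today != self.day:              # new day: reset the counter
            self.day = today
            self.count = 0
        if self.count >= self.max_per_day:
            return False                   # over budget: caller should stop
        self.count += 1
        return True
```

A caller would simply skip (or sleep past) any query for which `allow()` returns False.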

I also received several emails from the Internet Archive last week about Warrick. Apparently the guys that do backups for people with missing websites are excited about the tool, and IA will start informing users to use it:
If you are tech-savvy and know how to use command-line utilities, you can also refer to the Warrick tool here: http://www.cs.odu.edu/~fmccown/research/lazy/warrick.html and be sure to email the makers as they track who is using the tool. For this tool, a third party has put it together and we cannot guarantee the results. If you have questions about this tool, please refer your questions to the makers themselves.
One of the IA employees told me she has performed at least 200 recoveries for individuals in the past year. That’s a lot of people using “lazy preservation,” and it certainly supports the need for research in this area.

Monday, March 13, 2006

Accoona.com generates buzz

I read an article about Accoona.com this morning which suggested this up-and-coming search engine may give Google a run for their money. Accoona (a play on the Swahili phrase “hakuna matata,” meaning “no worries”) has been around for a few years. They threw a huge party in Dec 2004 with guest speaker Bill Clinton in order to generate some buzz, and since then they’ve worked on improving the artificial intelligence which runs their engine.

I did some playing around to see if Accoona could be used as a web repository for Warrick. I did a search for “frank mccown” to see what would appear:


When you place your cursor on top of each search result, it shows the domain name at the bottom of the browser (indicated by the arrows), but not the full path to the URL. Searching through the HTML, I found this very unsightly URL:

<a href="http://www10.overture.com/d/sr/?xargs=15KPjg1mdS55Xyl%5FruNL
bXU6TFhUBf14Prpo5wWsU8AJsIrSE%2DDqchceOZxIEnDvI%5Fq
lOM24HMv6oXLvD%5FkfKJFgeNRkXRFJWL2I%5FHzo4%2DQNmhWK
A2grY6n%2DfpzrZ%2DEH4Gaxi0eaC97Oe6Noz0N2gQ848VkRCY%
5FfluyNnryL5AS%5F6HhQtupVGCY8patrBxvPzydMgfFb9Uf8eRjS%2
DIcZlF3Zx4wZ2EDhVNdnrNoSxj4gfcS3Mq8OyeJJlItLPn2ZZ0Wd
S6ws8XZkSFpfliyS6m&yargs=www.cs.odu.edu" etc...>


The actual URL for the search result can’t be found in the HTML! Apparently they use Overture, a marketing company now owned by Yahoo, to redirect you to the proper URL. They are probably using this strategy for at least two reasons: 1) scraping their pages for search results won’t get you anything, and 2) they make money giving Overture information about what you click on when performing certain searches.

Unfortunately, it doesn’t look like Accoona is allowing access to cached versions of pages. Maybe they want to save bandwidth, maybe they don’t want to deal with possible lawsuits that caching might entangle them in.

There doesn’t appear to be any advanced search, and there aren’t any help pages describing how to use parameters in searches. I was able to make queries using the “site:” and “url:” parameters that are popular on other search engines.

Bottom line: not useful for Warrick just yet.

Thursday, March 09, 2006

Yahoo API has gone 404!

Just today I noticed that my Perl scripts (including Warrick) running the Yahoo Web Search API were consistently returning zero results. I tried running the scripts from different networks, but had the same problem. I scanned the Yahoo API web pages to see if they were alerting us to any problems, but found nothing.

So I thought maybe I should download the most recent version of the API just in case something had changed. I tried downloading yws-1.2.tgz and yws-1.2.zip but got the same error page (right).

I also tried sending an email to the Yahoo list (yws-search-web@yahoogroups.com), but I never received my own copy back as a member of the group. Something is really fishy…

Update on 3/13/06:

Yahoo finally admitted the error today. According to Toby Elliott, the Yahoo API guy, the error is due to a "default variable in the perl library," and a fix should be out soon. Several other developers complained to the Yahoo list, but all the emails were held until today. One noted that the error first appeared on Mar 3. That means it took 10 days for the API error to be acknowledged. Ouch! Was it because Toby was on vacation? ;)

Something else strange: a few days ago I did a Google search for "yahoo api" which returned two hits: one pointing to developer.yahoo.net and one to developer.yahoo.com. The sites appeared to be mirrors of each other. One big difference, though: the download page on the .com site returned 404s (see above) but the .net site's didn't. A Google search for "yahoo api" today shows only the .net listing. I guess I caught them in the middle of their migration from one TLD to another.

Update on 3/27/06:

I was able to find a working Perl library for Yahoo's Web Search API at CPAN.

Monday, March 06, 2006

Google Video

I did a lot of playing with Google Video yesterday. It’s a service that came online in 2005 and allows users to upload and view free video as well as commercial video. Their price point of $1.99 per video matches iTunes. Neither service offers “bundling” (yet) which would make it cheaper for users to purchase large collections of video (you have to pay $1.99 for each video in The Office Season 1 instead of, say, $9.99 for all six episodes.)

In Feb 2006, it was announced that the National Archives was teaming up with Google Video to allow “researchers and the general public to access a diverse collection of historic movies, documentaries and other films from the National Archives via Google Video as well as the National Archives website.” This is sure to give Google Video even more momentum and legitimacy.

My favorite thing about Google Video is that they tend to have many commercials available. I was able to find Apple’s classic 1984 Super Bowl commercial, Terry Tate Office Linebacker, and the Outpost.com commercial where they tattooed their insignia on toddlers (no actual toddlers were tattooed in the making of the video ;) ). They also have a cool feature which allows you to embed a Google Video directly into your web page and stream the video from Google.

One feature that is noticeably absent is a filtering mechanism. Although Google’s policy is to reject video that is pornographic or obscene, there are quite a few videos that are certainly as close to that line as possible. I would like to be able to filter out the trash although I realize that would create a lot more work for Google.

Wednesday, February 22, 2006

Google in Court about Thumbnail Images

Google has just lost a preliminary-injunction battle with Perfect 10, a website that sells nude photos for a monthly fee. Apparently Perfect 10 doesn’t like that their images appear in Google Images, because the thumbnail images are just the right size for handheld devices. The logic is that mobile users won’t pay for a subscription to Perfect 10 when they can get the same images for free in Google Images. The images appear in Google Images because they are harvested from the websites of copyright pirates who make illegal copies of the Perfect 10 images.

In my opinion, Google is not really to blame for the copyright infringers. Perfect 10 should be dealing with those people directly and leave Google alone. Although Google states that they will likely just need to remove Perfect 10 photos from Google Images, this could open the door for a lot more litigation. What about all the nude images from hundreds of other websites that are being pirated and placed on indexable websites? It would be wise for Google to have a general policy of keeping their index free of nude photos. In fact, Google would do well to completely separate itself from the pornography industry, an industry that profits on humanity’s baser instincts. (Yes, I do feel strongly on this issue.)

Tuesday, February 21, 2006

Ghostsites

Steve Baldwin has a really nice blog called Ghostsites which is dedicated to really old web pages and sites. A few entries caught my eye. Baldwin also pointed out John Maeda’s (MIT) side-by-side comparison of Yahoo’s and Google’s root pages from 1996-2005, as obtained from the Internet Archive. The comparison points out the simplicity of the Google page when compared to the complex Yahoo page.

hanzo:web = Internet Archive + Furl

I just discovered a new web archiving service called hanzo:web. It is similar to Furl except that they allow you to archive an entire website, not just a single web page. From the site:
We have observed important and beautiful websites emerge and disappear from the web everyday. We believe that archiving the content of all sites is a social necessity and needs to take place now! To this effect we intend to archive all sites, pages and links that come through Hanzo and allow free access to this collection forever.

This sounds a lot like the Internet Archive’s mission:
The Internet Archive is working to prevent the Internet — a new medium with major historical significance — and other "born-digital" materials from disappearing into the past. Collaborating with institutions including the Library of Congress and the Smithsonian, we are working to preserve a record for generations to come.

It's difficult to say when the Hanzo service came online. I found some archived content going back to Nov 2005. They’ll be talking about it at O'Reilly’s Emerging Technology conference in March and will be launching a public beta of their API.

The contents of the hanzo:web archive are apparently accessible to web crawlers. I couldn’t find a robots.txt file, and if you do a Google search for "warrick reconstruct" you'll get my Warrick page first and the archived version from hanzo:web second! (I don’t know who archived my page, but thanks!)

Notice the Internet Archive-looking URL:

http://hanzoweb.com/archive/20060104132758/http://www.cs.odu.edu/~fmccown/research/lazy/warrick.html
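The archive-style URL above embeds a 14-digit timestamp followed by the original URL, so pulling one apart is mechanical. A sketch (the function name is my own, and it assumes the exact `<host>/archive/<YYYYMMDDhhmmss>/<url>` layout shown above):

```python
def split_archive_url(archive_url, host="hanzoweb.com"):
    """Split a Hanzo/IA-style archive URL into its timestamp and the
    original URL it archived."""
    prefix = "http://%s/archive/" % host
    if not archive_url.startswith(prefix):
        raise ValueError("not an archive URL: %r" % archive_url)
    rest = archive_url[len(prefix):]
    # The first path segment is the timestamp; everything after it is
    # the original URL (which itself contains slashes).
    timestamp, original = rest.split("/", 1)
    return timestamp, original

ts, url = split_archive_url(
    "http://hanzoweb.com/archive/20060104132758/"
    "http://www.cs.odu.edu/~fmccown/research/lazy/warrick.html")
# ts == "20060104132758"
```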

At the bottom of the page they insert some JavaScript to redirect links back to Hanzo:

<SCRIPT language="Javascript">
<!--
// FILE ARCHIVED ON 20060104132758 AND RETRIEVED FROM
// HANZO:WEB ON 2006-02-21 18:47:35.004450.
// JAVASCRIPT APPENDED BY HANZO:WEB.
// ALL OTHER CONTENT MAY ALSO BE PROTECTED BY COPYRIGHT
var archiveurl = "http://www.hanzoweb.com/archive/20060104132758/";

function rewriteURL(aCollection, sProp) {
    var i = 0;
    for (i = 0; i < aCollection.length; i++)
        if (aCollection[i][sProp].indexOf("mailto:") == -1 &&
            aCollection[i][sProp].indexOf("javascript:") == -1)
            aCollection[i][sProp] = archiveurl + aCollection[i][sProp];
}

if (document.links)  rewriteURL(document.links, "href");
if (document.images) rewriteURL(document.images, "src");
if (document.embeds) rewriteURL(document.embeds, "src");

if (document.body && document.body.background)
    document.body.background = archiveurl + document.body.background;
//-->
</SCRIPT>

This is similar to what Internet Archive does with archived pages.

Hanzo allows pages to be tagged. My Warrick page was tagged with “webarchiving”. Below is a screen shot when accessing my Warrick page from the search interface. This is using frames, so the metadata is shown in the upper frame and the page on the bottom.



Not only did they have this page already archived, they also had archived several other pages from my website. I can’t tell if there is a way to list all the pages they have archived. I can search for “fmccown” or “www.cs.odu.edu/~fmccown” using their search interface, and all that shows up is my Warrick page. I assume in the next few months they’ll be adding more info about how to find archived pages.

Friday, February 10, 2006

Some thoughts on robots.txt

The Robots Exclusion Protocol has been around since June 1994, but there is no official standards body or RFC for the protocol. That leaves others free to tinker with it and add their own bells and whistles.

There are numerous limitations to robots.txt that have been noted (see Martijn Koster’s article). A few things that are lacking: ability to specify how frequently server requests should be made, the ideal times to make automated requests, permissions to visit vs. index vs. cache (make available in search engine caches).

According to Matt Cutts, Google supports the “Allow:” directive and wildcards (*) which are not part of the standard. The Google Sitemap team even developed a tool that can be used to ensure compliance with their robots.txt non-standard standard. Matt went on to comment that Google does not support a time delay between requests because some webmasters use values that would only allow Google to crawl 15-20 URLs in a day. Yahoo and MSN support this feature using a “Crawl-delay: XX” directive.
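Standard robots.txt parsers of this era typically ignore these extensions, so a crawler that wants to honor “Allow:” or “Crawl-delay:” has to read them itself. A toy sketch (real parsers must also handle wildcards, multiple consecutive User-agent lines, and precedence rules):

```python
def parse_robots(text):
    """Collect Disallow plus the non-standard Allow and Crawl-delay
    directives per user-agent from a robots.txt body."""
    rules, agents = {}, []
    for line in text.splitlines():
        line = line.split('#', 1)[0].strip()   # drop comments
        if not line or ':' not in line:
            continue
        field, value = [s.strip() for s in line.split(':', 1)]
        field = field.lower()
        if field == 'user-agent':
            agents = [value]
            rules.setdefault(value, {'allow': [], 'disallow': [],
                                     'crawl-delay': None})
        elif field in ('allow', 'disallow'):
            for a in agents:
                rules[a][field].append(value)
        elif field == 'crawl-delay':
            for a in agents:
                rules[a]['crawl-delay'] = float(value)
    return rules
```

For a file containing `Crawl-delay: 20`, a polite crawler would then sleep 20 seconds between requests to that host.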

Well, I'm out of thoughts. :) Stay tuned…

Wednesday, February 08, 2006

Rank of Graduate Programs in CS

I just found this really cool resource at PhDs.org that helps you rank CS graduate programs:

http://www.phds.org/rankings/computer-science/get_weights

When I set “Program quality has improved recently” to HIGH, ODU appears number 25.

If I set “A SMALL percentage of recent Ph.D.s were granted to U.S. citizens and permanent residents” to HIGH, ODU shoots up to number 6! We certainly are an international school. :)

Thursday, February 02, 2006

Updated C#, VB.NET, Java Comparisons

Today I updated my C# comparison pages for Java and VB.NET:

Maintaining these pages is really time-consuming, but I get so much positive feedback that it makes it worth it. A look at the server logs shows that the C# vs. VB.NET page is especially popular. In any given month it’s typically the 4th most-requested URL on the harding.edu website.



When filtering for just pages produced by faculty members, the C# vs. VB.NET page is first by a factor of 4, and the Java 1.5 vs. C# page is around 4th place.



A shocker is my JavaScript vs. VBScript comparison page appearing 7th. I haven’t updated that thing in years, and who is using VBScript anyway? I guess it’s still getting attention from those die-hard ASP programmers. ;)

Monday, January 30, 2006

Google Changes the Cached Header Format

It appears that sometime in Jan, Google decided to change the format of the pages cached in their system depending on how the cached page was retrieved. For example, consider the page

http://englewood.k12.co.us/schools/clayton/music.html

If you search for the page in Google's index via

info:http://englewood.k12.co.us/schools/clayton/music.html

and then click on the "Google's cache" link, you will see this page:



But if you try to access the cached version directly via the following URL:

http://search.google.com/search?q=cache:http%3A%2F%2Fenglewood.k12.co.us%2Fschools%2Fclayton%2Fmusic.html

you will see this page:



Notice the font change and the difference in date format ('Jan 21, 2006' --> '21 Jan 2006').

I first noticed the change a few weeks ago. I've also noticed that Google is not always consistent with the heading change. It's possible that the format change is due to changes in different data centers.
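A tool comparing cached headers across data centers has to cope with both date styles. A sketch that normalizes them (the two format strings are my guesses at the exact header formats shown above):

```python
from datetime import datetime

def parse_cache_date(s):
    """Parse either Google cache-header date style
    ('Jan 21, 2006' or '21 Jan 2006') into a date object."""
    for fmt in ("%b %d, %Y", "%d %b %Y"):
        try:
            return datetime.strptime(s, fmt).date()
        except ValueError:
            pass
    raise ValueError("unrecognized date: %r" % s)
```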

Friday, January 27, 2006

Yahoo Reports URLs with No Slash

Yahoo does not properly report URLs that refer to directories, which should end with a trailing slash. For example, the query for "site:privacy.getnetwise.org" will yield the following URLs:

1) http://privacy.getnetwise.org/sharing/tools/ns6
2) http://privacy.getnetwise.org/browsing/tools/profiling

among others. Are ns6 and profiling directories or dynamic pages? You can’t tell by just looking at the URLs… Yahoo strips off the slash (`/`) from the end of URLs that are directories. The only way to tell is to actually visit the URL. URL 1 will return a 301 code (moved permanently) along with the correct URL:

http://privacy.getnetwise.org/sharing/tools/ns6/

URL 2 will respond with a 200 code because it is a dynamic page. This is no big deal for the user looking for search results, but it is a big deal for an application like Warrick which needs to know if a URL is pointing to a directory or not without actually visiting the URL.
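The probe itself is cheap: a HEAD request whose result is interpreted by a simple rule. A sketch of just the decision step (the function and its inputs are my own framing; `status` and `location` would come from an actual HEAD request):

```python
def canonical_url(url, status, location=None):
    """Interpret a HEAD response for a slash-less URL reported by
    Yahoo: a 301 whose Location merely appends '/' means the URL is
    really a directory."""
    if status == 301 and location == url + "/":
        return location            # directory: use the slashed form
    return url                     # 200 etc.: a page, keep as reported
```

The downside, as noted above, is that Warrick has to spend a network request per ambiguous URL just to recover information Yahoo discarded.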

I’ve contacted Yahoo about the "problem" but did not receive a response:
http://finance.groups.yahoo.com/group/yws-search-web/message/309

Google and MSN don’t have this problem.

Wednesday, January 25, 2006

40 Days of Yahoo Queries

After using the Yahoo API in my Warrick application, I began to wonder if it served different results than the public search interface at http://www.yahoo.com/. Warrick uses the API to discover if Yahoo has a URL that is cached or not.

From an earlier experiment, a colleague of mine had created over 100 PDFs that contained random English words and 3 images:



The PDF documents were placed on my website in a directory in May 2005, and links were created that pointed to the PDFs so they could be crawled by any search engine.

In Dec 2005 I used the Yahoo API to discover which of these PDFs had been indexed. I chose the first 100 URLs that were returned and then created a cron job to query the API and public search interface every morning at 3 a.m. Eastern Time. The queries used the "url:" parameter to determine if each URL was indexed or not.

For example, in order to determine if the URL

http://www.cs.odu.edu/~fmccown/lazyp/dayGroup4/page4-39.pdf

is indexed, Yahoo can be queried with

url:http://www.cs.odu.edu/~fmccown/lazyp/dayGroup4/page4-39.pdf

The public search interface will return a web page with a link to the cached version through the "View as HTML" link:



The Yahoo API will also return the cached URL (CacheUrl) for the same query.

Below are the results from my 40 days of querying. The green dots indicate that the URL is indexed but not cached. The blue dots indicate that the URL is cached. White dots indicate the URL is not indexed at all.



Notice that the public search interface and the API show 2 very different results. The red dots in the graph on the right show where the 2 responses did not agree with each other.

This table reports the percentage of URLs that were classified as either indexed (but not cached), cached, or not indexed:

                 Yahoo API    Yahoo Public Search Interface
Indexed only        3.7%              2.1%
Cached             89.2%             91.3%
Not indexed         7.1%              6.6%

11% of the time there was a disagreement. For 5 of the URLs the API and public interface disagreed at least 90% of the time!
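For the record, the disagreement figure is just the fraction of (URL, day) observations where the two classifications differ. A trivial sketch (the label strings are my own):

```python
def disagreement_rate(api, public):
    """Fraction of observations on which two classification sequences
    (e.g. 'cached' / 'indexed' / 'missing') differ."""
    assert len(api) == len(public)
    diffs = sum(1 for a, b in zip(api, public) if a != b)
    return diffs / float(len(api))
```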

The inconsistencies discovered between the results returned by the API and the public interface suggest that we might get slightly better results using the public interface, since it reports 2% more cached URLs. The downside is that any changes made to the results pages may cause our page-scraping code to break.

A further study using different file types (HTML, Word documents, PowerPoint docs, etc.) is needed to be more conclusive. Also it might be useful to use URLs from a variety of websites, not just from one since Yahoo could treat URLs from other sites differently.

Monday, January 23, 2006

Paper Rejection

I recently received a “sorry but your paper was not accepted” notice for one of my papers. This paper was probably the best one I’ve written to date. The conference that rejected my paper is a top-notch, international conference that is really competitive.

According to the rejection letter, they only accepted 11% of the papers (about 80) and therefore rejected around 700 papers. If each paper had on average just 2 authors (that’s probably a little low) then around 1400 people received the same rejection notice. If each paper took on average 100 hours to write (collecting data, preparing, writing, etc., and again that’s got to be too low of an estimate) then 70,000 hours have completely been wasted, not to mention the time required by the reviewers to read all these rejected papers.

Now these rejected individuals (most with PhDs) get to re-craft and re-package the same results for a new conference which has different requirements (fewer pages, new format, etc.), adding another 5 hours minimum per paper, which results in another 3500 hours spent on the same set of results. Meanwhile these re-formulated papers will compete with a new batch of papers that have been prepared by others. Also the results are getting stale. Unless the new paper gets accepted at the next conference, the cycle will continue.

This seems like a formula guaranteed to produce madness.

Wednesday, January 18, 2006

arcget is a little too late

Gordon Mohr from the Internet Archive told me about a program called arcget that essentially does the same thing as Warrick but only works with the Internet Archive. Aaron Swartz apparently wrote it during his Christmas break last Dec. It’s too bad he didn’t know about Warrick. That seems to be the problem in general with creating a new piece of software. How do you know if it already exists so you don't waste your time duplicating someone else's efforts? All you can do is search the Web with some carefully chosen words and see what pops up.

Wednesday, January 11, 2006

Search Engine Partnership Chart

I really like this animated chart showing how search engines feed others results:

http://www.ihelpyou.com/search-engine-chart.html

Hopefully it is kept up to date.

Tuesday, January 10, 2006

Case Insensitive Crawling

What should a Web crawler do when it is crawling a website that is housed on a Windows web server and it comes across the following URLs:

http://foo.org/bar.html
http://foo.org/BAR.html

?

A very smart crawler would recognize that these URLs point to the same resource by comparing the ETags or even better, by examining the “Server:” HTTP header in the response. If the “Server:” contains

Server: Microsoft-IIS/5.1

or

Server: Apache/2.0.45 (Win32)

then it would know that the URLs refer to the same resource since Windows uses a case insensitive filesystem. If *nix is the web server’s OS then it can be safely assumed that the URLs are case sensitive. Other operating systems like Darwin (the open source UNIX-based foundation of Mac OS X) may use HFS+ which is a case insensitive filesystem. A danger of using such a filesystem with Apache’s directory protection is discussed here. Mac OS X could also use UFS which is case sensitive, and therefore a crawler cannot make a general rule for this OS to ignore case sensitive URLs.

Finding URLs in Google, MSN, and Yahoo from case insensitive websites is problematic. Consider the following URL:

http://www.harding.edu/user/fmccown/www/comp250/syllabus250.html

Searching for this URL in Google (using the “info:” parameter) will return back a “not found” page. But if the URL

http://www.harding.edu/USER/fmccown/WWW/comp250/syllabus250.html

is searched for, it will be found. Yahoo performs similarly. It will find the all-lowercase version of the URL but not the mixed-case version.

MSN takes the most flexible approach. All “url:” queries are performed in a case insensitive manner. They appear to take this strategy regardless of the website’s OS since a search for

url:http://www.cs.odu.edu/~FMCCOwn/

and

url:http://www.cs.odu.edu/~fmccown/

are both found by MSN even though the first URL is not valid (Unix filesystem). The disadvantage of this approach is what happens when bar.html and BAR.html are 2 different files on the web server. Would MSN only index one of the files?

The Internet Archive, like Google and Yahoo, is picky about case. The following URL is found:

http://web.archive.org/web/*/www.harding.edu/USER/fmccown/WWW/comp250/syllabus250.html


but this is not:

http://web.archive.org/web/*/www.harding.edu/user/fmccown/www/comp250/syllabus250.html


Update 3/30/06:

If you found this information interesting, you might want to check out my paper Evaluation of Crawling Policies for a Web-Repository Crawler which discusses these issues.

Google Is Sorry

Google has been really confusing some of its users recently with their “Google is sorry” web page. The page reads like this:

We're sorry... but we can't process your request right now. A computer virus or spyware application is sending us automated requests, and it appears that your computer or network has been infected. We'll restore your access as quickly as possible, so try again soon. In the meantime, you might want to run a virus checker or spyware remover to make sure that your computer is free of viruses and other spurious software. We apologize for the inconvenience, and hope we'll see you again on Google.

It appears this page started appearing en masse around Nov-Dec of 2005. There are many discussions about it in on-line forums. Here are 2 of them that garnered a lot of attention:

  1. Webmasterworld.com
  2. Google groups

I ran into the error when modifying Warrick to use the “site:” parameter in order to better reconstruct a website. Unfortunately I had to drop the feature, and although I’m still making automated queries, I’ve yet to see the page again.

Google appears to be mum about the whole thing. The most credible explanation I found was here:

http://www.emailbattles.com/archive/battles/virus_aacdehdcic_ei/

Apparently it is a new "feature" of Google aimed at bandwidth-hogging SEOs who use automated queries with "site:" or "allinurl:" in them. Their AI is a little over-zealous and is hurting regular human users as well as users like me who perform very limited daily queries for no financial gain.

Update on 3/8/2006:

Google has caught me again! Although my scripts ran for a while without seeing the sorry page, they started getting caught again in early Feb. I conversed with someone at Google about it who basically said sorry but there is nothing they can do and that I should use their API.

The Google API is rather constrained for my purposes. I've noticed many API users venting their frustrations at the inconsistent results returned by the API when compared to the public search interface.

I finally decided to use a hybrid approach: page scraping when performing "site:" queries and the API to access cached pages. I haven't had any trouble from Google since.

Monday, January 09, 2006

MSN the first to index my blog

Congratulations to MSN for being the first of the big 4 search engines (Google, MSN, Yahoo, Ask Jeeves) to have indexed my blog! By examining the root level cached page, it looks like they crawled it around Jan 1-4. The only way any search engine can find the blog is to crawl my ODU website or by crawling any links that may exist to it from http://www.blogger.com/.

Missing index.html

When Google, MSN, and Yahoo are indexing your website, the question arises, are these 2 URLs equivalent:

http://foo.org/dir/

and

http://foo.org/dir/index.html

?

For example, consider the URL http://www.cs.odu.edu/~fmccown/. The web server is configured to return the index.html file when accessed. The following URL will access the same resource: http://www.cs.odu.edu/~fmccown/index.html

A crawler could discover that these resources are the same by doing a comparison of the content or simply comparing the ETag (if present).
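That comparison step might look like this sketch (the dict-based response shape is my own simplification of what a crawler would hold after two GETs):

```python
def same_resource(resp_a, resp_b):
    """Decide whether two responses (dicts with optional 'etag' and
    'body' keys) refer to the same resource: trust matching ETags
    when both are present, else fall back to comparing content."""
    etag_a, etag_b = resp_a.get("etag"), resp_b.get("etag")
    if etag_a and etag_b:
        return etag_a == etag_b
    return resp_a.get("body") == resp_b.get("body")
```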

The problem is that not all requests ending in ‘/’ are actually requesting the index.html file. The web server could be configured to return default.htm (IIS’s favorite), index.php, or any other file.

For example, the URL http://www.cs.odu.edu/~mln/phd/ is actually returning index.cgi. The URL http://www.cs.odu.edu/~mln/phd/index.html returns a 404 since there is no index.html file in the phd directory.

Google and Yahoo both say this URL is indexed when queried with

info:http://www.cs.odu.edu/~mln/phd/
http://search.yahoo.com/search?p=url%3Ahttp%3A%2F%2Fwww.cs.odu.edu%2F%7Emln%2Fphd%2F&ei=UTF-8&fr=FP-tab-web-t&fl=0&x=wrt/

and

info:http://www.cs.odu.edu/~mln/phd/index.html
http://search.yahoo.com/search?p=url%3Ahttp%3A%2F%2Fwww.cs.odu.edu%2F%7Emln%2Fphd%2Findex.html&ei=UTF-8&fr=FP-tab-web-t&fl=0&x=wrt

There is no URL pointing to index.html on the website, so it seems that both search engines are treating these URLs as one and the same by default.

MSN on the other hand thinks these 2 URLs are referring to separate resources. MSN indicates that the URL is indexed when queried with

http://search.msn.com/results.aspx?q=url%3Ahttp%3A%2F%2Fwww.cs.odu.edu%2F%7Emln%2Fphd%2F&FORM=QBRE/

but not with

http://search.msn.com/results.aspx?q=url%3Ahttp%3A%2F%2Fwww.cs.odu.edu%2F%7Emln%2Fphd%2Findex.html&FORM=QBRE

In this case, MSN is technically correct not to report the index.html URL as being indexed. The problem with MSN’s approach is that by treating the ‘/’ and index.html URLs separately, they are doubling the amount of storage required. The following queries actually return 2 different results:

http://search.msn.com/results.aspx?q=url%3Ahttp%3A%2F%2Fwww.cs.odu.edu%2F%7Efmccown%2F&FORM=QBNO/
http://search.msn.com/results.aspx?q=url%3Ahttp%3A%2F%2Fwww.cs.odu.edu%2F%7Efmccown%2Findex.html&FORM=QBRE

The only way to really tell is by looking at the cached URL returned by MSN. You’ll see the cached URLs are referring to 2 different cached resources. Google and Yahoo return the same cached page regardless of which URL is accessed.

Another problem with MSN's indexing strategy is that if the index.html is left off of a URL that is being queried, MSN may not think it is found. For example, this query results in a found URL:

url:http://www.cs.odu.edu/~cs772/index.html

but this one does not:

url:http://www.cs.odu.edu/~cs772/