Monday, January 30, 2006

Google Changes the Cached Header Format

It appears that sometime in January, Google decided to change the format of the pages cached in their system depending on how the cached page is retrieved. For example, consider the page

http://englewood.k12.co.us/schools/clayton/music.html

If you search for the page in Google's index via

info:http://englewood.k12.co.us/schools/clayton/music.html

and then click on the "Google's cache" link, you will see this page:

[Screenshot: the cached page as retrieved through the "Google's cache" link]

But if you try to access the cached version directly via the following URL:

http://search.google.com/search?q=cache:http%3A%2F%2Fenglewood.k12.co.us%2Fschools%2Fclayton%2Fmusic.html

you will see this page:

[Screenshot: the cached page as retrieved through the direct cache URL]

Notice the font change and the difference in date format ('Jan 21, 2006' --> '21 Jan 2006').

I first noticed the change a few weeks ago. I've also noticed that Google is not always consistent with the header change; the inconsistency may be due to different data centers serving different versions of the format.
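
For anyone who wants to detect which format a cached page is using, here is a rough Python sketch (not Warrick's actual code) that fetches the cached copy through the direct cache URL shown above and pulls out the crawl date with a regular expression for each of the two formats. The function name is made up, and the patterns only cover the two date styles I have seen so far.

import re
import urllib.parse
import urllib.request

# The two date styles observed in the cached page header.
DATE_PATTERNS = [
    re.compile(r'\b[A-Z][a-z]{2} \d{1,2}, \d{4}\b'),   # e.g. "Jan 21, 2006"
    re.compile(r'\b\d{1,2} [A-Z][a-z]{2} \d{4}\b'),    # e.g. "21 Jan 2006"
]

def cached_date(url):
    # Build the direct cache URL (same form as the example above).
    cache_url = ('http://search.google.com/search?q=cache:'
                 + urllib.parse.quote(url, safe=''))
    html = urllib.request.urlopen(cache_url).read().decode('latin-1', 'replace')
    for pattern in DATE_PATTERNS:
        match = pattern.search(html)
        if match:
            return match.group(0)   # return whichever date string was found
    return None

print(cached_date('http://englewood.k12.co.us/schools/clayton/music.html'))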

Friday, January 27, 2006

Yahoo Reports URLs with No Slash

Yahoo does not properly report URLs that point to directories, which should end with a trailing slash. For example, the query "site:privacy.getnetwise.org" will yield the following URLs:

1) http://privacy.getnetwise.org/sharing/tools/ns6
2) http://privacy.getnetwise.org/browsing/tools/profiling

among others. Are ns6 and profiling directories or dynamic pages? You can't tell just by looking at the URLs, because Yahoo strips the trailing slash from URLs that point to directories. The only way to tell is to actually visit the URL. URL 1 will return a 301 (Moved Permanently) response along with the correct URL:

http://privacy.getnetwise.org/sharing/tools/ns6/

URL 2 will respond with a 200 code because it is a dynamic page. This is no big deal for a user browsing search results, but it is a big deal for an application like Warrick, which needs to know whether a URL points to a directory without actually visiting it.
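
Since Warrick has to make this determination anyway, here is a minimal Python sketch of the check described above: issue a HEAD request for the slash-less URL and see whether the server answers with a 301 that simply appends the slash. The function name is illustrative only; this is not Warrick's actual code.

import http.client
from urllib.parse import urlsplit

def is_directory(url):
    # True if the server redirects the slash-less URL to the same path
    # with a trailing slash (the signature of a stripped directory URL).
    parts = urlsplit(url)
    conn = http.client.HTTPConnection(parts.netloc)
    conn.request('HEAD', parts.path or '/')
    response = conn.getresponse()
    location = response.getheader('Location') or ''
    conn.close()
    return response.status == 301 and location.endswith(parts.path + '/')

# is_directory('http://privacy.getnetwise.org/sharing/tools/ns6')        -> True  (301)
# is_directory('http://privacy.getnetwise.org/browsing/tools/profiling') -> False (200)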

I contacted Yahoo about the "problem" but did not receive a response:
http://finance.groups.yahoo.com/group/yws-search-web/message/309

Google and MSN don’t have this problem.

Wednesday, January 25, 2006

40 Days of Yahoo Queries

After using the Yahoo API in my Warrick application, I began to wonder if it served different results than the public search interface at http://www.yahoo.com/. Warrick uses the API to discover whether Yahoo has cached a given URL.

From an earlier experiment, a colleague of mine had created over 100 PDFs that contained random English words and 3 images:

[Image: one of the generated PDF documents]

The PDF documents were placed on my website in a directory in May 2005, and links were created that pointed to the PDFs so they could be crawled by any search engine.

In Dec 2005 I used the Yahoo API to discover which of these PDFs had been indexed. I chose the first 100 URLs that were returned and then created a cron job to query the API and public search interface every morning at 3 a.m. Eastern Time. The queries used the "url:" parameter to determine if each URL was indexed or not.

For example, in order to determine if the URL

http://www.cs.odu.edu/~fmccown/lazyp/dayGroup4/page4-39.pdf

is indexed, Yahoo can be queried with

url:http://www.cs.odu.edu/~fmccown/lazyp/dayGroup4/page4-39.pdf

The public search interface will return a web page with a link to the cached version through the "View as HTML" link:

[Screenshot: Yahoo result with its "View as HTML" link]

The Yahoo API will also return the cached URL (CacheUrl) for the same query.
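
If you'd like to reproduce the daily check, the sketch below shows roughly what each morning's query looks like against the Yahoo V1 REST service. The endpoint, parameter names, and XML element names are from memory and should be double-checked against Yahoo's documentation; YourAppIdHere is a placeholder application ID.

import urllib.parse
import urllib.request
import xml.etree.ElementTree as ET

ENDPOINT = 'http://search.yahooapis.com/WebSearchService/V1/webSearch'
APPID = 'YourAppIdHere'          # placeholder application ID
NS = '{urn:yahoo:srch}'          # namespace of the V1 response (assumed)

def check_url(url):
    # Return (indexed, cached) for a single URL using a url: query.
    params = urllib.parse.urlencode({'appid': APPID,
                                     'query': 'url:' + url,
                                     'results': 1})
    xml_text = urllib.request.urlopen(ENDPOINT + '?' + params).read()
    root = ET.fromstring(xml_text)
    results = root.findall(NS + 'Result')
    indexed = len(results) > 0
    cached = indexed and results[0].find(NS + 'Cache') is not None
    return indexed, cached

A crontab entry such as "0 3 * * * /path/to/check_urls.py" would run the check every morning at 3 a.m.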

Below are the results from my 40 days of querying. The green dots indicate that the URL is indexed but not cached. The blue dots indicate that the URL is cached. White dots indicate the URL is not indexed at all.

[Graphs: daily indexed/cached status of the 100 URLs over the 40 days, for the API and the public interface]

Notice that the public search interface and the API show two very different result sets. The red dots in the graph on the right show where the two responses did not agree with each other.

This table reports the percentage of URLs that were classified as either indexed (but not cached), cached, or not indexed:

                 Yahoo API    Yahoo Public Search Interface
Indexed only     3.7%         2.1%
Cached           89.2%        91.3%
Not indexed      7.1%         6.6%

There was a disagreement 11% of the time overall. For 5 of the URLs, the API and public interface disagreed at least 90% of the time!

The inconsistencies between the results returned by the API and the public interface suggest that we might get slightly better results using the public interface, since it reports 2% more cached URLs. The downside is that any changes made to the results pages may cause our page-scraping code to break.

A further study using different file types (HTML, Word documents, PowerPoint docs, etc.) is needed to be more conclusive. It might also be useful to use URLs from a variety of websites rather than just one, since Yahoo could treat URLs from other sites differently.

Monday, January 23, 2006

Paper Rejection

I recently received a "sorry, but your paper was not accepted" notice for one of my papers. This paper was probably the best one I've written to date. The conference that rejected it is a top-notch international conference that is extremely competitive.

According to the rejection letter, they accepted only 11% of the papers (about 80) and therefore rejected around 700. If each paper had on average just 2 authors (that's probably a little low), then around 1400 people received the same rejection notice. If each paper took on average 100 hours to produce (collecting data, preparing, writing, etc., and again that's got to be too low an estimate), then 70,000 hours have been completely wasted, not to mention the time required by the reviewers to read all these rejected papers.

Now these rejected individuals (most with PhDs) get to re-craft and re-package the same results for a new conference with different requirements (fewer pages, new format, etc.), adding a minimum of another 5 hours per paper, which results in another 3500 hours spent on the same set of results. Meanwhile these reformulated papers will compete with a new batch of papers prepared by others, and the results are getting stale. Unless the new paper gets accepted at the next conference, the cycle will continue.

This seems like a formula guaranteed to produce madness.

Wednesday, January 18, 2006

arcget is a little too late

Gordon Mohr from the Internet Archive told me about a program called arcget that essentially does the same thing as Warrick but only works with the Internet Archive. Aaron Swartz apparently wrote it over his Christmas break last Dec. It's too bad he didn't know about Warrick. That seems to be the general problem with creating a new piece of software: how do you know whether it already exists, so you don't waste your time duplicating someone else's efforts? All you can do is search the Web with some carefully chosen words and see what pops up.

Wednesday, January 11, 2006

Search Engine Partnership Chart

I really like this animated chart showing how search engines feed results to one another:

http://www.ihelpyou.com/search-engine-chart.html

Hopefully it is kept up to date.

Tuesday, January 10, 2006

Case Insensitive Crawling

What should a Web crawler do when it is crawling a website housed on a Windows web server and comes across the following URLs:

http://foo.org/bar.html
http://foo.org/BAR.html

?

A very smart crawler would recognize that these URLs point to the same resource by comparing the ETags or, even better, by examining the "Server:" HTTP header in the response. If the "Server:" header contains

Server: Microsoft-IIS/5.1

or

Server: Apache/2.0.45 (Win32)

then it would know that the URLs refer to the same resource, since Windows uses a case-insensitive filesystem. If the web server's OS is *nix, it can safely be assumed that the URLs are case-sensitive. Other operating systems like Darwin (the open source UNIX-based foundation of Mac OS X) may use HFS+, which is a case-insensitive filesystem. A danger of using such a filesystem with Apache's directory protection is discussed here. Mac OS X could also use UFS, which is case-sensitive, so a crawler cannot make a general rule about case sensitivity for this OS.
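
Here is a small Python sketch of the heuristic just described: peek at the "Server:" header and collapse case-variant URLs only when the server is clearly running on Windows. The function names are made up, and the substring tests are only a rough guess at what a real crawler would use.

import http.client
from urllib.parse import urlsplit

def server_header(url):
    # Fetch the Server: header with a HEAD request (empty string if absent).
    parts = urlsplit(url)
    conn = http.client.HTTPConnection(parts.netloc)
    conn.request('HEAD', parts.path or '/')
    header = conn.getresponse().getheader('Server') or ''
    conn.close()
    return header

def host_is_case_insensitive(url):
    # Assume case-insensitive paths only for obviously Windows servers;
    # for Unix, Mac OS X, or unknown servers, play it safe.
    server = server_header(url).lower()
    return 'microsoft-iis' in server or 'win32' in server

def same_resource(url_a, url_b):
    # Two URLs differing only in case collapse to one resource on a
    # case-insensitive host; an ETag comparison could confirm the guess.
    if url_a == url_b:
        return True
    return url_a.lower() == url_b.lower() and host_is_case_insensitive(url_a)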

Finding URLs from case-insensitive websites in Google, MSN, and Yahoo is problematic. Consider the following URL:

http://www.harding.edu/user/fmccown/www/comp250/syllabus250.html

Searching for this URL in Google (using the "info:" parameter) will return a "not found" page. But if the URL

http://www.harding.edu/USER/fmccown/WWW/comp250/syllabus250.html

is searched for, it will be found. Yahoo is similarly case-sensitive, though in its case it finds the all-lowercase version of the URL but not the mixed-case version.

MSN takes the most flexible approach: all "url:" queries are performed in a case-insensitive manner. They appear to take this strategy regardless of the website's OS, since searches for

url:http://www.cs.odu.edu/~FMCCOwn/

and

url:http://www.cs.odu.edu/~fmccown/

both return results, even though the first URL is not valid on the server's case-sensitive Unix filesystem. The disadvantage of this approach is what happens when bar.html and BAR.html are two different files on the web server. Would MSN only index one of the files?

The Internet Archive, like Google and Yahoo, is picky about case. The following URL is found:

http://web.archive.org/web/*/www.harding.edu/USER/fmccown/WWW/comp250/syllabus250.html


but this is not:

http://web.archive.org/web/*/www.harding.edu/user/fmccown/www/comp250/syllabus250.html


Update 3/30/06:

If you found this information interesting, you might want to check out my paper Evaluation of Crawling Policies for a Web-Repository Crawler, which discusses these issues.

Google Is Sorry

Google has been really confusing some of its users recently with its "Google is sorry" web page. The page reads like this:

We're sorry... but we can't process your request right now. A computer virus or spyware application is sending us automated requests, and it appears that your computer or network has been infected. We'll restore your access as quickly as possible, so try again soon. In the meantime, you might want to run a virus checker or spyware remover to make sure that your computer is free of viruses and other spurious software. We apologize for the inconvenience, and hope we'll see you again on Google.

It appears this page started appearing en masse around Nov-Dec of 2005. There are many discussions about it in on-line forums. Here are two that garnered a lot of attention:

  1. Webmasterworld.com
  2. Google groups

I ran into the error when modifying Warrick to use the “site:” parameter in order to better reconstruct a website. Unfortunately I had to drop the feature, and although I’m still making automated queries, I’ve yet to see the page again.

Google appears to be mum about the whole thing. The most credible explanation I found was here:

http://www.emailbattles.com/archive/battles/virus_aacdehdcic_ei/

Apparently it is a new "feature" of Google aimed at bandwidth-hogging SEOs who use automated queries containing "site:" or "allinurl:". Their AI is a little over-zealous, though, and ends up hurting regular human users as well as users like me who perform very limited daily queries for no financial gain.

Update on 3/8/2006:

Google has caught me again! Although my scripts ran for a while without seeing the sorry page, they started getting caught again in early Feb. I conversed with someone at Google about it who basically said sorry, there is nothing they can do, and that I should use their API.

The Google API is rather constrained for my purposes. I've noticed many API users venting their frustrations at the inconsistent results returned by the API when compared to the public search interface.

I finally decided to use a hybrid approach: page scraping when performing "site:" queries and the API to access cached pages. I haven't had any trouble from Google since.
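
For what it's worth, the dispatch logic is as simple as it sounds. The Python sketch below is only illustrative, and the two callables stand in for Warrick's real page-scraping and API code rather than anything Google provides.

def make_google_backend(scrape_results_page, fetch_cached_page_via_api):
    # Route 'site:' listings to the scraped public interface and cached-page
    # retrieval to the API (the two functions are supplied by the caller).
    def query(kind, argument):
        if kind == 'site':
            return scrape_results_page('site:' + argument)
        if kind == 'cache':
            return fetch_cached_page_via_api(argument)
        raise ValueError('unsupported query kind: ' + kind)
    return query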

Monday, January 09, 2006

MSN the first to index my blog

Congratulations to MSN for being the first of the big 4 search engines (Google, MSN, Yahoo, Ask Jeeves) to have indexed my blog! Judging from the root-level cached page, it looks like they crawled it around Jan 1-4. The only way any search engine can find the blog is to crawl my ODU website or to follow any links that may exist to it from http://www.blogger.com/.

Missing index.html

When Google, MSN, and Yahoo are indexing your website, the question arises: are these two URLs equivalent:

http://foo.org/dir/

and

http://foo.org/dir/index.html

?

For example, consider the URL http://www.cs.odu.edu/~fmccown/. The web server is configured to return the index.html file when accessed. The following URL will access the same resource: http://www.cs.odu.edu/~fmccown/index.html

A crawler could discover that these resources are the same by comparing their content or simply by comparing the ETags (if present).
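
A minimal Python sketch of that equivalence test might look like the following; it compares ETags when both responses carry one and otherwise falls back to hashing the bodies. The function name is illustrative only.

import hashlib
import urllib.error
import urllib.request

def same_resource(url_a, url_b):
    try:
        resp_a = urllib.request.urlopen(url_a)
        resp_b = urllib.request.urlopen(url_b)
    except urllib.error.HTTPError:
        return False    # e.g. /dir/index.html may 404 when /dir/ maps elsewhere
    etag_a = resp_a.headers.get('ETag')
    etag_b = resp_b.headers.get('ETag')
    if etag_a and etag_b:
        return etag_a == etag_b
    return (hashlib.md5(resp_a.read()).hexdigest()
            == hashlib.md5(resp_b.read()).hexdigest())

# same_resource('http://www.cs.odu.edu/~fmccown/',
#               'http://www.cs.odu.edu/~fmccown/index.html')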

The problem is that not all requests ending in ‘/’ are actually requesting the index.html file. The web server could be configured to return default.htm (IIS’s favorite), index.php, or any other file.

For example, the URL http://www.cs.odu.edu/~mln/phd/ is actually returning index.cgi. The URL http://www.cs.odu.edu/~mln/phd/index.html returns a 404 since there is no index.html file in the phd directory.

Google and Yahoo both say this URL is indexed when queried with

info:http://www.cs.odu.edu/~mln/phd/
http://search.yahoo.com/search?p=url%3Ahttp%3A%2F%2Fwww.cs.odu.edu%2F%7Emln%2Fphd%2F&ei=UTF-8&fr=FP-tab-web-t&fl=0&x=wrt/

and

info:http://www.cs.odu.edu/~mln/phd/index.html
http://search.yahoo.com/search?p=url%3Ahttp%3A%2F%2Fwww.cs.odu.edu%2F%7Emln%2Fphd%2Findex.html&ei=UTF-8&fr=FP-tab-web-t&fl=0&x=wrt

There is no URL on the website pointing to index.html, so it seems that both search engines treat these URLs as one and the same by default.

MSN, on the other hand, treats these two URLs as separate resources. MSN indicates that the URL is indexed when queried with

http://search.msn.com/results.aspx?q=url%3Ahttp%3A%2F%2Fwww.cs.odu.edu%2F%7Emln%2Fphd%2F&FORM=QBRE/

but not with

http://search.msn.com/results.aspx?q=url%3Ahttp%3A%2F%2Fwww.cs.odu.edu%2F%7Emln%2Fphd%2Findex.html&FORM=QBRE

In this case, MSN is technically correct not to report the index.html URL as being indexed. The problem with MSN's approach is that by treating the '/' and index.html URLs separately, they are doubling the amount of content they must store. The following queries actually return two different results:

http://search.msn.com/results.aspx?q=url%3Ahttp%3A%2F%2Fwww.cs.odu.edu%2F%7Efmccown%2F&FORM=QBNO/
http://search.msn.com/results.aspx?q=url%3Ahttp%3A%2F%2Fwww.cs.odu.edu%2F%7Efmccown%2Findex.html&FORM=QBRE

The only way to really tell is by looking at the cached URLs returned by MSN: you'll see they refer to two different cached resources. Google and Yahoo return the same cached page regardless of which URL is accessed.

Another problem with MSN's indexing strategy is that if index.html is left off of a URL being queried, MSN may report that it is not indexed (a simple workaround is sketched below). For example, this query results in a found URL:

url:http://www.cs.odu.edu/~cs772/index.html

but this one does not:

url:http://www.cs.odu.edu/~cs772/
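
The workaround sketched here just tries both spellings of a URL and counts the page as indexed if either query succeeds. The msn_url_query argument is a hypothetical callable standing in for whatever code actually issues the url: query; nothing below is MSN's API.

def both_forms(url):
    # Return the URL plus its index.html / trailing-slash counterpart.
    if url.endswith('/index.html'):
        return [url, url[:-len('index.html')]]
    if url.endswith('/'):
        return [url, url + 'index.html']
    return [url]

def indexed_by_msn(url, msn_url_query):
    # msn_url_query('url:...') should return True on a hit (hypothetical helper).
    return any(msn_url_query('url:' + form) for form in both_forms(url))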

Friday, January 06, 2006

Reconstructing Websites with Warrick

What happens when your hard drive crashes, the backups you meant to make are nowhere to be found, and your website has now disappeared from the Web? Or what happens when your web hosting company has a fire, and all their backups of your website go up in flames? When such a calamity occurs, an obvious place to look for a backup of your website is at the Internet Archive. Unfortunately they don’t have the resources to archive every website out there. A not so obvious place to look is in the caches that search engines like Google, MSN, and Yahoo make available.


My research focuses on recovering lost websites, and my research group has recently created a tool called Warrick which can reconstruct a website by pulling missing resources from the Internet Archive, Google, Yahoo, and MSN. We have published some of our results using Warrick in a technical report that you can view at arXiv.org.

Warrick is currently undergoing some modifications as we get ready to perform a new batch of website reconstructions. Hopefully I’ll have a stable version of Warrick available for download soon.

Update on 3/20/07:

Warrick has been made available (for quite some time) here and our initial experiments were formally published in Lazy Preservation: Reconstructing Websites by Crawling the Crawlers (WIDM 2006).

Java Only

Joel has posted a rant on schools that teach Java only and neglect C/C++.  I’m proud to say my alma mater is still teaching C++ and pointers in its intro classes.

Thursday, January 05, 2006

The Redundant 'www' URL Prefix

Many websites allow the user to access their site using "www.thesite.com" or just "thesite.com". For example, you can access Search Engine Watch via http://www.searchenginewatch.com/ or http://searchenginewatch.com/. When the "www" URL is used, the Search Engine Watch web server replies with an HTTP 301 (Moved Permanently) status code and the new URL http://searchenginewatch.com/, which should be used instead. When search engines crawl the site using the "www" URL, the 301 status code tells them to index the non-www URL instead.

Unfortunately, some websites that offer both URLs do not redirect one of them, so search engine crawlers may in fact index both forms. For example, the Otago Settlers Museum allows access via http://otago.settlers.museum/ and http://www.otago.settlers.museum/. To see a listing of all the URLs that point to this site, you can use

site:otago.settlers.museum

to query Google, MSN, and Yahoo and be shown URLs with and without the “www” prefix. It looks like the search engines are smart enough not to index the same resource pointed to by both URLs. In fact, you can query Google with

info:http://otago.settlers.museum/shippinglists.asp

and the URL www.otago.settlers.museum/shippinglists.asp is returned. MSN performs the same way, but Yahoo gets confused by the query and says the URL is not indexed. By the way, I did alert Yahoo to the problem, and I’ll amend this entry if it is fixed.

As a general rule, websites should be set up to use a 301 redirect to the non-www version of the URL. This would make things a lot simpler for crawlers that have difficulty knowing that two different URLs actually refer to the same resource. Matt Cutts noted recently in a blog entry about URL canonicalization that Google prefers this setup.
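
Checking whether a site follows this rule is easy to script. Here is a rough Python sketch that requests the "www" form of a host and verifies the answer is a 301 pointing at the bare domain; the function name is made up, and per the Search Engine Watch example above the call at the bottom should print True.

import http.client

def www_redirects_to_bare(domain):
    # Request the www form and look for a 301 whose Location drops the www.
    conn = http.client.HTTPConnection('www.' + domain)
    conn.request('HEAD', '/')
    response = conn.getresponse()
    location = response.getheader('Location') or ''
    conn.close()
    return response.status == 301 and '://' + domain + '/' in location

print(www_redirects_to_bare('searchenginewatch.com'))
# www_redirects_to_bare('otago.settlers.museum') would presumably print False.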

The confusion this problem causes for humans has been addressed before as well.

Wednesday, January 04, 2006

Pulling out of the search engine index

It looks like Craigslist has decided to block the crawling of its web site.

Webmasterworld.com also banned search engines from crawling their pages back in Nov 2005. It only took 2 days for all the pages to fall out of Google’s and MSN’s indexes. Interesting... not only do they ban all bots, they also use their robots.txt file to post blog entries.

Popular sites like these devote huge amounts of bandwidth to crawler traffic. This is why we are researching more efficient methods for search engines to index content from web servers using OAI-PMH, a protocol that is very popular in the digital library world. Michael L. Nelson (Old Dominion University) and others are building an Apache module called mod_oai that allows a search engine to ask a web server "what new URLs do you have?" and "what content has changed since yesterday?" Widespread adoption of mod_oai could decrease Web traffic immensely and allow sites with bandwidth limitations to jump back into the search engine game.
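
To give a flavor of what such a request looks like, here is a small Python sketch of an OAI-PMH ListRecords query asking for everything modified since a given date. The verb, metadataPrefix, and from parameters are standard OAI-PMH; the base URL is just a placeholder, since mod_oai's actual baseURL layout on a given server may differ.

import urllib.parse
import urllib.request

def list_changed_records(base_url, since):
    # Ask an OAI-PMH repository for records modified since `since` (YYYY-MM-DD).
    params = urllib.parse.urlencode({'verb': 'ListRecords',
                                     'metadataPrefix': 'oai_dc',
                                     'from': since})
    return urllib.request.urlopen(base_url + '?' + params).read()

# e.g. list_changed_records('http://www.example.com/modoai', '2006-01-03')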

Sunday, January 01, 2006

Comparison of the Google, MSN, and Yahoo APIs

All three major search engines have now released public APIs so users can more easily automate Web search queries without the need for page-scraping. Google was the first to release an API, in 2002. They required users to register for a license key which allowed them to make 1000 queries per day. Yahoo was next on the scene (early 2005), and in an effort to out-do Google, they upped the number of daily queries to 5000 per IP address, a much more flexible arrangement than Google's limit, which disregards the IP address. MSN finally came on board in late 2005 with a 10,000-query daily limit per IP address.

All three services have a message board or forum where users can communicate with each other (and hopefully a representative from the service providers). In my experience, Yahoo does the best job at monitoring the forum (actually an e-mail list) and giving feedback.



I have recently used the Yahoo and MSN APIs to develop Warrick, a tool for recovering websites that disappear due to a catastrophe. I was unable to use the Google API because of its restrictive nature (1000 daily queries). Had I used the API, users would have had to sign up for a Google license before running Warrick, and I didn't want every Warrick user to jump through that hoop. Yahoo and MSN's more flexible query limits made their APIs much easier to use. I still limit my daily Google queries to 1000 to be polite and avoid getting my queries ignored.

Below is a comparison of the Google, Yahoo, and MSN Web search APIs that I have compiled. This may be useful for someone who is considering using the APIs.

Underlying technology:
G: SOAP
Y: REST
M: SOAP

Number of queries available per day:
G: 1000 per 24 hours per user ID
Y: 5000 per 24 hours per application ID per IP address
M: 10,000 per 24 hours per application ID per IP address

Getting started (examples that are supplied directly by the search engines):
G: Examples in Java and .NET
Y: Examples in Perl, PHP, Python, Java, JavaScript, Flash
M: Examples in .NET

Access to title, description, and cached URL?
G: Yes
Y: Yes
M: Yes

Access to last updated/crawled date?
G: No
Y: Last-Modified date if present
M: No

Access to images:
G: No
Y: Yes
M: No

Maximum number of results per query:
G: 10 (ouch!)
Y: 100
M: 50

Maximum number of results per query that can be obtained by "paging" through the results:
G: 1000
Y: 1000
M: 250
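
Since the per-query and paging limits above are what matter most in practice, here is a rough Python sketch of paging through Yahoo results 100 at a time up to the 1000-result ceiling. As with the earlier sketch, the endpoint, the 'start' and 'results' parameters, and the XML element names are from memory and should be verified against Yahoo's documentation; YourAppIdHere is a placeholder.

import urllib.parse
import urllib.request
import xml.etree.ElementTree as ET

ENDPOINT = 'http://search.yahooapis.com/WebSearchService/V1/webSearch'
APPID = 'YourAppIdHere'          # placeholder application ID
NS = '{urn:yahoo:srch}'          # namespace of the V1 response (assumed)

def all_results(query, page_size=100, ceiling=1000):
    # Yield result URLs, paging until the ceiling or an empty page.
    for start in range(1, ceiling + 1, page_size):
        params = urllib.parse.urlencode({'appid': APPID,
                                         'query': query,
                                         'results': page_size,
                                         'start': start})
        xml_text = urllib.request.urlopen(ENDPOINT + '?' + params).read()
        results = ET.fromstring(xml_text).findall(NS + 'Result')
        if not results:
            break
        for result in results:
            yield result.findtext(NS + 'Url')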

The Google Story

I finished reading The Google Story yesterday and really enjoyed getting a good look at how Google got started. Sergey and Larry (the Google founders) are my age, and it's amazing to see what has become of their research from grad school at Stanford. The authors at times seemed a little too enamored with Google, but in general they provided some great insight. After reading the book, I feel almost stupid for not buying Google stock when it first became available!

I love the 20% rule and the fact that they often support projects that initially have no obvious monetary rewards for the company. There was also an informative chapter about Danny Sullivan and how he started SearchEngineWatch.com.

I was, however, troubled to discover that Google is making huge amounts of money selling pornographic ads in the AdSense program. Thankfully they do provide a SafeSearch mechanism that protects users from seeing pornography in their search results (although it is by no means perfect). It seems odd to me that a company that lives by the motto "Don't be evil" would profit from an industry that is so destructive to itself and its consumers. Google apparently believes it is morally wrong to advertise guns but not wrong to advertise pornography to minors. I'd really be interested in hearing an explanation from Sergey and Larry about this practice.

I wonder whether Stanford will grant Sergey and Larry their PhDs. Obviously their work far surpasses what anyone has ever done to earn a PhD, and from what I understand, they have completed everything but their dissertations.