Thursday, January 05, 2006

The Redundant 'www' URL Prefix

Many websites let users reach them at either "www.thesite.com" or just "thesite.com". For example, you can access Search Engine Watch via http://www.searchenginewatch.com/ or http://searchenginewatch.com/. When you request the “www” URL, the Search Engine Watch web server replies with an HTTP 301 (Moved Permanently) status code and the new URL, http://searchenginewatch.com/, which should be used instead. When search engines crawl the site using the “www” URL, the 301 status code tells them to index the non-www URL instead.
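To make the redirect concrete, here is a minimal sketch of a server that behaves this way, written as a Python WSGI application. It is only an illustration: the function name redirect_www is made up for this example, it assumes the front-end server passes the request's Host header through unchanged, and it says nothing about how Search Engine Watch actually implements its redirect.

```python
from urllib.parse import urlunsplit


def redirect_www(environ, start_response):
    """Hypothetical WSGI app: answer www.<host> requests with a 301 to <host>."""
    host = environ.get("HTTP_HOST", "")
    if host.lower().startswith("www."):
        # Rebuild the same URL without the leading "www." label.
        target = urlunsplit((
            environ.get("wsgi.url_scheme", "http"),
            host[4:],
            environ.get("PATH_INFO", "") or "/",
            environ.get("QUERY_STRING", ""),
            "",
        ))
        start_response(
            "301 Moved Permanently",
            [("Location", target), ("Content-Length", "0")],
        )
        return [b""]

    # Requests that already use the non-www host are served normally.
    body = b"served from the non-www host\n"
    start_response(
        "200 OK",
        [("Content-Type", "text/plain"), ("Content-Length", str(len(body)))],
    )
    return [body]
```

In practice most sites would configure this in the web server itself rather than in application code, but the response is the same: a 301 status line plus a Location header naming the non-www URL.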

Unfortunately, some websites that offer both URLs do not redirect one to the other, so search engine crawlers may in fact index both types of URLs. For example, the Otago Settlers Museum allows access via http://otago.settlers.museum/ and http://www.otago.settlers.museum/. To see a listing of all the URLs that point to this site, you can use

site:otago.settlers.museum

to query Google, MSN, and Yahoo; each returns URLs both with and without the “www” prefix. It looks like the search engines are smart enough not to index the same resource twice when both URLs point to it. In fact, you can query Google with

info:http://otago.settlers.museum/shippinglists.asp

and the URL www.otago.settlers.museum/shippinglists.asp is returned. MSN behaves the same way, but Yahoo gets confused by the query and says the URL is not indexed. By the way, I did alert Yahoo to the problem, and I’ll amend this entry if it is fixed.

As a general rule, websites should be set up to use a 301 redirect to the non-www version of the URL. This would make things a lot simpler for crawlers that have difficulty knowing that two different URLs actually refer to the same resource. Matt Cutts noted recently in a blog entry about URL canonicalization that Google prefers this setup.
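When a site does not redirect, a crawler has to work out for itself that the two URLs match. A rough sketch of that canonicalization step in Python might look like the following. The helper names strip_www and group_duplicates are invented for this example, and the code rests on a big assumption: that the www and non-www hosts always serve the same content, which a real crawler would have to verify. It also drops any user:password part of the URL for simplicity.

```python
from urllib.parse import urlsplit, urlunsplit


def strip_www(url):
    """Hypothetical helper: return the URL with a leading 'www.' removed from its host."""
    parts = urlsplit(url)
    host = parts.hostname or ""  # urlsplit lowercases the hostname
    if host.startswith("www."):
        host = host[4:]
    if parts.port is not None:
        host = f"{host}:{parts.port}"
    # Drop the fragment; it never reaches the server anyway.
    return urlunsplit((parts.scheme, host, parts.path or "/", parts.query, ""))


def group_duplicates(urls):
    """Group URLs that differ only by the 'www.' prefix under one canonical key."""
    groups = {}
    for url in urls:
        groups.setdefault(strip_www(url), []).append(url)
    return groups
```

Feeding both http://otago.settlers.museum/shippinglists.asp and http://www.otago.settlers.museum/shippinglists.asp to group_duplicates puts them under a single non-www key, which is roughly the guess a crawler is forced to make when the site itself gives no 301 hint.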

The confusion that this problem causes for humans has been addressed before: