Monday, January 29, 2007

Defusing the Googlebomb

A few days ago, Google made some changes to their ranking algorithms to reduce the practice of Googlebombing. A Google bomb is basically a prank to manipulate the ranking of pages in Google’s search engine. It involves getting a lot of people to put a link to a particular web page on their site with the anchor text they want associated with the web page.

For example, if I wanted a search for “basketball stud” to show my blog as the first result, I’d get as many people as I could to place a link on their website that looks like this:
<a href="">basketball stud</a>

Then, when Google crawls the Web and sees a large number of links that look like this, it begins to favor this page over the rest when users search for “basketball stud”.

One of the most famous Googlebombs involves a search for miserable failure. While this used to show George Bush’s web page first in the results, it now brings up more relevant results. Danny Sullivan has written a good article about this.

How did Google reduce the effects of Google bombs? They’re not giving particulars, but they have admitted it’s purely automated. My guess is they analyze several factors:
  1. When and where was the link first found? Possibly Google tracks the growth of particular links.
  2. Does the link make sense for the web page or website? A red flag might be raised when a website about hacking points to a government web page when none of the other links do.
  3. Is the target page actually "about" the anchor text? If the words "miserable failure" aren't on the target page, it could be a bomb.
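To make the third guess concrete, here is a minimal Python sketch. It is only an illustration of the idea, not Google's actual method (which they have not disclosed). The function names, the `min_links` threshold, and the sample page text and anchors are all assumptions made up for the example.

```python
import re
from collections import Counter

def words(text):
    """Lowercase a string and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())

def suspicious_anchors(page_text, inbound_anchors, min_links=3):
    """Return (phrase, count) pairs for anchor phrases that are used by at
    least min_links inbound links but whose words do not all appear on the
    target page. Hypothetical heuristic, not a known Google rule."""
    page_words = set(words(page_text))
    counts = Counter(" ".join(words(a)) for a in inbound_anchors)
    flagged = []
    for phrase, n in counts.items():
        if n >= min_links and not set(phrase.split()) <= page_words:
            flagged.append((phrase, n))
    return sorted(flagged)

# Made-up sample data: the page text and the anchor text of inbound links.
page = "Biography of the President of the United States"
anchors = (["miserable failure"] * 5
           + ["President biography"] * 4
           + ["white house"])
print(suspicious_anchors(page, anchors))
```

With this sample data, only the popular anchor phrase whose words never appear on the page gets flagged. A real system would presumably combine a signal like this with the link-growth and link-source checks from the first two guesses.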

If you’d really like to dig into this subject, here’s a master’s thesis on the topic.

Thursday, January 25, 2007

Wikipedia: nofollow and noMSedit

Some new news from the world of Wikipedia:
  • Minor: All external links from Wikipedia are now using the NOFOLLOW attribute. This attribute tells web crawlers like Google that the link has not been vetted, so it will not be used in their algorithms to artificially bolster the ranking of some pages. Wikipedia’s action will seriously reduce the amount of link spam that currently plagues many entries.

  • Major: Microsoft has attempted to hire Rick Jelliffe, chief technology officer of XML tools company Topologi Pty. Ltd., to “correct” Wikipedia entries on ODF (OpenDocument format) and OOXML (Microsoft Office Open XML). You can see Rick's original post about the offer here. Apparently Wikipedia is keeping Microsoft employees from making the edits themselves, so Microsoft thought a third party could update the entries that apparently shed a negative light on Microsoft’s format. This astroturfing blunder has created quite a few waves.
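For the nofollow item above, here is a small Python sketch of how a crawler might separate links it is willing to count from links marked with rel="nofollow". It uses only the standard library's html.parser. The class name and the sample HTML (including the example.org addresses) are made up for illustration, and real crawlers certainly do more than this.

```python
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Sorts the hrefs of <a> tags into followed and nofollowed lists."""

    def __init__(self):
        super().__init__()
        self.followed = []
        self.nofollowed = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attr = dict(attrs)
        href = attr.get("href")
        if href is None:
            return
        # rel may hold several space-separated tokens, e.g. "external nofollow".
        rel_tokens = (attr.get("rel") or "").lower().split()
        if "nofollow" in rel_tokens:
            self.nofollowed.append(href)
        else:
            self.followed.append(href)

# Made-up sample markup.
html = '''<p><a href="http://example.org/a" rel="nofollow">external</a>
<a href="/wiki/Main_Page">internal</a>
<a href="http://example.org/b" rel="external nofollow">another</a></p>'''

collector = LinkCollector()
collector.feed(html)
print("count toward ranking:", collector.followed)
print("ignore for ranking:", collector.nofollowed)
```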

Tuesday, January 23, 2007

Countdown to baby

I’ve made a new addition to the right side of my blog- a countdown until Ethan's expected delivery date on April 4. Maybe this should be labeled Countdown until life as we know it changes dramatically. I can’t wait until he’s here. Becky can’t wait until she can breathe again and bend down and touch her toes. At the same time, I’m a bit spooked because I have experiments to perform, papers to write, and a dissertation to compose. Although I thought I’d have it all done by the summer, it’s not looking that way anymore. But hey, having my boy here will be a great blessing.

By the way, what an amazing Colts/Pats game on Sunday! If I can’t have my boys in the Super Bowl, at least I can cheer for Manning. That guy's pretty good... if you like a 6-5, 230-pound quarterback with a laser-rocket arm.

Friday, January 19, 2007

No more searching for you: Google drops the SOAP

In case you were asleep at the helm like I was, Google has pulled the plug on their SOAP-based web search API. On Dec 5, 2006, Google stopped giving users new API keys. They claim the API service will continue to run, but without a method for obtaining new keys, it essentially becomes worthless (API keys can't be shared since they are tied to a specific individual's Google account, and I can't let you run my application unless you supply it with your own key).

Google has decided their AJAX search API is the wave of the future. But why the "odd move"? I think Jason Lefkowitz summed it up best:
Today, though, Google isn’t about search. It’s about displaying ads. And in that context, an open API makes no sense — the developer can reformat the search results, and even show them (gasp) without ads!

Hence the “AJAX API”, which forces you to take the ads along with the search results. You can’t really do much with it, but it does create a new place for Google to show ads on — your blog/site/Web app.
I don’t have a problem with Google focusing on their AJAX search API... I’m sure it’s very useful in many contexts, but I do have a problem with them abandoning their SOAP search. Not only is Google putting the smack down on the SEO business (one of their intended victims, in my opinion), they are hurting us web researchers who depend on automated methods of querying Google.

I can point to a huge stack of academic papers that, without an effective method of automatically querying Google, are un-reproducible (Google- do you really want everyone to go back to page-scraping?). And it’s really hurting my research: Warrick will not work for new users without API keys. I’ve spent lots of time writing wrappers around the SOAP API code, and now I’ll have to redo most of it when I find an effective method of accessing Google’s cache. Until then, you can kiss your lost website goodbye if Google is the only one who has cached it.

It sometimes appears that I have a love/hate relationship with Google. Yesterday I was singing its praises, today not so much. In honor of the SOAP API, I’ve put together a brief timeline for us all to reflect upon:
  • Pre 2002 - Page-scraping is the norm, and there is great frustration.
  • 2002 – Google launches the first search engine API, and there is great rejoicing.
  • 2002-2005 – Researchers use the API for all sorts of interesting experiments, SEOs do their best to reverse engineer PageRank, new services are built, books are written, and, despite many technical difficulties along the way, there is much satisfaction.
  • 2006 – Google tightens the lid on extra queries per key, and there is much displeasure.
  • Late 2006 – Google refuses to give new API keys, and there is much sadness and anger.
  • Late 2007 (My prediction) - Google’s SOAP API breaks, no one fixes it, and there is no surprise. RIP

Update on July 27, 2007:

Google has just released an academic API for researchers: University Research Program for Google Search. Now that's more like it.

Update on Sept 30, 2009:

Google has finally killed its SOAP Search API.

Thursday, January 18, 2007

Store your data in a search engine cache

I have to say it… Google is the best thing since indoor plumbing. This morning I was wondering if anyone has been writing about my program Warrick, so I did a quick search of “warrick mccown”, just to see what would pop up. On the second page of results, I found a link to a paper that was published in October 2006: Using Free Web Storage for Data Backup.

The paper was written by some researchers from Stony Brook University who have developed two backup systems: CrawlBackup for storing files in a search engine’s cache, and MailBackup for storing files in the mailboxes of Internet email providers. Their work is remarkably similar to ours, and it almost makes me wonder if our place is bugged.

This paper is the first to actually cite Warrick. The paper also cites an interesting blog posting from Dec 2005: How the Google Cache can save Your A$$. OK, not the best title in the world, but it’s the only pseudo-article I've found where someone has documented using the Google cache to recover a lost website. In this case the guy accidentally deleted 30 articles from his website and used Google’s cache to recover them. It was just a few months earlier that I had finished work on Warrick which could have automated the process for him (at least he only had to recover 30 pages!). He also used the Internet Archive to recover a client’s website a few years ago.

So I’m really glad to have found these related resources. What’s unfortunate is that finding related work is often much more difficult than a simple Google search (or even a Google Scholar search). Google may produce a few gold nuggets, but it also produces a lot of false positives: why is the third result a production chart for Shaun Alexander (I really don't need to be reminded of the Cowboys loss in the playoffs)? The word Warrick is only used once, and it’s in a drop-down list box! And lest we forget, Google does not have the entire Web indexed. If I really wanted to be diligent I'd also use MSN, Yahoo, or a metasearcher like Dogpile.

Anyway, I’m still hoping someday for a Google SuperScholar system that takes all my papers, notes, etc. and figures out what is most related on the Web and in every digital library in existence and sends me weekly updates with precise summaries of why the information found is relevant. Maybe it should be called Google ScholarHeaven.

Wednesday, January 17, 2007

Scratch is for kids!

MIT researchers have recently released a new version of Scratch, a graphical programming language targeted to children. Scratch allows you to create interactive programs that use animation and sound. I especially like the programming language (shown on the right) that allows you to drag and drop constructs to create a program. It sure beats moving a turtle around the screen (how I miss those days at computer camp... smile)!

Thursday, January 11, 2007

2007 is poised to be the year of spam. Although Bill Gates thought we’d have the problem licked by now, the problem is only getting worse. I’m now getting approximately 50-60 spam emails per day, and although my spam filter catches a lot of it, a new form of spam is regularly beating the filter: image spam. Spammers are now using attached images to relay their spammy messages. They work like a captcha for your filter- since your filter is good at reading text but lousy at reading images, there’s little you can do to stop well-designed image spam.

What really bugs me is when companies that spam claim they don’t. For example, I received the following spam about 10 times over the past few days:

Notice the clever text to the side of the image. They change that each time to keep my filter from figuring out this is spam. Now if you'll visit their website, you’ll see that they have a link that allows you to report spam:
Pharmacy operates a strict anti-spam policy. We do not tolerate unsolicited advertising messages. We will actively pursue anyone engaging in spamming activities! This includes email, icq, instant messengers, chat rooms, message boards, newsgroups or anywhere else where commercial postings are prohibited. We will take appropriate actions against spammers that will result in loss of services and accounts closure.
Sure they will... they ask for your name, email address, and phone number, just so they can spam you some more with unsolicited phone calls while you’re eating dinner!

Once you submit your information, they reply “Thank you for your patience.” I think these guys may be qualified for the ninth ring.

Tuesday, January 09, 2007

Use Wikipedia to make your computer smarter

Still on the Wikipedia kick... Researchers from Technion-Israel Institute of Technology are using Wikipedia to give computers context information and make connections between different words. See the article here. For example, when a spam filter encounters a word like “B12” and needs to determine if the email should be marked as spam, the filter currently doesn’t know that B12 is a vitamin (the subject of many spam emails) unless the email also uses the term vitamin. But by examining the Wikipedia article on B12, the spam filter could be smart and deduce that an email with B12 is trying to sell vitamins. The same information could have been obtained by searching for B12 using a search engine, but the results aren’t necessarily vetted. That’s the Wikipedia advantage.
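Here is a toy Python sketch of the B12 idea. It is not the Technion researchers' method, just an illustration of enriching an email's words with words from a related article before checking for spammy topics. The `ARTICLE_TEXT` dictionary stands in for text that would really be fetched from Wikipedia, and the `SPAM_TOPICS` set and the sample email are likewise made up.

```python
import re

# Hypothetical stand-in for article text pulled from Wikipedia.
ARTICLE_TEXT = {
    "b12": "Vitamin B12 is a water-soluble vitamin involved in metabolism.",
}

# Hypothetical list of topics the filter already treats as spammy.
SPAM_TOPICS = {"vitamin", "pharmacy"}

def tokens(text):
    """Return the set of lowercase alphanumeric words in text."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def expanded_tokens(email_text):
    """Add the words of any matching article to the email's own words."""
    found = tokens(email_text)
    expanded = set(found)
    for term in found:
        if term in ARTICLE_TEXT:
            expanded |= tokens(ARTICLE_TEXT[term])
    return expanded

def looks_like_spam(email_text):
    """True if the context-expanded email mentions a spammy topic."""
    return bool(expanded_tokens(email_text) & SPAM_TOPICS)

email = "Boost your energy with B12 today"
print("plain filter flags it:", bool(tokens(email) & SPAM_TOPICS))
print("with article context:", looks_like_spam(email))
```

The plain word match misses the email because it never says "vitamin", while the expanded version catches it through the article text.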

Monday, January 08, 2007

Reading lists from Wikipedia

Alexander D. Wissner-Gross, a physics Ph.D. student at Harvard, presented his paper this summer entitled Preparation of Topical Reading Lists from the Link Structure of Wikipedia at ICALT'06. Wissner-Gross shows how an algorithm based on PageRank can be used to generate background reading lists from Wikipedia. I especially like this paper because it is the solution to a real teaching problem that Wissner-Gross encountered when preparing to teach one of his courses: how can we automate the time-consuming process of generating a quality reading list for a class?
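To give a feel for the PageRank part, here is a minimal Python sketch of plain PageRank by power iteration over a tiny link graph. It does not reproduce the algorithm from the paper, which I have only summarized above. The link graph is invented for illustration, and the damping factor of 0.85 and the 50 iterations are the usual textbook defaults, assumed here.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Compute PageRank scores for a dict mapping page -> list of linked pages."""
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page starts each round with the "random jump" share.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # Split this page's rank evenly among the pages it links to.
                share = damping * rank[page] / len(targets)
                for t in targets:
                    new_rank[t] += share
            else:
                # A page with no outgoing links spreads its rank over all pages.
                share = damping * rank[page] / n
                for p in pages:
                    new_rank[p] += share
        rank = new_rank
    return rank

# Invented link graph between article titles, for illustration only.
links = {
    "Digital preservation": ["Digital obsolescence", "Web archiving"],
    "Web archiving": ["Digital obsolescence", "Internet Archive"],
    "Internet Archive": ["Digital obsolescence"],
    "Digital obsolescence": ["Digital preservation"],
}

ranks = pagerank(links)
reading_list = sorted(ranks, key=ranks.get, reverse=True)
for title in reading_list:
    print(f"{ranks[title]:.3f}  {title}")
```

Sorting articles by score gives a crude reading list: the articles that many other related articles point to float to the top.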

Update on 1/30/07:

Wissner-Gross emailed me this morning with the web address of the reading list engine:

I got some interesting results for Digital preservation. Although Digital obsolescence popped up first, some irrelevant results like Vanderbilt University and University of Virginia also popped up. A search for web crawling brought up 2003 as a result. I'm not sure if these lists would be more useful than if I looked directly at the See also section, but it's still an interesting idea.

Tuesday, January 02, 2007

Wiki my search

Be on the lookout... a wiki-inspired search engine called Wikiasari (no web address yet) is going to be launched early this year. Since Wikipedia founder Jimmy Wales is behind the project, it’s already starting to create some noise.

It sounds like a great idea: apply the wisdom of crowds to search engine results. Of course this is already what search engines are attempting to do when they track which search results you click on or use link analysis (how many and what types of links are pointing to a page) to determine what are the best results to a particular query.

The problem will be eliminating the rich-get-richer phenomenon on the Web which makes it difficult for new pages to rise to the top. You can imagine a new page about Britney Spears that is of high quality (can a page about Britney be high quality?), but it won’t be displayed to searchers since the top 10-20 results already have been voted to be the best.

And how do you get users to evaluate the relevance of results on the third or fourth set of results? Studies have shown users rarely go beyond the first page or two of results. Some very interesting problems indeed.

Monday, January 01, 2007

List of banished words for 2007

Lake Superior State University has once again posted its list of banished words for the new year. Last year they banned my nickname Dawg McCown, and now my friends can’t call me i-Frank or refer to me and my wife as Frecky… how disappointing.

Here are the words and phrases to be banished from the Queen’s English in 2007 for mis-use, over-use and general uselessness:
  • Gitmo
  • Combined celebrity names
  • Awesome
  • Gone/went missing
  • Pwn or pwned
  • Now playing in theaters
  • We're pregnant
  • Undocumented alien
  • Armed robbery/drug deal gone bad
  • Truthiness
  • Ask your doctor
  • Chipotle
  • i-anything
  • Search
  • Healthy food
  • Boasts
Happy new year, everyone!