Friday, November 03, 2006

Do the search engine APIs lie?

OK, the title of this post is a little strong. Search engine APIs don't intend to deceive anyone, but they typically do not return the same results that the rest of the world sees when using the public web interfaces.

Every day for the past 5 months I’ve been sending thousands of queries to the Google, MSN, and Yahoo on the Internets, using both the web user interface (WUI), the little box that everyone types their queries into, and the web search APIs that each of the search engines makes available to the public for free. There have been a lot of questions as to whether the APIs give the same results as the WUIs, and I’m going to be the first to provide a strong quantitative analysis of which APIs are the most synchronized with their WUIs.

To process the incredible amount of data I’ve been collecting, I’ve developed an elaborate set of Perl scripts that transform the raw collected data into tables that are then imported into MySQL. The scripts take several days to finish processing. I’ve also developed numerous R scripts that pull the data from MySQL and plot it as an array of graphs.

I’m currently working on writing up my findings for a conference. If you’d like a pre-print of my paper, I’d be happy to share it with you. Here’s a little teaser.

The graph above shows the daily Kendall tau distance between the top 100 search results obtained from Google’s WUI and API for the term carmen electra. The green line shows how the WUI results change every day, and the blue line shows how the API results change every day. If the results are exactly the same (including their ranking), the distance is 1, but if the results have nothing in common, the distance is 0. The red line shows the distance between the WUI and API results each day. You’ll notice that for the most part the WUI and API values don’t move in a synchronized way, and the WUI and API results are very dissimilar. Other popular search terms like stacy keibler, jessica simpson, and lindsay lohan exhibited similar patterns (although the WUI vs. API distance was closer to 0.8). When we examine search results for terms like nfl football or computational complexity, the WUI and API results are very synchronized, and the WUI vs. API distance is closer to 0.9. Maybe they purposefully discriminate against air-heads?
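For anyone who wants to try a comparison like this at home, here is a minimal Python sketch of a Kendall-tau-style score for two top-k result lists, scaled the same way as in the graph (1 for identical lists, 0 for lists with nothing in common). It is not the code behind the graph. The function name kendall_similarity, the handling of URLs that appear in only one list (treated as tied just past the end of the other list), and the normalization by the product of the list lengths are all my assumptions for illustration.

```python
from itertools import combinations


def kendall_similarity(list_a, list_b):
    """Kendall-tau-style similarity between two ranked lists.

    Returns 1.0 when the lists are identical (same items, same order)
    and 0.0 when they share no items. Assumes neither list is empty
    and that each list contains no duplicates.
    """
    if not list_a or not list_b:
        raise ValueError("both lists must be non-empty")

    rank_a = {item: pos for pos, item in enumerate(list_a)}
    rank_b = {item: pos for pos, item in enumerate(list_b)}

    # Assumption: an item missing from a list is ranked just past its
    # end, tied with every other item missing from that list.
    def position(ranks, length, item):
        return ranks.get(item, length)

    discordant = 0
    for i, j in combinations(set(rank_a) | set(rank_b), 2):
        diff_a = position(rank_a, len(list_a), i) - position(rank_a, len(list_a), j)
        diff_b = position(rank_b, len(list_b), i) - position(rank_b, len(list_b), j)
        # A pair is penalized only when the two lists order it in
        # strictly opposite ways; ties cost nothing.
        if diff_a * diff_b < 0:
            discordant += 1

    # Two lists with nothing in common produce exactly
    # len(list_a) * len(list_b) discordant pairs, so that is the scale.
    return 1.0 - discordant / (len(list_a) * len(list_b))


wui = ["url1", "url2", "url3", "url4"]
api = ["url2", "url1", "url3", "url9"]
print(kendall_similarity(wui, api))
```

Under these assumptions, two lists containing the same URLs in exactly reversed order still score above 0, because sharing every result counts for something; only lists with no common results bottom out at 0.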

This graph shows the decay of the search results for the term subroutine for all three search engines. To compute decay, I compared the results obtained on each day with the results from every day after it, using a normalized overlap measure. In other words, I computed the percentage of results shared between day 1 and day 2, day 1 and day 3, day 1 and day 4, and so on. Yahoo shows a strong decay line with a half-life of 30 days (by day 30, half of the results were gone). Google and MSN show decay lines that actually un-decay (if there is such a word): after several months of the results becoming more different, they start to return to their starting point.
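The decay computation can be sketched in a few lines of Python. This is only an illustration of the idea, not my actual scripts. The function names, the choice to normalize the overlap by the size of the earlier day's result set, and the averaging over every starting day are assumptions.

```python
from collections import defaultdict


def overlap(earlier, later):
    """Fraction of the earlier day's results still present later."""
    earlier_set = set(earlier)
    if not earlier_set:
        raise ValueError("earlier result list must be non-empty")
    return len(earlier_set & set(later)) / len(earlier_set)


def decay_curve(daily_results):
    """Map each gap in days to the mean overlap at that gap.

    daily_results is a list of result lists, one per day, in order.
    Every day is compared with every later day.
    """
    by_gap = defaultdict(list)
    for start in range(len(daily_results)):
        for later in range(start + 1, len(daily_results)):
            by_gap[later - start].append(
                overlap(daily_results[start], daily_results[later])
            )
    return {gap: sum(vals) / len(vals) for gap, vals in sorted(by_gap.items())}


def half_life(curve):
    """First gap at which mean overlap drops to 0.5 or below, else None."""
    for gap in sorted(curve):
        if curve[gap] <= 0.5:
            return gap
    return None


# Toy data: one result is replaced each day.
days = [
    ["a", "b", "c", "d"],
    ["a", "b", "c", "x"],
    ["a", "b", "x", "y"],
    ["a", "x", "y", "z"],
]
curve = decay_curve(days)
print(curve)
print(half_life(curve))
```

A curve that un-decays would show the overlap falling for a while and then climbing again at larger gaps, so a single half-life number would not describe it well.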

One last graph: how many times do the WUI and API agree when asked for the total number of results for a search term? For all three search engines, the answer is almost always zero! But if you look at the graph above, you’ll see that the MSN total results used to agree almost every time until day 58 (late July), when they changed something internally. Now about half of the time their WUI gives a larger number, and half of the time the API gives a larger number. By the way, the gap under day 107 was due to MSN invalidating our API license key. It took me 17 days before I replaced the key. Moral of the story: keep a close eye on your experiments!
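Tallying this kind of agreement is simple. Here is a rough Python sketch, assuming each observation is a (WUI total, API total) pair and that a missing value (like the days without a valid API key) is recorded as None. The function name and the data layout are made up for illustration.

```python
from collections import Counter


def compare_totals(pairs):
    """Tally how WUI and API total-result counts compare.

    pairs is an iterable of (wui_total, api_total); either value may
    be None when no result was collected for that query.
    """
    tally = Counter()
    for wui_total, api_total in pairs:
        if wui_total is None or api_total is None:
            tally["missing"] += 1
        elif wui_total == api_total:
            tally["agree"] += 1
        elif wui_total > api_total:
            tally["wui_larger"] += 1
        else:
            tally["api_larger"] += 1
    return tally


# Toy observations, not real data.
observations = [(1200, 1200), (5000, 4300), (800, 950), (300, None)]
print(compare_totals(observations))
```

Counting the missing category separately makes a gap like the one after day 107 show up in the numbers right away instead of 17 days later.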