Showing posts with label msn api. Show all posts

Tuesday, March 20, 2007

Search engine interfaces and their APIs - How synchronized are they?

A few months ago I gave a little teaser about some research I was doing comparing the results you receive from the search engine APIs of Google, Yahoo, and MSN with the results that you see when you use their web user interface (WUI). The WUI is a fancy term for the little search box that you enter your queries into.

Most API users assume that if they search for “march madness”, for example, the returned results will be equivalent to what they would see if they searched for “march madness” using the WUI. In practice, this rarely occurs.

This leads us to ask, how different are the search engine API results from the WUI results? Are the APIs serving off of older indexes? Smaller indexes? Which search engine offers the most synchronized interfaces?

I will be presenting the answer to these questions at this year’s ACM IEEE Joint Conference on Digital Libraries (JCDL) this June in a paper entitled Agreeing to Disagree: Search Engines and their Public Interfaces. I’ll also be presenting a summary of my findings as a poster at the World Wide Web conference in May. Detailed findings can be found here. If you attend either of these conferences, please come by and introduce yourself... I’d be happy to discuss my findings with you.

By the way, if you haven't heard already, Google's SOAP web search API has been "deprecated."

Friday, November 03, 2006

Do the search engine APIs lie?

OK, the title of this post is a little strong. Search engine APIs don't intend to deceive anyone, but they typically do not give the same result as what the rest of the world sees when using the public web interfaces.

Every day for the past 5 months I’ve been sending thousands of queries to the Google, MSN, and Yahoo on the Internets using the web user interface (WUI), the little box that everyone types their queries into, and using the web search APIs that each of the search engines makes available for free to the public. There have been a lot of questions as to whether the APIs give the same results as the WUIs, and I’m going to be the first to provide a strong quantitative analysis to see which APIs are the most synchronized with their WUIs.

In order to process the incredible amount of data I’ve been collecting, I’ve developed an elaborate set of Perl scripts that transform the raw collected data into tables that are then imported into MySQL. The scripts take several days to complete processing. I’ve also developed numerous R scripts that pull data from MySQL and plot it in an array of graphs.

I’m currently working on writing up my findings for a conference. If you’d like a pre-print of my paper, I’d be happy to share it with you. Here’s a little teaser.


The graph above shows the daily Kendall tau distance between the top 100 search results obtained from Google’s WUI and API for the term carmen electra. The green line shows how the WUI results change every day, and the blue line shows how the API results change every day. If the results are exactly the same (including their ranking), the distance is 1, but if the results have nothing in common, the distance is 0. The red line shows the distance between the WUI and API results each day. You’ll notice that for the most part the WUI and API values don’t move in a synchronized way, and the WUI and API results are very dissimilar. Other popular search terms like stacy keibler, jessica simpson, and lindsay lohan exhibited similar patterns (although the WUI vs API distance was closer to about 0.8). When we examine search results for terms like nfl football or computational complexity, the WUI and API results are very synchronized, and the WUI vs API distance is closer to 0.9. Maybe they purposefully discriminate against air-heads?
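
A score that behaves this way (1 for identical rankings, 0 for lists with nothing in common) can be sketched in Python. This is only an illustration, not the Perl and R code behind the graph. It assumes a Kendall tau variant for top-k lists in which a URL missing from one list is treated as ranked below everything in that list, and a pair of URLs that both appear in only one list carries no penalty. The function name and the normalization by the product of the list lengths are assumptions made for this sketch.

```python
from itertools import combinations


def kendall_tau_similarity(list_a, list_b):
    """Score two ranked result lists: 1.0 if identical, 0.0 if disjoint.

    Assumes each list holds unique URLs in rank order (best first).
    """
    if not list_a or not list_b:
        raise ValueError("both result lists must be non-empty")
    rank_a = {url: i for i, url in enumerate(list_a)}
    rank_b = {url: i for i, url in enumerate(list_b)}

    penalty = 0
    for x, y in combinations(sorted(set(rank_a) | set(rank_b)), 2):
        in_a = (x in rank_a, y in rank_a)
        in_b = (x in rank_b, y in rank_b)
        if all(in_a) and all(in_b):
            # Both URLs appear in both lists: penalize opposite orderings.
            if (rank_a[x] - rank_a[y]) * (rank_b[x] - rank_b[y]) < 0:
                penalty += 1
        elif all(in_a) and any(in_b):
            # Both in A, one in B: the one present in B implicitly outranks
            # the missing one, so A disagrees if it ranks them the other way.
            present, missing = (x, y) if in_b[0] else (y, x)
            if rank_a[present] > rank_a[missing]:
                penalty += 1
        elif all(in_b) and any(in_a):
            # Mirror image of the previous case.
            present, missing = (x, y) if in_a[0] else (y, x)
            if rank_b[present] > rank_b[missing]:
                penalty += 1
        elif all(in_a) or all(in_b):
            # Both URLs in one list and neither in the other: no penalty
            # (an assumption; other variants charge a partial penalty here).
            pass
        else:
            # Each URL appears in a different list only: always a disagreement.
            penalty += 1

    # Two disjoint lists reach the largest possible penalty, len(A) * len(B).
    return 1.0 - penalty / (len(list_a) * len(list_b))
```

For example, `kendall_tau_similarity(wui_top100, api_top100)` could be computed once per day to produce a line like the red one described above, where `wui_top100` and `api_top100` are hypothetical lists of result URLs.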



This graph shows the decay of the search results for the term subroutine for all three search engines. To compute decay, I compared the results obtained on each day with the results from every later day using a normalized overlap measure. In other words, I computed the percentage of results that were shared between the results obtained on day 1 and those obtained on day 2, 3, 4, etc. Yahoo shows a strong decay line with a half-life of 30 days (on day 30 half of the results were gone). Google and MSN show decay lines that actually un-decay (if there is such a word). After several months of the results becoming more different, the results start to return to their starting point.
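
The decay computation can be sketched roughly as follows. This is an illustration under stated assumptions, not the scripts used for the graph: the overlap is normalized by the size of the smaller result set, and the overlaps for a given gap in days are averaged over every starting day. The function names are hypothetical.

```python
def normalized_overlap(results_a, results_b):
    """Fraction of results shared by two result sets (0.0 to 1.0).

    Normalizing by the smaller set is an assumption of this sketch.
    """
    a, b = set(results_a), set(results_b)
    if not a or not b:
        return 0.0
    return len(a & b) / min(len(a), len(b))


def decay_curve(daily_results):
    """Average overlap for each gap of 1, 2, 3, ... days.

    daily_results is a list with one list of result URLs per day.
    Entry 0 of the returned list is the average overlap at a gap of 1 day.
    """
    n = len(daily_results)
    curve = []
    for lag in range(1, n):
        overlaps = [
            normalized_overlap(daily_results[i], daily_results[i + lag])
            for i in range(n - lag)
        ]
        curve.append(sum(overlaps) / len(overlaps))
    return curve


def half_life(curve):
    """First gap in days at which the average overlap is 0.5 or lower."""
    for lag, value in enumerate(curve, start=1):
        if value <= 0.5:
            return lag
    return None  # the results never decayed to half within the data
```

A curve that dips and later climbs back toward 1.0 would correspond to the "un-decay" pattern described above, so it is worth plotting the whole curve rather than only reporting the half-life.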



One last graph: how many times do the WUI and API agree when asked for the total number of results for a search term? For all three search engines, the answer is almost always zero! But if you look at the graph above, you’ll see that the MSN total results used to agree almost every time until day 58 (late July) when they changed something internally. Now about half of the time their WUI gives a larger number, and half of the time the API gives a larger number. By the way, the gap under day 107 was due to MSN invalidating our API license key. It took me 17 days before I replaced the key. Moral of the story: keep a close eye on your experiments!
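
A tally like this is simple to reproduce. The sketch below assumes the collected data has already been reduced to pairs of reported totals, one pair per query per day; the function name and the pair layout are hypothetical and not taken from the original scripts.

```python
from collections import Counter


def tally_total_agreement(daily_totals):
    """Count how often the WUI and API report the same total result count.

    daily_totals is an iterable of (wui_total, api_total) pairs.
    """
    tally = Counter()
    for wui_total, api_total in daily_totals:
        if wui_total == api_total:
            tally["agree"] += 1
        elif wui_total > api_total:
            tally["wui_larger"] += 1
        else:
            tally["api_larger"] += 1
    return tally
```

Running it once per day and per search engine would give the counts needed for a graph like the one described, and a day with no pairs at all would show up as a gap such as the one under day 107.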