A few months ago I gave a little teaser about some research I was doing comparing the results you receive from the search engine APIs of Google, Yahoo, and MSN with the results that you see when you use their web user interface (WUI). The WUI is a fancy term for the little search box that you enter your queries into.
Most API users think that if they search for “march madness”, for example, the returned results will be equivalent to what they would see if they searched for “march madness” using the WUI. In practice, this rarely occurs.
This leads us to ask, how different are the search engine API results from the WUI results? Are the APIs serving off of older indexes? Smaller indexes? Which search engine offers the most synchronized interfaces?
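To give a feel for how one might quantify that disagreement, here is a minimal sketch (my own illustration, not the methodology from the paper) that measures the overlap between the top-k URLs returned by an API and by the WUI for the same query. It assumes both result lists have already been collected somehow, and the URL normalization is deliberately crude.

    # Hypothetical sketch: how similar are the top-k results from an API
    # and from the WUI for the same query? Assumes both URL lists have
    # already been collected by some other means.

    def normalize(url):
        # Crude normalization so trivial differences don't count as mismatches.
        url = url.lower().rstrip('/')
        return url.replace('http://www.', 'http://', 1)

    def overlap(api_urls, wui_urls, k=10):
        # Fraction of the top-k WUI results that also appear in the top-k API results.
        api_top = {normalize(u) for u in api_urls[:k]}
        wui_top = {normalize(u) for u in wui_urls[:k]}
        if not wui_top:
            return 0.0
        return len(api_top & wui_top) / len(wui_top)

    # Made-up example lists for a "march madness" query:
    api_results = ['http://www.ncaa.com/marchmadness', 'http://espn.com/mm']
    wui_results = ['http://ncaa.com/marchmadness/', 'http://cbssports.com/mm']
    print(overlap(api_results, wui_results, k=2))   # 0.5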
I will be presenting the answer to these questions at this year’s ACM IEEE Joint Conference on Digital Libraries (JCDL) this June in a paper entitled Agreeing to Disagree: Search Engines and their Public Interfaces. I’ll also be presenting a summary of my findings as a poster at the World Wide Web conference in May. Detailed findings can be found here. If you attend either of these conferences, please come by and introduce yourself... I’d be happy to discuss my findings with you.
By the way, if you haven't heard already, Google's SOAP web search API has been "deprecated."
What about result freshness? Are results returned by, say, WUIs "fresher" than API results, or vice versa?
If you mean freshness to be the lag time between crawling a page and the page's last modified date (as discussed here), we did not examine the crawl dates in the cached page headers. Experience tells me the dates are equal, but I don't know for sure in all cases.
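To pin down that definition a bit, here's a tiny sketch (my own, with made-up dates, not anything from the study): the lag is just the difference between the crawl date reported in a cached copy's header and the live page's Last-Modified date.

    from datetime import datetime

    # Hypothetical example: freshness lag = crawl date (from the cached
    # page's header) minus the page's Last-Modified date. Dates are made up.
    last_modified = datetime(2007, 3, 1, 12, 0)   # from the live page's HTTP headers
    crawl_date    = datetime(2007, 3, 4, 8, 30)   # from the search engine's cached copy
    print(crawl_date - last_modified)             # 2 days, 20:30:00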
I did do an interesting study, though, that examines the freshness of the search engine caches; you may be interested in it.
OK, I'll definitely read your papers and the paper in that post.
BTW, can you recommend any good work on estimating search engine index sizes? I'm mostly interested in estimates of the total number of web sites (NOT the total number of web pages indexed), that is, how many web sites have at least one page indexed by a particular major web crawler?
I've seen several papers talking about the size of search engine indexes in terms of pages, but none in terms of websites. If you come across a paper like that, please let me know.
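Just to be clear about what such a "sites, not pages" number would mean, here's a trivial sketch (purely illustrative, with made-up URLs) that reduces a sample of indexed URLs to distinct hosts. The hard part, of course, is obtaining a representative sample of an engine's index in the first place.

    from urllib.parse import urlparse

    # Hypothetical sketch: given a sample of URLs known to be indexed,
    # count how many distinct web sites (hosts) they represent.
    def site(url):
        host = urlparse(url).netloc.lower()
        return host[4:] if host.startswith('www.') else host

    sample_urls = [
        'http://example.com/a.html',
        'http://example.com/b/c.html',
        'http://www.example.org/page.html',
    ]
    print(len({site(u) for u in sample_urls}), 'distinct sites in the sample')   # 2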
I see.
But, for example, what about the Internet Archive? (It looks like your group collaborates with archive.org, doesn't it?) Currently the archive.org front page says about 85 billion pages; do they have stats on the number of archived web sites? It would obviously be a cumulative number, but it would be better than nothing when no other estimates are around...
And concerning your work on search engine WUIs/APIs: it was a good read and very carefully done (personally I don't see a big problem in the WUI/API discrepancies, but at the same time I acknowledge it's important to know that WUI results may differ from API results). Anyhow, I don't think your experiments are enough to claim that the Google/Yahoo API indices are probably smaller than their corresponding WUI indices. The thing is that in all the experiments you studied just the first 100 results (provided by the WUI interfaces as well as by the APIs, am I correct?), and hence have no idea about the rest of the results. So according to your data, yes, the top results from the API and from the WUI differ from each other. However, that doesn't necessarily mean a difference in the corresponding index sizes. For instance, it looks to me like there might be two somewhat different ranking algorithms running on top of the same index. Such a difference could even be explained by the WUI returning sponsored results (the point being that adding sponsored results complicates the ranking process) while the API doesn't.
And finally, what is your preferred API? Which API (Google, Yahoo, MSN) would you recommend for research purposes? Is it necessary to use more than one API, or do the indices of the major search engines overlap enough that just one is OK?
My apologies for this long comment and thank you for yesterday's replies.
The Internet Archive is currently working on some stats for their web archive and will probably publish them soon. As soon as I see something, I'll definitely blog about it.
I was careful in the papers to say the indexes were probably smaller, not definitely. The data from the top 100 results wasn't what I used to make that conclusion... it was based on the estimated total results returned for the terms and the estimated total number of pages indexed from each site. Since these are just estimates, they can't be used conclusively. I suggested in the paper a better experiment that would allow a more conclusive statement.
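As a rough illustration of that kind of estimate-based comparison (not the actual procedure from the paper, and with entirely made-up numbers), one could collect the estimated total hit counts that the API and the WUI report for the same queries and look at the ratios; consistently low ratios would only suggest, never prove, a smaller index behind the API.

    # Hypothetical sketch: compare the estimated total-results counts reported
    # by an API and by the WUI for the same queries. The counts below are invented.
    hit_counts = {
        'march madness':   (1_200_000, 2_300_000),   # (API estimate, WUI estimate)
        'digital library': (850_000, 900_000),
        'web archiving':   (140_000, 310_000),
    }

    for query, (api_est, wui_est) in hit_counts.items():
        print(f'{query}: API/WUI estimate ratio = {api_est / wui_est:.2f}')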
Choosing an API depends on what you want to do. Google (arguably) has the widest coverage, but its API suffers from a lack of keys and the inconsistencies I presented in the paper. MSN is the most synchronized, yet its index is (arguably) smaller than Google's. There's a table in the paper that states which APIs are the most synchronized, and I'd base my decision on that (except stay away from Google when examining backlinks).
Whatever API you choose, you could just cite my paper and say you are aware of some inconsistencies, but it's the best you can do since search engines don't want you to scrape their WUIs. :-)