A few months ago I got a little excited when Google opened their AJAX API to non-JavaScript environments. Google now allows queries through a REST-based interface that returns search results as JSON. The purpose of this API is still to show unaltered results to your website's users, but I don't see anything in the Terms of Use that prevents the API from being used in an automated fashion (having a program regularly execute queries), especially for research purposes, as long as you aren't trying to make money (or prevent Google from making money) from the operation.
UPDATE: The AJAX web search API has been deprecated as of November 1, 2010. I do not know of a suitable replacement.
So, here's what I've learned about using the Google AJAX Search API with Java. I haven't found this information anywhere else on the Web in one spot, so I hope you'll find it useful.
Here's a Java program that queries Google three times. The first query is for the title of this blog (Questio Verum). The second query asks Google if the root page has been indexed, and the third query asks how many pages from this website are indexed. (Please forgive the poor formatting... Blogger thinks it knows better than I how I want my text indented. Argh.)
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;
import java.net.URLConnection;
import java.net.URLEncoder;
import org.json.JSONArray;  // JSON library from http://www.json.org/java/
import org.json.JSONObject;

public class GoogleQuery {

    // Put your website here
    private final String HTTP_REFERER = "http://www.example.com/";

    public GoogleQuery() {
        makeQuery("questio verum");
        makeQuery("info:http://frankmccown.blogspot.com/");
        makeQuery("site:frankmccown.blogspot.com");
    }

    private void makeQuery(String query) {
        System.out.println("\nQuerying for " + query);

        try {
            // Convert spaces to +, etc. to make a valid URL
            query = URLEncoder.encode(query, "UTF-8");

            URL url = new URL("http://ajax.googleapis.com/ajax/services/search/web?start=0&rsz=large&v=1.0&q=" + query);
            URLConnection connection = url.openConnection();
            connection.addRequestProperty("Referer", HTTP_REFERER);

            // Get the JSON response
            String line;
            StringBuilder builder = new StringBuilder();
            BufferedReader reader = new BufferedReader(
                    new InputStreamReader(connection.getInputStream()));
            while ((line = reader.readLine()) != null) {
                builder.append(line);
            }
            reader.close();

            String response = builder.toString();
            JSONObject json = new JSONObject(response);

            System.out.println("Total results = " +
                    json.getJSONObject("responseData")
                        .getJSONObject("cursor")
                        .getString("estimatedResultCount"));

            JSONArray ja = json.getJSONObject("responseData")
                               .getJSONArray("results");

            System.out.println("\nResults:");
            for (int i = 0; i < ja.length(); i++) {
                System.out.print((i + 1) + ". ");
                JSONObject j = ja.getJSONObject(i);
                System.out.println(j.getString("titleNoFormatting"));
                System.out.println(j.getString("url"));
            }
        }
        catch (Exception e) {
            System.err.println("Something went wrong...");
            e.printStackTrace();
        }
    }

    public static void main(String[] args) {
        new GoogleQuery();
    }
}
Note that this example does not use a key. Although it's suggested you use one, you don't have to. All that is required is that you send your website or the URL of the webpage making the query in the Referer header (the HTTP_REFERER constant in the code above).
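If you do sign up for a key, it can be passed as an extra argument on the request URL. Here's a minimal sketch (the key value is only a placeholder, and the key argument name is the one described in the API documentation):

// Replaces the URL line in makeQuery; the key below is a hypothetical placeholder
String apiKey = "YOUR-API-KEY";
URL url = new URL("http://ajax.googleapis.com/ajax/services/search/web" +
        "?start=0&rsz=large&v=1.0&key=" + apiKey + "&q=" + query);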
When you run this program, you will see the following output:
Querying for questio verum
Total results = 1320
Results:
1. Questio Verum
http://frankmccown.blogspot.com/
2. Questio Verum: URL Canonicalization
http://frankmccown.blogspot.com/2006/04/url-canonicalization.html
3. WikiAnswers - What does questio verum mean
http://wiki.answers.com/Q/What_does_questio_verum_mean
4. Amazon.com: Questio Verum "iracund"'s review of How to Get Happily ...
http://www.amazon.com/review/R3VRSYWW5EJZFH
5. Amazon.com: Profile for Questio Verum
http://www.amazon.com/gp/pdp/profile/A2Q6CLLQPXG55A
6. How and where to get Emerald? - Linux Forums
http://www.linuxforums.org/forum/ubuntu-help/119375-how-where-get-emerald.html
7. Lemme hit that wifi, baby! - Linux Forums
http://www.linuxforums.org/forum/coffee-lounge/122922-lemme-hit-wifi-baby.html
8. [SOLVED] lost in tv tuner hell... please help - Ubuntu Forums
http://ubuntuforums.org/showthread.php%3Fp%3D3802299
Querying for info:http://frankmccown.blogspot.com/
Total results = 1
Results:
1. Questio Verum
http://frankmccown.blogspot.com/
Querying for site:frankmccown.blogspot.com
Total results = 463
Results:
1. Questio Verum
http://frankmccown.blogspot.com/
2. Questio Verum: March 2006
http://frankmccown.blogspot.com/2006_03_01_archive.html
3. Questio Verum: December 2006
http://frankmccown.blogspot.com/2006_12_01_archive.html
4. Questio Verum: June 2006
http://frankmccown.blogspot.com/2006_06_01_archive.html
5. Questio Verum: October 2007
http://frankmccown.blogspot.com/2007_10_01_archive.html
6. Questio Verum: July 2007
http://frankmccown.blogspot.com/2007_07_01_archive.html
7. Questio Verum: April 2006
http://frankmccown.blogspot.com/2006_04_01_archive.html
8. Questio Verum: July 2006
http://frankmccown.blogspot.com/2006_07_01_archive.html
The program is only printing the title of each search result and its URL, but there are many other items you have access to. The partial JSON response looks something like this:
"GsearchResultClass": "GwebSearch",
"cacheUrl": "http://www.google.com/search?q=cache:Euh9Z1rDeXUJ:frankmccown.blogspot.com",
"content": "<b>Questio Verum<\/b>. The adventures of academia, or how I learned to stop worrying and love teacher evaluations.*. Saturday, June 07, 2008 <b>...<\/b>",
"title": "<b>Questio Verum<\/b>",
"titleNoFormatting": "Questio Verum",
"unescapedUrl": "http://frankmccown.blogspot.com/",
"url": "http://frankmccown.blogspot.com/",
"visibleUrl": "frankmccown.blogspot.com"
So, for example, you could display the result's cached URL (Google's copy of the web page) or the snippet (page content) by modifying the code in the example's for loop.
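For example, a sketch of that change inside the for loop, using the field names shown above:

JSONObject j = ja.getJSONObject(i);
System.out.println(j.getString("titleNoFormatting"));
System.out.println(j.getString("url"));
System.out.println(j.getString("cacheUrl"));  // Google's cached copy of the page
System.out.println(j.getString("content"));   // snippet of the page (contains <b> markup)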
You'll note that only 8 results are shown for the first and third queries. The AJAX API will only return either 8 or 4 results per request (switch rsz=large to rsz=small in the query string for 4). Currently there are no other sizes.
You can see additional results (page through the results) by changing start=0 in the query string to start=8 (page 2), start=16 (page 3), or start=24 (page 4). You cannot see anything past the first 32 results. In fact, setting start to any value larger than 24 will result in an org.json.JSONException being thrown. (See my update below.)
More info on the query string parameters is available in the API documentation.
From the limited number of queries I've run, the first 8 results returned from the AJAX API are the same as the first 8 results returned from Google's web interface, but I'm not sure this is always so. In other words, I wouldn't use the AJAX API for SEO just yet.
One last thing: the old SOAP API had a limit of 1000 queries per key, per 24 hours. There are no published limits for the AJAX API, so have at it.
Update on 9/11/2008:
Google has apparently increased their result limit to 64 total results. So you can page through 8 results at a time, up to 64 results.
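So if you want to collect everything the API will give you, you can loop over the start values. A minimal sketch based on the program above (the 64-result ceiling comes from this update):

// Page through the results, 8 at a time, up to the 64-result maximum
// (this goes inside the try block, as in makeQuery above)
for (int start = 0; start < 64; start += 8) {
    URL url = new URL("http://ajax.googleapis.com/ajax/services/search/web" +
            "?start=" + start + "&rsz=large&v=1.0&q=" + query);
    // ...open the connection, read, and parse the JSON exactly as before...
}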
It really helps. Thank you!
Very nicely done. I appreciate the time you took in developing this idea and posting it.
Thank you for taking the time to give us very useful and well-written code.
Do you think Google will expand the results beyond 32?
I don't think Google has any incentive to provide more than 32 results total since most users don't go beyond the first few sets of results. So you're left with screen-scraping if you really need more.
Thanks for the great code!
That was really nice!
Thanks for posting such a useful topic and also covering the pros and cons of this approach.
Did you find any way to get results beyond the fourth page?
Is there any other service that Google provides for making use of their search feature?
Thanks in advance.
Is the advanced search feature available in the same way?
Please reply.
Con- You cannot get more than 32 results. There is a researcher API, but it is only for academic purposes.
It's really very helpful in understanding the basics of the AJAX Search API. I wholeheartedly thank Frank McCown for his remarkable work.
Some concerns I have about Google AJAX Search:
1. If we are using the JSON API, does Google give back only 32 results total even though the overall results may be 23,0043?
Or
2. If we use the Google AJAX API directly in JavaScript, will the total results be fetched from Google?
3. Can we fetch 100 or 200 results at a time using the JSON API + Google AJAX Search API?
I have used the JSON API for Google AJAX web search. I am able to set the start index up to 56 and get results, but if I set the start index to 64 I get an exception stating the response object is not a JSONObject.
So does this mean we can fetch results up to a start index of 56?
It looks like Google's API is now returning 64 results total instead of 32. So if you tell the API to start giving you results starting at 64, the API will burp.
Wahed- I'm not sure I understand your questions. JSON is just the format that the Google AJAX API uses to return results. You cannot get more than 64 results total (8 at a time), even if Google says there are 500 results.
If you use Google's human user interface, you can page through up to 1000 results. Yahoo and Live will also limit you to 1000 results total.
I am getting the following error. Please help me.
Querying for questio verum
Something went wrong...
java.net.NoRouteToHostException: No route to host: connect
Thank you.
Regarding Chinese search: when I use makeQuery("中文"), an error occurs at run time.
Am I missing something, like Google's language parameter?
pQuery = URLEncoder.encode("中文", "UTF-8");
After passing the encoded string to the Google URL, I hope you will get results.
NYC- You may want to check out
http://code.google.com/intl/zh-CN/
Thanks for providing the link for the university search API.
Thank you for this code, you saved me a lot of time! / Many thanks for your explanation and the code, you saved me a lot of work! Bye!
Hi,
I have used that code, but the estimatedResultCount differs from the actual Google site. Can you help me get the same number of results as I get from the Google site?
Thank you so much! This helped me a lot!
Thanks so much for the useful post. By the way, have you tried executing thousands of automated queries continuously using the Google AJAX API? I wonder whether Google places restrictions on this kind of connection.
I have not automated thousands of requests to Google, but I don't think Google would get upset if you did, as long as you kept the number within reason (a few thousand a day).
Oh, it is really okay. Today I executed a batch of 1000 queries twice and nothing went wrong.
Frank, thanks a million.
I have been searching for sample code and yours was a great help!
By any chance, off the top of your head, do you know how I can replicate an exact-phrase search? The equivalent of "wordX wordY" in Google search.
I am working on a linguistics project and am looking for bigrams (word combinations used in English). The idea is to replace bigram probabilities from trained corpora with a pseudo-probability score based on Google search result counts, so the count for wordX or wordY alone will not do.
Thanks!
Never mind my previous question. I just queried with "\"" + query + "\"".
Thanks.
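For anyone else trying this, the exact-phrase trick looks something like the following (a sketch based on the makeQuery method in the post; the bigram value is just an example):

// Exact-phrase search: wrap the terms in escaped double quotes
String bigram = "wordX wordY";
makeQuery("\"" + bigram + "\"");  // URLEncoder inside makeQuery encodes the quotes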
Hi!
Thank you for your precious help!
But I am getting an error :(
I am using the JSON library net.sf.json instead of org.json. The constructor doesn't work in your example because it doesn't accept a String as a parameter:
String response = builder.toString();
JSONObject json = new JSONObject(response);
Any suggestions? What is the difference between org.json and net.sf.json?
Thank you for your time!
The problem was solved by using json-lib-1.1-jdk15.jar instead of the 2.3 version.
Somehow the new version doesn't accept the JSONObject(String) constructor.
Why? I don't know. :P
Hi everybody!
Really useful information, but I have a question:
Is there any way to get the whole content of a result? The content I get with j.getString("content") is just a snippet, and what I want is the whole content of the page. Is this possible?
Thank you!
No, you cannot get the entire content of the result from the API. You must download the content yourself.
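A minimal sketch of downloading a page yourself, using the same classes as the program in the post (the URL would come from a result's "url" field, and the usual try/catch applies):

URL page = new URL("http://frankmccown.blogspot.com/");  // example URL from a result
BufferedReader in = new BufferedReader(
        new InputStreamReader(page.openStream(), "UTF-8"));
StringBuilder html = new StringBuilder();
String line;
while ((line = in.readLine()) != null) {
    html.append(line).append('\n');
}
in.close();
// html now holds the full page source, ready to be parsed or mined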
Thank you, Frank.
What do you think is the best way to download the full content of a page?
I have searched the web but I haven't seen any solution like this. Great job. Congratulations. You are the best, man :)
Many thanks,
It is really a great job!
A million thanks, buddy! So grateful for this. Thanks again! :)
Thanks so much! It saved me a lot of time.
Hey, thanks for this valuable information.
Hello Frank! I would like to know:
1. For what purpose is 'HTTP_REFERER' used?
2. The order of search results displayed by this application differs quite a bit from an actual Google search. Why is that?
By the way, your code was of so much use to me for my web mining project (currently in progress). :) Thanks a ton for that!
Syed, glad the code is helpful. I don't think Google will respond to requests unless the HTTP_REFERER is set. I'm not sure how similar the results are to regular web search. I wrote a paper about the differences years ago that you might be interested in reading:
Agreeing to Disagree: Search Engines and their Public Interfaces
http://www.harding.edu/fmccown/pubs/se-apis-jcdl07.pdf
Yeah! Thanks, Frank. I will surely go through the paper and will get back to you with more queries, if I have any. :)
Hello Frank! I have come across some ideas for extracting useful content from webpages by converting the HTML source of the webpage to XML and then applying heuristics to the XML document. I would like to know if you have come across any such approach, or any other improved approach?
Syed- I'm not aware of any specific research using this method, but a search in Google Scholar would help you find all kinds of research used to mine data from the Web.
ReplyDeleteHi frank , great info ,
ReplyDeleteI would like to know is there a possibility to add a site restriction while constructing the URL ?
In JavaScript its possible to get the Google object and put a site restriction , How to do that in Java , any idea ?
Have you tried using the query "site:mysite.com foo" which searches mysite.com for the word "foo"?
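In Java that's just a matter of what you pass to makeQuery; for example (mysite.com is a placeholder):

// Restrict the search to a single site and look for "foo" there
makeQuery("site:mysite.com foo");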
Hi Frank,
Thank you for the code.
But may I know why the result count differs from the actual Google result count?
Hi Frank, thank you for the code.
Can I use this API in a webpage and pay money to Google?
Anonymous- Like I said earlier in one comment, what the API produces and the web interface produces are often going to be different.
Karim- I don't know anything about paying money to Google.
Hi Frank,
Your code was very helpful. Thanks!
Does the code need any update since Google is moving towards its "New Custom Search API"?
http://googlecode.blogspot.com/2010/11/introducing-google-apis-console-and-our.html
I'm not sure how the new API will affect this code. I would just keep using it until it breaks.
Hi, do you know where I can find the library? I couldn't find it. Thanks for your attention.
You can find the API here: http://code.google.com/apis/websearch/
As I've noted in the body of this blog post, the API was deprecated in Nov 2010. I'm not sure if there is another API that allows you to do general web search using Google.
I am using a network which requires a proxy IP, port, and authentication. How can I change the code to run on my network?
Sorry, Anonymous... can't help you there.
Hello, I was wondering whether there is a way to capture the URL clicked in the returned results using Java, or is that possible with the JavaScript API only?
Thank you very much for this helpful topic. The problem is that I find only 4 results! How can I increase the number of results? Thanks in advance.
Really very helpful to us. But we need to know how to fetch more than 8 results.
Very useful information, Frank! Thanks a ton!
Thanks Frank, it's really helpful!
But as a new learner, I'm confused: if I need to print the results one by one into a list (e.g. a JList), what should I do?
I want to fetch the total result count from the Google search engine, but I am not able to get a count beyond 10 results.
I hope you have a solution.
Thanks, it's very helpful, but I still have a little problem: I want to save the results to a JSON file.
Anyone have an idea?
Hi,
Thanks for the great article. I have a problem: even if I change start to 32, I get only 8 results. Can you tell me what could be the problem?