Tuesday, June 10, 2008

Using Google's AJAX Search API with Java

I was rather sad a year ago when Google deprecated their SOAP Search API in favor of their AJAX Search API. Essentially Google was saying that they didn't want anyone programmatically accessing Google search results unless they were going to present the results unaltered in a rectangular portion of a website. This was particularly troubling to me because, like many academics, I have relied on the API to do automated queries, especially for Warrick.

A few months ago I got a little excited when Google opened their AJAX API to non-JavaScript environments. Google now allows queries through a REST-based interface that returns search results as JSON. The purpose of this API is still to show unaltered results to your website's user, but I don't see anything in the Terms of Use that prevents the API from being used in an automated fashion (having a program regularly execute queries), especially for research purposes, as long as you aren't trying to make money (or prevent Google from making money) from the operation.

UPDATE: The AJAX web search API has been deprecated as of November 1, 2010. I do not know of a suitable replacement.

So, here's what I've learned about using the Google AJAX Search API with Java. I haven't found this information anywhere else on the Web in one spot, so I hope you'll find it useful.

Here's a Java program that queries Google three times. The first query is for the title of this blog (Questio Verum). The second query asks Google if the root page has been indexed, and the third query asks how many pages from this website are indexed. (Please forgive the poor formatting... Blogger thinks it knows better than I how I want my text indented. Argh.)

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;
import java.net.URLConnection;
import java.net.URLEncoder;
import org.json.JSONArray; // JSON library from http://www.json.org/java/
import org.json.JSONObject;

public class GoogleQuery {

    // Put your website here
    private final String HTTP_REFERER = "http://www.example.com/";

    public GoogleQuery() {
        makeQuery("questio verum");
        makeQuery("info:http://frankmccown.blogspot.com/");
        makeQuery("site:frankmccown.blogspot.com");
    }

    private void makeQuery(String query) {

        System.out.println("\nQuerying for " + query);

        try
        {
            // Convert spaces to +, etc. to make a valid URL
            query = URLEncoder.encode(query, "UTF-8");

            URL url = new URL("http://ajax.googleapis.com/ajax/services/search/web?start=0&rsz=large&v=1.0&q=" + query);
            URLConnection connection = url.openConnection();
            connection.addRequestProperty("Referer", HTTP_REFERER);

            // Get the JSON response
            String line;
            StringBuilder builder = new StringBuilder();
            BufferedReader reader = new BufferedReader(
                    new InputStreamReader(connection.getInputStream()));
            while((line = reader.readLine()) != null) {
                builder.append(line);
            }

            String response = builder.toString();
            JSONObject json = new JSONObject(response);

            System.out.println("Total results = " +
                    json.getJSONObject("responseData")
                        .getJSONObject("cursor")
                        .getString("estimatedResultCount"));

            JSONArray ja = json.getJSONObject("responseData")
                    .getJSONArray("results");

            System.out.println("\nResults:");
            for (int i = 0; i < ja.length(); i++) {
                System.out.print((i+1) + ". ");
                JSONObject j = ja.getJSONObject(i);
                System.out.println(j.getString("titleNoFormatting"));
                System.out.println(j.getString("url"));
            }
        }
        catch (Exception e) {
            System.err.println("Something went wrong...");
            e.printStackTrace();
        }
    }

    public static void main(String args[]) {
        new GoogleQuery();
    }
}

Note that this example does not use a key. Although it is suggested you use one, you don't have to. All that is required is that you send your website, or the URL of the webpage that is making the query, in the Referer request header (set from the HTTP_REFERER constant).

When you run this program, you will see the following output:
Querying for questio verum

Total results = 1320

Results:
1. Questio Verum
http://frankmccown.blogspot.com/
2. Questio Verum: URL Canonicalization
http://frankmccown.blogspot.com/2006/04/url-canonicalization.html
3. WikiAnswers - What does questio verum mean
http://wiki.answers.com/Q/What_does_questio_verum_mean
4. Amazon.com: Questio Verum "iracund"'s review of How to Get Happily ...
http://www.amazon.com/review/R3VRSYWW5EJZFH
5. Amazon.com: Profile for Questio Verum
http://www.amazon.com/gp/pdp/profile/A2Q6CLLQPXG55A
6. How and where to get Emerald? - Linux Forums
http://www.linuxforums.org/forum/ubuntu-help/119375-how-where-get-emerald.html
7. Lemme hit that wifi, baby! - Linux Forums
http://www.linuxforums.org/forum/coffee-lounge/122922-lemme-hit-wifi-baby.html
8. [SOLVED] lost in tv tuner hell... please help - Ubuntu Forums
http://ubuntuforums.org/showthread.php%3Fp%3D3802299


Querying for info:http://frankmccown.blogspot.com/

Total results = 1

Results:
1. Questio Verum
http://frankmccown.blogspot.com/


Querying for site:frankmccown.blogspot.com

Total results = 463

Results:
1. Questio Verum
http://frankmccown.blogspot.com/
2. Questio Verum: March 2006
http://frankmccown.blogspot.com/2006_03_01_archive.html
3. Questio Verum: December 2006
http://frankmccown.blogspot.com/2006_12_01_archive.html
4. Questio Verum: June 2006
http://frankmccown.blogspot.com/2006_06_01_archive.html
5. Questio Verum: October 2007
http://frankmccown.blogspot.com/2007_10_01_archive.html
6. Questio Verum: July 2007
http://frankmccown.blogspot.com/2007_07_01_archive.html
7. Questio Verum: April 2006
http://frankmccown.blogspot.com/2006_04_01_archive.html
8. Questio Verum: July 2006
http://frankmccown.blogspot.com/2006_07_01_archive.html

The program is only printing the title of each search result and its URL, but there are many other items you have access to. The partial JSON response looks something like this:
"GsearchResultClass": "GwebSearch",
"cacheUrl": "http://www.google.com/search?q=cache:Euh9Z1rDeXUJ:frankmccown.blogspot.com",
"content": "<b>Questio Verum<\/b>. The adventures of academia, or how I learned to stop worrying and love teacher evaluations.*. Saturday, June 07, 2008 <b>...<\/b>",
"title": "<b>Questio Verum<\/b>",
"titleNoFormatting": "Questio Verum",
"unescapedUrl": "http://frankmccown.blogspot.com/",
"url": "http://frankmccown.blogspot.com/",
"visibleUrl": "frankmccown.blogspot.com"

So, for example, you could display the result's cached URL (Google's copy of the web page) or the snippet (page content) by modifying the code in the example's for loop.
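
Inside the for loop, j.getString("cacheUrl") and j.getString("content") would pull out those two fields the same way j.getString("url") does. One wrinkle is that the snippet comes back with bold tags in it, and the response shown above has no tag-free version of the content field the way it has titleNoFormatting for the title. Here is a minimal sketch of one way to clean the snippet up for console output. The SnippetText class and its stripTags method are my own hypothetical names, not part of the API or the JSON library, and the regular expression is a simple approximation that only removes things that look like tags.

```java
// Hypothetical helper, not part of the Google API or org.json.
final class SnippetText {

    // Removes anything that looks like an HTML tag, e.g. <b> and </b>.
    // This is a rough cleanup for display, not a full HTML parser.
    static String stripTags(String html) {
        return html.replaceAll("<[^>]*>", "");
    }

    public static void main(String[] args) {
        // Sample value shaped like the "content" field after JSON parsing
        String content = "<b>Questio Verum</b>. The adventures of academia <b>...</b>";
        System.out.println(stripTags(content));
    }
}
```

In the loop you would then print stripTags(j.getString("content")) instead of the raw string.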

You'll note that only 8 results are shown for the first and third queries. The AJAX API returns either 8 results per request (rsz=large) or 4 results per request (rsz=small in the query string). Currently there are no other sizes.

You can see additional results (page through the results) by changing start=0 in the query string to start=8 (page 2), start=16 (page 3), or start=24 (page 4). You cannot see anything past the first 32 results. In fact, setting start to any value larger than 24 will result in an org.json.JSONException being thrown. (See my update below.)
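
To make the paging concrete, here is a small sketch that generates the request URL for each page of 8 results, stopping at whatever cap you pass in. The SearchPager class and its pageUrls method are hypothetical names of my own. The endpoint and the start, rsz, v, and q parameters are the same ones used in the program above. It only builds the URLs and does not send any requests.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of the Google API.
final class SearchPager {

    // Endpoint copied from the example program
    private static final String BASE =
            "http://ajax.googleapis.com/ajax/services/search/web";

    // rsz=large returns 8 results per request
    private static final int PAGE_SIZE = 8;

    // Builds one URL per page: start=0, 8, 16, ... while the page
    // still fits under maxResults (e.g. 32).
    static List<String> pageUrls(String query, int maxResults) {
        String q;
        try {
            // Convert spaces to +, etc. to make a valid URL
            q = URLEncoder.encode(query, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is always available on a Java platform
            throw new IllegalStateException(e);
        }

        List<String> urls = new ArrayList<String>();
        for (int start = 0; start + PAGE_SIZE <= maxResults; start += PAGE_SIZE) {
            urls.add(BASE + "?start=" + start + "&rsz=large&v=1.0&q=" + q);
        }
        return urls;
    }

    public static void main(String[] args) {
        // Prints four URLs, with start=0, 8, 16 and 24
        for (String url : pageUrls("questio verum", 32)) {
            System.out.println(url);
        }
    }
}
```

Each URL could then be fetched the same way makeQuery fetches its single URL, with the Referer header set on every request.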

More info on the query string parameters is available here.

From the limited number of queries I've run, the first 8 results returned from the AJAX API are the same as the first 8 results returned from Google's web interface, but I'm not sure this is always so. In other words, I wouldn't use the AJAX API for SEO just yet.

One last thing: the old SOAP API had a limit of 1000 queries per key, per 24 hours. There are no published limits for the AJAX API, so have at it.

Update on 9/11/2008:

Google has apparently increased their result limit to 64 total results. So you can page through 8 results at a time, up to 64 results.