Friday, June 27, 2008

Fav5

My pick of the top five items of interest for the week:
  1. The Los Alamos National Laboratory (where I'm currently working) is now home to the world's fastest supercomputer: the Roadrunner. This machine is the first to break the 1 petaflop barrier.

  2. Live Search Academic is no more. Apparently Microsoft is giving up on their Google Scholar competitor, and for good reason: its interface was horrible and its coverage second-rate. I hate to be so critical, but it really was never a match for Google Scholar.

  3. Scotland's rising exam failure rates are being blamed on students copying information from websites like Wikipedia and passing it off as their own.

  4. Nicholas Carr ponders, Is Google Making Us Stupid? The question he's really asking is whether the Web has somehow changed the way we read and process information.
    The Internet is a machine designed for the efficient and automated collection, transmission, and manipulation of information, and its legions of programmers are intent on finding the “one best method”—the perfect algorithm—to carry out every mental movement of what we’ve come to describe as “knowledge work.”
  5. Gwap, a collection of games that use human intelligence to improve artificial intelligence, was recently launched by Carnegie Mellon's School of Computer Science. Readwrite has more about it.

Wednesday, June 25, 2008

GUI blooper: HTML2Image

There are more than a few GUI bloopers in this shareware product, which takes snapshots of web pages and stores them as images. A screen shot of the application is below.

Blooper #1: Disabling controls

Notice that all the controls are enabled in the "Save As Image" group box. But if you try to work with any of the controls, they will ignore your input. Only when the "Save As Image" group is checked will the controls respond. This is a no-no: the controls should be disabled and grayed-out to show that they are not accessible until "Save As Image" is checked.

Additionally, the spinner to the right of the "Crop Height" checkbox should be grayed out when "Crop Height" is not checked.
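Here's a minimal Swing sketch of the corrected behavior (the class and control names are hypothetical, not HTML2Image's actual code): the dependent controls start out disabled and are only enabled while their governing checkbox is checked.

import java.awt.event.ItemEvent;
import java.awt.event.ItemListener;
import javax.swing.*;

public class SaveAsImagePanel extends JPanel {

    public SaveAsImagePanel() {
        final JCheckBox saveAsImage = new JCheckBox("Save As Image");
        final JCheckBox cropHeight = new JCheckBox("Crop Height");
        final JSpinner cropHeightSpinner =
            new JSpinner(new SpinnerNumberModel(600, 1, 10000, 1));

        // Dependent controls start out disabled (grayed out)
        cropHeight.setEnabled(false);
        cropHeightSpinner.setEnabled(false);

        // Enable the group only while "Save As Image" is checked
        saveAsImage.addItemListener(new ItemListener() {
            public void itemStateChanged(ItemEvent e) {
                boolean on = saveAsImage.isSelected();
                cropHeight.setEnabled(on);
                cropHeightSpinner.setEnabled(on && cropHeight.isSelected());
            }
        });

        // The spinner is usable only while "Crop Height" itself is checked
        cropHeight.addItemListener(new ItemListener() {
            public void itemStateChanged(ItemEvent e) {
                cropHeightSpinner.setEnabled(
                    saveAsImage.isSelected() && cropHeight.isSelected());
            }
        });

        add(saveAsImage);
        add(cropHeight);
        add(cropHeightSpinner);
    }
}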

Blooper #2: Grammar


Programmers often overlook grammar and spelling, especially since spell checkers are not often part of programming environments. In this case the help hint that appears above reads "the same time you capturing main image" when of course it should read "the same time you are capturing the main image." Also, what is the "main" image? This vocabulary is not used anywhere on the GUI and could confuse the user.

Blooper #3: Feedback

And finally, once the user clicks on the Save Image button, there's no indication that the somewhat time-consuming job has completed. The status bar at the bottom should at least say "Image saved." or something to that effect. Otherwise the user sits there, staring at the window, wondering whether the job has finished or not.
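Here's one way that feedback could be provided, assuming a Swing UI (the method and label names below are hypothetical, not HTML2Image's actual code): run the slow save in a SwingWorker and update the status bar when it finishes.

import javax.swing.*;

public class SaveWithFeedback {

    // Called from the Save Image button's ActionListener
    static void saveImage(final JLabel statusBar) {
        statusBar.setText("Saving image...");
        new SwingWorker<Void, Void>() {
            protected Void doInBackground() throws Exception {
                captureAndSavePage();   // the long-running capture/save work
                return null;
            }
            protected void done() {
                statusBar.setText("Image saved.");   // tell the user it finished
            }
        }.execute();
    }

    // Stand-in for the real capture-and-save code
    private static void captureAndSavePage() {
        try { Thread.sleep(3000); } catch (InterruptedException ignored) { }
    }
}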

As always, if any of you have other GUI bloopers to share, my mailbox is always open.

Monday, June 23, 2008

Archive spam

So that's why Hanzo:web has been down...


Spam: A tragedy of the commons. Looks like they are not the only web archiving site to be plagued by spam; Spurl.net has had this notice posted on its website for months, apologizing for offering "reduced functionality due to heavy spam attacks":



Update 7-1-08:

I received an email from Mark Middleton this morning informing me that Hanzo:web's free archiving service has been discontinued. There's really no other free web archiving service to take Hanzo:web's place: WebCite is mainly for academic citations, Furl and Spurl only archive single HTML pages and don't make archived materials publicly available, and Archive-It charges a subscription fee.

Saturday, June 21, 2008

Back from JCDL 2008

I'm back from JCDL. It was an enjoyable week, especially since Becky was able to join me. It was great seeing so many people from previous years as well. When the JCDL crowd was asked Thurs at lunch what their favorite thing was at JCDL, someone yelled out "the people." I would have to agree.

I received some good feedback on my two talks. A number of people told me they thought injecting a website's server components into its crawlable pages was really a neat idea. One individual said he might experiment with encoding data into a YouTube video in a similar experiment. I noticed yesterday that a number of individuals (some from the conference) have submitted websites to reconstruct to Warrick.

Pittsburgh is truly under-rated as a city... it is one of the more beautiful cities I've visited. It has over 400 bridges, more than any other city in the US.

Wed night the conference attendees ate dinner on a boat that bounced around the three rivers cutting through Pittsburgh. Heinz Field (where the Steelers play) and PNC Park (where the Pirates play) are located directly off the river and have excellent views of the city.

Thurs night Becky and I rode the Duquesne Incline to the Mount Washington neighborhood and had a nice dinner overlooking the city. We didn't have a camera, but the picture above is what we saw. We walked about a mile as the sun set and took the Monongahela Incline back down.

Becky is going to fly up with Ethan to New Mexico tomorrow. Can't wait to see my boy!

Sunday, June 15, 2008

JCDL in Pittsburgh

I'm in Pittsburgh for JCDL 2008. Tomorrow I'll be participating in the Doctoral Consortium, this time as a committee member (I participated as a student two years ago). I'm chairing a session on Tues (Automatic Tools for Digital Libraries) and presenting a paper in the afternoon (Recovering a Website's Server Components from the Web Infrastructure).

Becky will be flying in Tues evening. I'm really looking forward to seeing her since we've been apart now for two weeks. And she'll get to attend the conference dinner and see my second presentation on Thurs (Usage Analysis of a Public Website Reconstruction Tool).

I'm also looking forward to seeing the ol' ODU gang: Michael, Joan, and Martin.

I'll report later on some of the more interesting presentations.

Saturday, June 14, 2008

TouchGraph Facebook Browser

If you have a Facebook profile and want to see which of your friends are in your "inner circle", you might want to check out the TouchGraph Facebook Browser. Below is a screen shot showing the social connections between me and my "closest" 50 friends.


The connections between nodes are computed using betweenness centrality, a measure of a person's importance in a social network. This measure gives more weight to friends who bridge otherwise separate cliques. I'm not sure what the size of the nodes corresponds to.
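Roughly speaking, a person's betweenness is the fraction of shortest "friendship paths" between every other pair of people that pass through them. Here's a brute-force sketch of the idea on a tiny, made-up friend graph (the names and friendships are hypothetical, and this illustrates the measure itself, not TouchGraph's actual algorithm):

import java.util.*;

public class BetweennessSketch {

    private static Map<String, List<String>> graph = new HashMap<String, List<String>>();

    private static void addFriendship(String a, String b) {
        if (!graph.containsKey(a)) graph.put(a, new ArrayList<String>());
        if (!graph.containsKey(b)) graph.put(b, new ArrayList<String>());
        graph.get(a).add(b);
        graph.get(b).add(a);
    }

    public static void main(String[] args) {
        // Hypothetical friendships
        addFriendship("Me", "Becky");
        addFriendship("Me", "Hank");
        addFriendship("Becky", "Hank");
        addFriendship("Hank", "Jim");
        addFriendship("Jim", "Mark");

        for (String person : graph.keySet())
            System.out.println(person + ": " + betweenness(person));
    }

    // Fraction of shortest paths between every other pair that pass through v
    private static double betweenness(String v) {
        double score = 0;
        List<String> nodes = new ArrayList<String>(graph.keySet());
        for (String s : nodes) {
            for (String t : nodes) {
                if (s.equals(t) || s.equals(v) || t.equals(v)) continue;
                List<List<String>> paths = shortestPaths(s, t);
                if (paths.isEmpty()) continue;
                int through = 0;
                for (List<String> p : paths)
                    if (p.contains(v)) through++;
                score += (double) through / paths.size();
            }
        }
        return score / 2;   // each unordered pair was counted twice
    }

    // Enumerate all shortest paths from s to t with a breadth-first search
    private static List<List<String>> shortestPaths(String s, String t) {
        List<List<String>> found = new ArrayList<List<String>>();
        Queue<List<String>> queue = new LinkedList<List<String>>();
        List<String> start = new ArrayList<String>();
        start.add(s);
        queue.add(start);
        int best = Integer.MAX_VALUE;

        while (!queue.isEmpty()) {
            List<String> path = queue.remove();
            if (path.size() > best) continue;
            String last = path.get(path.size() - 1);
            if (last.equals(t)) {
                best = path.size();
                found.add(path);
                continue;
            }
            for (String next : graph.get(last)) {
                if (!path.contains(next)) {
                    List<String> longer = new ArrayList<String>(path);
                    longer.add(next);
                    queue.add(longer);
                }
            }
        }
        return found;
    }
}

In this toy graph, Hank gets the highest score because he is the only bridge between the Me/Becky cluster and Jim and Mark.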

The colors of the nodes indicate cliques/clusters: members of a clique have many connections to others within the group but few connections to people outside it.
  • The purple group are mostly friends who were in the social club Knights with me in college.
  • The red group are mainly active Harding students, employees, or spouses. Becky is in this group.
  • The green group are people I know from when I lived in Denver years ago. My sister Sara is in this group.
  • The blue group are other friends from college.
The large green circles are the networks my friends belong to. Most of the people in this graph are in the Harding or Little Rock network.

When I boil it down to only 4 friends, I'm left with Becky (78 friends in common), Hank Bingham (48), Jim Miller (45), and Mark Elrod (45). What would be interesting is if these connections could be recomputed based on communication levels... which friends do I communicate with the most?

Friday, June 13, 2008

Fav5

My pick of the week's top 5 items of interest:
  1. According to a RAND study, the US is still tops in science and technology. From the report:
    The United States accounts for 40 percent of the total world's spending on scientific research and development, employs 70 percent of the world's Nobel Prize winners and is home to three-quarters of the world's top 40 universities.
    How long will we remain #1? Not long if we continue to make it difficult for foreigners to study here. Americans just aren't majoring in technical fields like computer science the way they used to.

  2. Someone has actually made a play about the publicly released AOL search queries from 2006. The play focuses on User 927, one of the "anonymous" users whose queries ranged from innocuous to downright deviant. hmmm.... I think I'll skip this one.

  3. Just this week, an Illinois public official dropped his requests to force MySpace to unveil the creators of several "defamatory" profiles that spoofed his identity. Stinks having web-savvy enemies.

  4. Matt Cutts confirmed that the file extension of your web pages is very important to Google. The Googlebot will not crawl pages with extensions for binary content like .exe, .dll, .tar, etc. (at least not yet). Matt gives an interesting example: a URL ending with "/web2.0" will be rejected by Googlebot, but "/web2.0/" will be accepted.

  5. Just for fun: try the Bird Flocking Behavior Simulator. The Java applet models the flocking behavior of birds (blue and green flocks). You can place obstacles in their path (left click), give them food (right click), and add predators (red birds which eat the blue and green birds). I had a little fun trapping several of them inside a barrier. (Yes, I do have better things to do with my time. ;-)


Thursday, June 12, 2008

What happens when your archive goes down?



I think it's only temporary, but what if you were relying on Hanzo:web to archive your treasured websites, only to have them phase out their service or go out of business? Think it unlikely?

(Hanzo:web has been around for at least 3 years, but they're still in beta... sounds like they're pulling a Google.)


Update on 7/1/2008:

No, they're down for good.

Tuesday, June 10, 2008

Using Google's AJAX Search API with Java

I was rather sad a year ago when Google deprecated their SOAP Search API in favor of their AJAX Search API. Essentially Google was saying that they didn't want anyone programmatically accessing Google search results unless they were going to present the results unaltered in a rectangular portion of a website. This was particularly troubling to me because, like many academics, I have relied on the API to do automated queries, especially for Warrick.

A few months ago I got a little excited when Google opened their AJAX API to non-JavaScript environments. Google now allows queries through a REST-based interface that returns search results as JSON. The purpose of this API is still to show unaltered results to your website's users, but I don't see anything in the Terms of Use that prevents the API from being used in an automated fashion (having a program regularly execute queries), especially for research purposes, as long as you aren't trying to make money (or prevent Google from making money) from the operation.

UPDATE: The AJAX web search API has been deprecated as of November 1, 2010. I do not know of a suitable replacement.

So, here's what I've learned about using the Google AJAX Search API with Java. I haven't found this information anywhere else on the Web in one spot, so I hope you'll find it useful.

Here's a Java program that queries Google three times. The first query is for the title of this blog (Questio Verum). The second query asks Google if the root page of this blog has been indexed, and the third query asks how many pages from this website are indexed.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;
import java.net.URLConnection;
import java.net.URLEncoder;
import org.json.JSONArray; // JSON library from http://www.json.org/java/
import org.json.JSONObject;

public class GoogleQuery {

    // Put your website here
    private final String HTTP_REFERER = "http://www.example.com/";

    public GoogleQuery() {
        makeQuery("questio verum");
        makeQuery("info:http://frankmccown.blogspot.com/");
        makeQuery("site:frankmccown.blogspot.com");
    }

    private void makeQuery(String query) {

        System.out.println("\nQuerying for " + query);

        try {
            // Convert spaces to +, etc. to make a valid URL
            query = URLEncoder.encode(query, "UTF-8");

            URL url = new URL("http://ajax.googleapis.com/ajax/services/search/web?start=0&rsz=large&v=1.0&q=" + query);
            URLConnection connection = url.openConnection();
            connection.addRequestProperty("Referer", HTTP_REFERER);

            // Get the JSON response
            String line;
            StringBuilder builder = new StringBuilder();
            BufferedReader reader = new BufferedReader(
                new InputStreamReader(connection.getInputStream()));
            while ((line = reader.readLine()) != null) {
                builder.append(line);
            }

            String response = builder.toString();
            JSONObject json = new JSONObject(response);

            System.out.println("Total results = " +
                json.getJSONObject("responseData")
                    .getJSONObject("cursor")
                    .getString("estimatedResultCount"));

            JSONArray ja = json.getJSONObject("responseData")
                               .getJSONArray("results");

            System.out.println("\nResults:");
            for (int i = 0; i < ja.length(); i++) {
                System.out.print((i+1) + ". ");
                JSONObject j = ja.getJSONObject(i);
                System.out.println(j.getString("titleNoFormatting"));
                System.out.println(j.getString("url"));
            }
        }
        catch (Exception e) {
            System.err.println("Something went wrong...");
            e.printStackTrace();
        }
    }

    public static void main(String args[]) {
        new GoogleQuery();
    }
}

Note that this example does not use a key. Although it is suggested you use one, you don't have to. All that is required is that you identify your website or the URL of the webpage that is making the query in the Referer header (set from the HTTP_REFERER constant in the code above).

When you run this program, you will see the following output:
Querying for questio verum

Total results = 1320

Results:
1. Questio Verum
http://frankmccown.blogspot.com/
2. Questio Verum: URL Canonicalization
http://frankmccown.blogspot.com/2006/04/url-canonicalization.html
3. WikiAnswers - What does questio verum mean
http://wiki.answers.com/Q/What_does_questio_verum_mean
4. Amazon.com: Questio Verum "iracund"'s review of How to Get Happily ...
http://www.amazon.com/review/R3VRSYWW5EJZFH
5. Amazon.com: Profile for Questio Verum
http://www.amazon.com/gp/pdp/profile/A2Q6CLLQPXG55A
6. How and where to get Emerald? - Linux Forums
http://www.linuxforums.org/forum/ubuntu-help/119375-how-where-get-emerald.html
7. Lemme hit that wifi, baby! - Linux Forums
http://www.linuxforums.org/forum/coffee-lounge/122922-lemme-hit-wifi-baby.html
8. [SOLVED] lost in tv tuner hell... please help - Ubuntu Forums
http://ubuntuforums.org/showthread.php%3Fp%3D3802299


Querying for info:http://frankmccown.blogspot.com/

Total results = 1

Results:
1. Questio Verum
http://frankmccown.blogspot.com/


Querying for site:frankmccown.blogspot.com

Total results = 463

Results:
1. Questio Verum
http://frankmccown.blogspot.com/
2. Questio Verum: March 2006
http://frankmccown.blogspot.com/2006_03_01_archive.html
3. Questio Verum: December 2006
http://frankmccown.blogspot.com/2006_12_01_archive.html
4. Questio Verum: June 2006
http://frankmccown.blogspot.com/2006_06_01_archive.html
5. Questio Verum: October 2007
http://frankmccown.blogspot.com/2007_10_01_archive.html
6. Questio Verum: July 2007
http://frankmccown.blogspot.com/2007_07_01_archive.html
7. Questio Verum: April 2006
http://frankmccown.blogspot.com/2006_04_01_archive.html
8. Questio Verum: July 2006
http://frankmccown.blogspot.com/2006_07_01_archive.html

The program is only printing the title of each search result and its URL, but there are many other items you have access to. The partial JSON response looks something like this:
"GsearchResultClass": "GwebSearch",
"cacheUrl": "http://www.google.com/search?q=cache:Euh9Z1rDeXUJ:frankmccown.blogspot.com",
"content": "<b>Questio Verum<\/b>. The adventures of academia, or how I learned to stop worrying and love teacher evaluations.*. Saturday, June 07, 2008 <b>...<\/b>",
"title": "<b>Questio Verum<\/b>",
"titleNoFormatting": "Questio Verum",
"unescapedUrl": "http://frankmccown.blogspot.com/",
"url": "http://frankmccown.blogspot.com/",
"visibleUrl": "frankmccown.blogspot.com"

So, for example, you could display the result's cached URL (Google's copy of the web page) or the snippet (page content) by modifying the code in the example's for loop.
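For instance, the loop body in the example above could be extended like this (optString is used for the extra fields since they aren't guaranteed to be present in every result):

for (int i = 0; i < ja.length(); i++) {
    JSONObject j = ja.getJSONObject(i);
    System.out.println((i+1) + ". " + j.getString("titleNoFormatting"));
    System.out.println(j.getString("url"));
    System.out.println(j.optString("cacheUrl"));   // Google's cached copy of the page
    System.out.println(j.optString("content"));    // HTML snippet of the page content
}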

You'll note that only 8 results are shown for the first and third queries. The AJAX API will only return either 8 results or 4 results (by changing rsz=large to rsz=small in the query string). Currently there are no other sizes.

You can see additional results (page through the results) by changing start=0 in the query string to start=8 (page 2), start=16 (page 3), or start=24 (page 4). You cannot see anything past the first 32 results. In fact, setting start to any value larger than 24 will result in an org.json.JSONException being thrown. (See my update below.)
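Here's a small sketch of how you might page through everything that is available by varying the start parameter (it just builds the same URLs the GoogleQuery example uses; each one would be fetched and parsed exactly as in makeQuery() above):

import java.net.URL;
import java.net.URLEncoder;

public class PagedQuery {
    public static void main(String[] args) throws Exception {
        String query = URLEncoder.encode("questio verum", "UTF-8");

        // Pages 1-4: start = 0, 8, 16, 24 (larger start values throw an exception)
        for (int start = 0; start <= 24; start += 8) {
            URL url = new URL("http://ajax.googleapis.com/ajax/services/search/web" +
                "?rsz=large&v=1.0&start=" + start + "&q=" + query);
            System.out.println("Page " + (start / 8 + 1) + ": " + url);
            // Open the connection and parse the JSON as in the GoogleQuery example
        }
    }
}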

More info on the query string parameters is available here.

From the limited number of queries I've run, the first 8 results returned from the AJAX API are the same as the first 8 results returned from Google's web interface, but I'm not sure this is always so. In other words, I wouldn't use the AJAX API for SEO just yet.

One last thing: the old SOAP API had a limit of 1000 queries per key, per 24 hours. There are no published limits for the AJAX API, so have at it.

Update on 9/11/2008:

Google has apparently increased their result limit to 64 total results. So you can page through 8 results at a time, up to 64 results.

Saturday, June 07, 2008

Fav5

My pick of the week's top five items of interest:
  1. How valuable is your old high school or college yearbook? At Purdue University, this year's yearbook will be the last. In fact, only 80 US colleges still produce yearbooks, down from 100 last year. Interest is declining, with part of the blame falling on social networking sites like Facebook. What today's students don't realize is that 20 years from now they may not have access to the memories they have now... free services like Facebook have no obligation to retain them indefinitely.

  2. Want to save some electricity and don't mind black? Try out Blackle. Blackle displays search results straight from Google, but because it displays them on a primarily black screen, it uses considerably less power than Google, which uses white. A similar search engine is called Blaxel, but I don't think the two are affiliated.



    Update 6-9-08:

    My Finnish amigo has pointed out the error in my post. Apparently only CRTs save energy by displaying black; the newer LCD monitors, which most everyone is using today, actually take more energy to display black. So Blackle and Blaxel are actually wasting more energy! Thanks for the tip, Timo.

  3. The Digital Lives research project is conducting a quick survey of how individuals store personal computer files, find them in the future, and archive them. If you have 10 minutes and want to contribute to digital preservation research (and possibly win £200 in British Library gift vouchers), please take the survey.

  4. This is kinda cool: Yahoo has opened up its search results page to developers using a new platform called SearchMonkey. They've also developed a listing of numerous SearchMonkey plug-ins in their Yahoo Search Gallery. Do you want to see details of a movie when searching Yahoo? Download the IMDB presentation enhancement.

  5. I'm not totally sure what to make of this: a new search engine called RushmoreDrive tailors search results for the black community only. Apparently Google is too white; African Americans want different search results than European Americans, Asian Americans, etc. While I agree that web search which takes into account the user's profile (e.g., interests, age, gender, location) is likely to produce better results, creating a search engine that caters only to one racial group smacks of racism. While I'm sure this isn't RushmoreDrive's intention, wouldn't we all agree that a search engine called Whitey.com, built only for the white community, would be considered racist?

Thursday, June 05, 2008

Ethan time

I really miss my kid (and of course my wife!). Thankfully Becky sent me some photos, and I thought I'd share a few.


Missed a spot


Boys will be boys


You found me!


Grampy's retirement lunch

Sitting out on the porch.


Glamor Shots pose


Exploring the backyard


Dad, this water's cold!


All Pro


Yep, the dryer's working fine!


Programming at a young age

Goofing around with Uncle Andy

Sunday, June 01, 2008

I'm in Los Alamos

I left Friday morning for Los Alamos and arrived Saturday afternoon (I stayed overnight in Amarillo with some friends). This place is incredibly beautiful. Picturesque. I wish I had my digital camera, but I left it back in Arkansas with the wife and kid, just in case Ethan learns to do a one-handed handstand (wouldn't you want a photo of your 15-month-old doing that?). You can browse some photos at Flickr to get a feel for what it looks like.

Anyway, I got us set up in an apartment and loaded up on groceries. I'm getting the place all set up for Becky and Ethan, who are coming up in a few weeks. Also, I drove around to get familiar with the layout, and I marveled at the houses that are built literally 2 feet away from a straight drop of hundreds of feet into the crevices below.

This morning I worshiped with the Los Alamos Church of Christ; one of their members was a Harding CS faculty member in the 1980s. Now I'm checking email, etc. at the library. For some reason they are using IE 2.0 on Windows 85 or something, so it's like I've jumped back in time. Each time I press a key, I wait a second for it to appear on the screen... really fun. Also, the people around me are giving me looks because I'm apparently typing loudly.

I'm starting work at LANL tomorrow, and hopefully I'll get the Internet turned on at home. It's really boring sitting in an empty apartment with no family, no TV, and no Internet! If any of you get the chance, send Becky an encouraging email or Facebook message... it's tough dealing with a toddler by your lonesome!