Thursday, July 31, 2008

NameVoyager: Baby name visualization over time

Today I ran across a novel visualization tool called NameVoyager. It lets you see how the popularity of baby names changes over time. The data is collected from the Social Security Administration.

From the screenshot below, you can see all boys' names that start with E. Although Edward was very popular in the late 1800s, Eric started to dominate in the 1950s, and Elijah, Evan, and Ethan (our son's name, ranked #3 in 2007) have taken over since the 1990s.


My name hasn't fared so well (#6 in the 1880s, #262 in 2007). My wife's name reached a high of #13 in the 1970s, but it has since dropped to #105.

The creator of NameVoyager, Martin Wattenberg, writes about it in a white paper entitled Baby Names, Visualization, and Social Data Analysis.

BTW, if you are really interested in how your name affects your socioeconomic status, check out chapter 6 of Freakonomics by Steven Levitt.

Update

Another interesting name visualization tool is NameTrends.net. In addition to showing popularity of names over time, they also show geographical popularity of baby names. The map below, set at 1992, shows how Ethan was really popular in the northeast and north-central states before it became popular in all the states.


Monday, July 28, 2008

How cool is Cuil?

Today a new competitor enters the world of Web search: Cuil (pronounced "cool"). What's notable about this newcomer is that its president and founder, Anna Patterson, is an ex-Googler, as are several of Cuil's VPs.

In 2004, Patterson developed a search engine called Recall that was used to search the Internet Archive's massive corpus (apparently the search engine didn't last long... the Archive is only searchable by URL today). Shortly thereafter, she was hired by Google, only to leave in 2006 to start up her own Google competitor. How much of Google's intellectual property went with her? That's a tough one to answer.

So why does Cuil think it can compete with Google?
  1. Cuil supposedly indexes three times as much content as Google.
  2. Cuil presents results in a magazine-like, multi-column format with more snippet text than Google, including embedded images.
  3. Cuil has an "Explore by Categories" widget that attempts to categorize pages.
Considering Google doesn't index every page they know about, it's hard to argue that the size difference is really significant. What will make or break their search engine is the quality of results and the interface. Some have already done some testing and named Google the winner. I did a little test querying for my name, and the results were not quite up to par.


Here's how I scored it:
  • Result number 1 (top-left) links to my old website at Old Dominion University instead of my current site at Harding (next result to the right). -1 point
  • The photo of me in result 1 comes from a different website entirely, so I'm impressed they made the connection. +1 point
  • The photo in the Harding result is not me (wish I was that tan). -1 point
  • The result at the bottom-left is from DBLP which indexes academic papers. It's certainly relevant. +1 point
  • The photo in the DBLP result is not me (I'm a lot more buff); it's the actor Frank McCown, better known as Rory Calhoun. -1 point
  • The next result to the right points to a celebrity entry for Frank McCown on AOL's Television website. This is a website that does its own web mining and erroneously marked my blog and Harding website as belonging to Frank McCown the actor. (BTW, this is a really tough problem to solve.) -1 point
  • The first categorization labels in the upper-right under Digital Libraries were somewhat descriptive of my research interests or projects I've been involved with: Digital preservation, Open Archives Initiative, and LOCKSS. +1 point
  • But when I click on National Science Digital Library, I get 0 results. -1 point
So my overall score: -2 points. Using the same query at Google shows 8 of the top 10 results are about me (result #1 points to my blog, #3 to my Harding website), but Google is less ambitious and doesn't mix in photos or categories. Still, I'd have to give Google a higher score than -2.

Does anyone else have any thoughts on Cuil?

Update on 7/30/2008:

Someone at Java Rants has created a parody of Cuil using Yahoo's new BOSS Search API: Yuil.

Friday, July 25, 2008

Fav5

My pick of this week's top 5 items of interest:
  1. It's official: Google's Knol, a Wikipedia-like source of user-contributed information, is now available to the public. Lots of people are talking about it. I found the interface a little lacking... there's a search box and a list of some randomly selected Knols, but I can't figure out how to browse by subject (of course it's difficult to do this in Wikipedia as well). Also I can't tell how many knols there are yet; a search for "a" shows about 80 knols, and most appear to be health-related. My guess is the tech guys may stick with Wikipedia.

  2. At this week's SIGIR'08 conference, Microsoft Research presented a paper called BrowseRank: Letting Web Users Vote for Page Importance. The paper introduces a new relevance ranking algorithm called BrowseRank. Instead of relying on the web graph to assign web page importance as PageRank does, BrowseRank assigns importance based on users' browsing behavior. A good summary of the paper is at CNET News.

  3. Facebook will soon be using Microsoft's web search technology to give search results and sponsored ads. Currently Google is powering MySpace.

  4. A new report from the antivirus company Sophos says it detected over 16,000 malicious web pages each day in the first half of 2008, most using SQL-injection techniques. Blogspot.com hosts the largest number of malicious web pages, mostly because of how easy it is to set up a blog with this service and to inject

  5. You'd better hurry: The domain ☼.com is still available! Oh, and Google has discovered at least 1 trillion pages on the Web.

Wednesday, July 23, 2008

Harding University - Oops! Couldn't find that.

You know you're in trouble when your university website gives you an error message like that.


Looks like the Harding IT guys are having some technical difficulties today... maybe they need to use Warrick. ;-)

Update on 7/25/2008:

Apparently there was a catastrophic disk failure; the IT guys are working through the weekend to restore data and get all the systems back up. Unfortunately, about half of the faculty and staff lost about a week's worth of data because the backups were being done improperly.

Monday, July 21, 2008

Webcrawlers wanted

I'm not a huge fan of advertisements, but this ad on Facebook, sponsored by AddGooro, is just too cool. If any of my search engine students are reading this... they're hiring.

Saturday, July 19, 2008

Fav5

My pick of the week's top 5 items of interest:
  1. With the growing popularity of the iPhone, website designers are paying much more attention to how their sites look on devices with very small real estate.

  2. Will Microsoft bite at $33 a share for Yahoo?

  3. We're getting just a little closer to realizing quantum computing.

  4. This will be of interest to many of my students: How much do game programmers make?

  5. Last fall, Harding Univ made the move to Google Mail, freeing us from the burden of maintaining our own email system. Unfortunately, the IT guys left all of our old emails on the old system and didn't forward them on to our new GMail accounts. I just received an email saying in the next few weeks the IT guys will purge all of our emails from the old system. Anyone want to bet some very important email is about to be erased forever? (I did move my email over, but it took a few hours... I would hate to have lost emails from when Becky and I were dating.)

Our fifth year anniversary

Today Becky and I are celebrating 5 years of marital bliss. :-) OK, it's not all bliss, but it is wonderful being married to Becky. My prayer is that my son might some day find a wife as intelligent, beautiful, and funny as his mom, and someone who encourages him to be faithful to the Lord.

The photo below was taken a few months before our wedding. (Thanks, Jeff.)


To celebrate, we are leaving Ethan with a babysitter and going out to a small town called Chimayo where there's a famous little chapel called Santuario de Chimayó and a romantic restaurant called Rancho de Chimayó Restaurante. Should be a lot of fun.

Thursday, July 17, 2008

Wal-Mart growth: 1962-2007

This is one cool visualization. Watch as Wal-Mart grows like a cancer, beginning from Bentonville, Arkansas. (It's amazing to think the world's most powerful discount store started in humble AR...)

Neo4j

I attended an interesting technical talk with the Proto Team yesterday down at the Santa Fe Complex. Emil Eifrem of Neo Technologies shared with us their open source Neo4j project, a high-performance graph database implemented in Java.

A graph database is very different from a relational database; rather than storing data in tables of rows and columns, data is stored in a graph data structure (nodes, relationships, and properties) which is obviously a more intuitive model for networks. Such a database is ideal for storing RDF, social networks, co-authorship networks, etc. Although relational databases can be used to represent graphs, answering queries like "Who are all the friends of everyone who likes ice cream" requires many joins to be performed which takes a lot of processing time.
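To make the ice cream query concrete, here is a minimal in-memory sketch in plain Java. It is not Neo4j's API; the class name, method names, and sample people are all made up for illustration. It only shows the idea that once relationships are stored with each node, the query becomes a short traversal rather than a series of joins.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical toy graph: nodes are people, "friend" relationships are
// adjacency sets, and "likes" is a property stored per node.
class IceCreamGraph {
    private final Map<String, Set<String>> friends = new HashMap<>();
    private final Map<String, Set<String>> likes = new HashMap<>();

    // Friendship is stored in both directions (undirected relationship).
    void addFriendship(String a, String b) {
        friends.computeIfAbsent(a, k -> new TreeSet<>()).add(b);
        friends.computeIfAbsent(b, k -> new TreeSet<>()).add(a);
    }

    void addLike(String person, String thing) {
        likes.computeIfAbsent(person, k -> new TreeSet<>()).add(thing);
    }

    // "Who are all the friends of everyone who likes <thing>?"
    // Find each fan, then follow that node's friend relationships directly.
    Set<String> friendsOfFans(String thing) {
        Set<String> result = new TreeSet<>();
        for (Map.Entry<String, Set<String>> entry : likes.entrySet()) {
            if (entry.getValue().contains(thing)) {
                result.addAll(friends.getOrDefault(entry.getKey(),
                        Collections.emptySet()));
            }
        }
        return result;
    }

    public static void main(String[] args) {
        IceCreamGraph g = new IceCreamGraph();
        g.addFriendship("Ann", "Bob");
        g.addFriendship("Ann", "Cal");
        g.addFriendship("Dee", "Eve");
        g.addLike("Ann", "ice cream");
        g.addLike("Dee", "ice cream");
        g.addLike("Bob", "pie");
        System.out.println(g.friendsOfFans("ice cream")); // prints [Bob, Cal, Eve]
    }
}
```

A real graph database adds persistence, indexing, and transactions on top of this, but the traversal step looks much the same.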

Emil noted that although everyone he talks to says they know what a graph database is, they practically don't exist. Wikipedia doesn't even have an entry entitled graph database, and the database article doesn't mention them at all (graph databases are distinct from the network model). Here are some slides that give a good overview of graph databases and a survey paper by the same authors.

Monday, July 14, 2008

When the Internet is my hard drive

This morning I came across a commentary by Bruce Schneier at Wired entitled When the Internet Is My Hard Drive, Should I Trust Third Parties? Schneier worries about the loss of data that is stored on the Web, including web pages and websites that are not under his control.

Schneier first notes how the Wine Therapy web bulletin board, a place where wine aficionados posted and shared information since 2000, was lost when someone hacked the site and deleted the database. The site owner had been sick and was not keeping a backup. (Join the club.)

After a few anecdotes about broken travel links, the loss of the blogging website Greatest Journal, and MySpace losing control of members' personal data, Schneier says of our online data: "there's no way to predict what will disappear when." And although there are some emerging personal archiving tools, "we don't know which bits we want until they're no longer there." He sadly concludes that "there's not much we can do about it."

Some of the comments are interesting to read. One reader says that "Link rot is actually a healthy way for clutter to evaporate from the web leaving room for useful information." I'd like to see this person's reaction when their blog "clutter" disappears one day. ;-)

Someone named Daniel commented that there's no simple way to back up your dynamic website, but they remain hopeful: "[I] Also expect that somebody will make it easy to backup your site--and not just from a browser perspective--and then we'll just have to wait until it becomes so cheap it's ubiquitous." Hmm... maybe I'm on to something here.

Sunday, July 13, 2008

Los Alamos in the summer

We're halfway through our stay here in Atomic City (Los Alamos), and I thought I'd bring everyone up to speed on what we've been doing.

I picked up Becky and Ethan in mid-June from the airport in Albuquerque. The first thing Becky noticed was how many of the buildings, especially in Santa Fe, were built in a rather distinctive adobe style. Yes, even Target:


It took a little adjusting to life in a small apartment with no A/C, but we did it. What really helps is having incredible views all around us



with plenty of places to hike



and play (there are like 10 billion playgrounds)


and a good church which has been very welcoming.

We got Ethan his first haircut just a week after arriving, and despite his facial expression (and the outrageous $16 price), he didn't cry a bit:


So far we've visited a few places including Bandelier National Monument



Taos (where many hippies call home),



and the Albuquerque Zoo (with other moms and kids from church).


I'll report more on our summer in the weeks to follow.

Friday, July 11, 2008

Fav5

My pick of the week's top 5 items of interest:
  1. Good news: the number of people working in IT jobs in the US has hit a record 4 million, and the unemployment rate is a meager 2.3%. Advice to incoming freshmen: Give computer science a try.

  2. Chill. It's just a phone.

  3. Neil McAllister asks an intriguing question: "Is the Web still the Web?"

  4. Digg has rolled out a new recommendation system based on the wisdom of crowds.

  5. Some cool new stuff: Protocol Buffers from Google and the BOSS API from Yahoo (see my post from yesterday).

Thursday, July 10, 2008

Yahoo's new Search BOSS API

Yahoo has just released a new web search API called BOSS (Build your Own Search Service) which improves on their earlier API in several ways:
  1. No daily query limits.

  2. No restrictions on how the results are displayed, ordered, or mixed in with other proprietary results.

  3. Ability to make money showing paid results.

The BOSS API is REST-based. You can receive results in either JSON or XML format, and you can get 10-50 results back per query.
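Since a REST call is really just a URL, here is a minimal sketch of building one. The endpoint and the appid, count, and format parameters are the ones used in the full example further down; the class and method names are made up, and treating 10-50 as hard limits on count is only my reading of the sentence above, not something taken from Yahoo's documentation.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

// Hypothetical helper: builds a BOSS web search request URL.
class BossUrlSketch {
    static final String BASE = "http://boss.yahooapis.com/ysearch/web/v1/";

    static String buildUrl(String query, String apiKey, int count, String format) {
        // Assumption: count must stay within the 10-50 range mentioned above
        if (count < 10 || count > 50) {
            throw new IllegalArgumentException("count must be between 10 and 50");
        }
        // Results come back as either JSON or XML
        if (!format.equals("json") && !format.equals("xml")) {
            throw new IllegalArgumentException("format must be json or xml");
        }
        // Encode the query so spaces, colons, etc. are safe in a URL
        String encoded = URLEncoder.encode(query, StandardCharsets.UTF_8);
        return BASE + encoded + "?appid=" + apiKey
                + "&count=" + count + "&format=" + format;
    }

    public static void main(String[] args) {
        // YOUR_KEY is a placeholder, not a real API key
        System.out.println(buildUrl("questio verum", "YOUR_KEY", 50, "xml"));
    }
}
```

Switching between JSON and XML, or asking for more results, is then just a matter of changing the last two arguments.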

There is one item that appears to be missing without explanation: the cached URL of each search result. This URL is useful to the user when the result's live URL is not responding. The old Yahoo web search API did provide this, so I'm not sure why it was dropped in BOSS.

One thing that makes me a little nervous about the API from a researcher's perspective is the prohibition in their Terms of Service against analyzing their search results:
You will not, will not attempt, or will not permit or take actions designed to enable other third parties to: ... perform any analysis, reverse engineering or processing of the Web Search Results
Analyzing the Yahoo search results is exactly what I did in my paper Agreeing to Disagree: Search Engines and their Public Interfaces. Well, better to do and ask forgiveness than get permission up front. ;-)

So here's a simple example in Java using the new BOSS API to search for the title of my blog "questio verum", the index status of my blog's root page, and all the pages indexed for my blog. To make this example work for you, simply put your Yahoo API key in API_KEY.

Note that this example is very similar to the Google AJAX example in Java from last month.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;
import java.net.URLConnection;
import java.net.URLEncoder;
import org.json.JSONArray; // JSON library from http://www.json.org/java/
import org.json.JSONObject;

public class YahooQuery {

    // Yahoo API key
    private static final String API_KEY = "Your Key Here";

    public YahooQuery() {
        makeQuery("questio verum");
        makeQuery("url:http://frankmccown.blogspot.com/");
        makeQuery("site:frankmccown.blogspot.com");
    }

    private void makeQuery(String query) {

        System.out.println("\nQuerying for " + query);

        try {
            // Convert spaces to +, etc. to make a valid URL
            query = URLEncoder.encode(query, "UTF-8");

            // Give me back 10 results in JSON format
            URL url = new URL("http://boss.yahooapis.com/ysearch/web/v1/" + query +
                    "?appid=" + API_KEY + "&count=10&format=json");
            URLConnection connection = url.openConnection();

            // Read the whole response into one string, closing the reader when done
            String line;
            StringBuilder builder = new StringBuilder();
            try (BufferedReader reader = new BufferedReader(
                    new InputStreamReader(connection.getInputStream(), "UTF-8"))) {
                while ((line = reader.readLine()) != null) {
                    builder.append(line);
                }
            }

            // Everything of interest lives under the "ysearchresponse" object
            JSONObject json = new JSONObject(builder.toString());
            JSONObject response = json.getJSONObject("ysearchresponse");

            System.out.println();
            System.out.println("Total results = " + response.getString("deephits"));

            JSONArray ja = response.getJSONArray("resultset_web");

            // Print the title and URL of each result
            System.out.println("\nResults:");
            for (int i = 0; i < ja.length(); i++) {
                System.out.print((i + 1) + ". ");
                JSONObject j = ja.getJSONObject(i);
                System.out.println(j.getString("title"));
                System.out.println(j.getString("url"));
            }
        }
        catch (Exception e) {
            System.err.println("Something went wrong...");
            e.printStackTrace();
        }
    }

    public static void main(String[] args) {
        new YahooQuery();
    }
}


Running this program produces the following results:


Querying for questio verum

Total results = 13600

Results:
1. Questio Verum
http://frankmccown.blogspot.com/
2. WikiAnswers - What does questio verum mean
http://wiki.answers.com/Q/What_does_questio_verum_mean
3. Questio Verum: URL Canonicalization
http://frankmccown.blogspot.com/2006/04/url-canonicalization.html
4. Questio Verum: WIDM 2006
http://frankmccown.blogspot.com/2006/11/widm.html
5. Questio Verum: Fav5
http://frankmccown.blogspot.com/2007/09/fav5_29.html
6. Questio Verum: Fav5
http://frankmccown.blogspot.com/2007/12/fav5.html
7. Questio Verum: August 2006
http://frankmccown.blogspot.com/2006_08_01_archive.html
8. Amazon.com: Profile for Questio Verum
http://www.amazon.com/gp/pdp/profile/A2Q6CLLQPXG55A
9. Questio Verum: JCDL 2007 - day 2
http://frankmccown.blogspot.com/2007/06/jcdl-2007-day-2.html
10. Questio Verum: OA debate - Eysenbach and Harnad
http://frankmccown.blogspot.com/2006/05/oa-debate-eysenbach-and-harnad.html


Querying for url:http://frankmccown.blogspot.com/

Total results = 1

Results:
1. Questio Verum
http://frankmccown.blogspot.com/


Querying for site:frankmccown.blogspot.com

Total results = 4080

Results:
1. Questio Verum
http://frankmccown.blogspot.com/
2. Questio Verum: OA debate - Eysenbach and Harnad
http://frankmccown.blogspot.com/2006/05/oa-debate-eysenbach-and-harnad.html
3. Questio Verum: JCDL 2007 - day 2
http://frankmccown.blogspot.com/2007/06/jcdl-2007-day-2.html
4. Questio Verum: No singles here
http://frankmccown.blogspot.com/2007/08/no-single-here.html
5. Questio Verum: Pledge Week and Insults
http://frankmccown.blogspot.com/2007/10/pledge-week-and-insults.html
6. Questio Verum: WIDM 2006
http://frankmccown.blogspot.com/2006/11/widm.html
7. Questio Verum: Fav5
http://frankmccown.blogspot.com/2007/09/fav5_29.html
8. Questio Verum: August 2006
http://frankmccown.blogspot.com/2006_08_01_archive.html
9. Questio Verum: Fav5
http://frankmccown.blogspot.com/2007/06/fav5.html
10. Questio Verum: Fav5
http://frankmccown.blogspot.com/2007/12/fav5.html


Thanks, Martin, for the heads-up on this.

Update on 7/28/2008:

The missing cached URL feature is apparently coming soon.

Saturday, July 05, 2008

Are SnapShots annoying?

You may have noticed that I recently added SnapShots to my blog. When you hover over a link to an external website, a preview box pops up to show you what it links to. For some websites like Wikipedia, the preview box is formatted in a different manner (see below).


The purpose of SnapShots is to allow you to see if you really want to click on the external link and perhaps keep you on my website longer. They also display advertising and will share some of the revenue with me.

All I've really noticed though is how annoying it is to have a window pop up over the text I'm trying to read. Does anyone else agree with me?

Maybe I just need more time to get used to it. However, a paper presented at JCDL 2008 described an experiment with a PDF viewer in which users complained about a preview box popping up over their text... in general it looks like users do not appreciate such a feature.

Friday, July 04, 2008

Fav5

Happy Independence Day! My pick of the week's top 5 items of interest:

  1. The Deep Web is getting a little shallower: Google has recently improved their ability to crawl Flash content. It used to be that a website with a Flash interface was practically invisible to search engine crawlers, but now Google can find the links to other pages within the Flash program and even indexes the program's textual content.

  2. If you were late signing up for a free Yahoo email address, you might have ended up with frank1837abc@yahoo.com since all the good names were already taken. But now Yahoo has opened up two new domains under "ymail" and "rocketmail". Although Yahoo is still the email market leader with 266 million worldwide users, they want to stay ahead of Microsoft, which is a close second with 264 million.

  3. Are Google, Yahoo, and Microsoft censoring themselves more than they must in China? According to a report by Citizen Lab, the search engines are inconsistent: a site blocked by one engine is sometimes let through by another. In a test to see how many questionable sites were blocked, Google had censored 15.2% of the sites tested, Microsoft censored 15.7%, Yahoo 20.8%, and Baidu (the most popular Chinese search engine) 26.4%.

  4. A Harvard marketing professor takes on Chris Anderson's "Long Tail" theory by analyzing real data about online video rentals and song purchases. Her conclusion is that maybe we aren't as individualistic in our taste as Anderson suggested.

  5. In the spirit of July 4, read about how LANL scientists are making fireworks a little "greener".

Thursday, July 03, 2008

Restart applet in Firefox

I was doing some debugging on a Java applet this morning using Firefox 3, and I couldn't figure out how to restart my applet. I was rebuilding my applet and then hitting the refresh button on Firefox, and the old version of my applet was still being executed.

In IE you must press the Ctrl button while pressing refresh (or Ctrl-F5), but this was not working for Firefox.

I finally figured it out: Open the Java Console (available from the Tools menu) and press x which runs the "clear classloader cache" option. Then press refresh, and the newest version of your applet will load.

Wednesday, July 02, 2008

The Web is getting smaller

There has recently been a surge in tools designed to make the Web a little smaller, in a sense:
  1. TinyUrl - This tool has actually been around for a while, but only recently has it been catching on. It allows you to create a small URL that is more manageable when emailing friends, posting on Facebook, etc.
    http://www.example.com/my/very/long/url  -->  http://tinyurl.com/2

    Update: urlBorg and the spanking new bit.ly are similar services that offer even more functionality.

  2. LinkBunch - The same thing as TinyUrl except your link resolves to one or more URLs.
    http://www.example.com/link1
    http://www.example.com/link2 --> http://linkbun.ch/abc
    http://www.example.com/link3

  3. Tinydb - Stores a small amount of information that can be accessed from a tiny URL.
    http://tinydb.org/_write?name=mccown&topic=small+web
    --> Puts key/value pairs into the database

    http://tinydb.org/1ct?_f=xml --> Returns XML-formatted results:

    <xml_data>
    <url>None</url>
    <topic>small web</topic>
    <tinydb_id>1ct</tinydb_id>
    <name>mccown</name>
    <created>2008-07-02 14:57:36.657199</created>
    </xml_data>
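
The Tinydb write URL above is just a query string of key/value pairs, so it is easy to build from code. Here is a minimal sketch in Java; the class and method names are made up, and I'm assuming the service accepts any URL-encoded key/value pairs in the same form as the example.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper: turns key/value pairs into a Tinydb write URL.
class TinyDbUrlSketch {
    static String writeUrl(Map<String, String> pairs) {
        StringBuilder sb = new StringBuilder("http://tinydb.org/_write");
        char separator = '?'; // first pair follows '?', the rest follow '&'
        for (Map.Entry<String, String> e : pairs.entrySet()) {
            sb.append(separator)
              .append(URLEncoder.encode(e.getKey(), StandardCharsets.UTF_8))
              .append('=')
              .append(URLEncoder.encode(e.getValue(), StandardCharsets.UTF_8));
            separator = '&';
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // LinkedHashMap keeps the pairs in insertion order
        Map<String, String> pairs = new LinkedHashMap<>();
        pairs.put("name", "mccown");
        pairs.put("topic", "small web");
        // prints http://tinydb.org/_write?name=mccown&topic=small+web
        System.out.println(writeUrl(pairs));
    }
}
```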