Thursday, November 23, 2006

Today is Thanksgiving

Today is Thanksgiving, and I certainly have a lot to be thankful for. Sometimes it’s good to actually list your blessings, especially when you feel like things just aren’t going your way. So here’s a list of a handful of things I’m thankful for in no particular order:
  • My stunningly beautiful, intelligent, and hilarious wife
  • My first child to be born in April
  • My two favorite teams playing today: the Cowboys and the Broncos
  • My parents and brother who will be at the Cowboys game, and my sister who is soaking up the sun in Mexico today
  • Andy and Stephanie Walz who have invited us over today to eat turkey!
  • Finishing up a rather time-consuming paper
  • A fantastic advisor with the unique combination of basketball and OAI-PMH skillz
  • Funding for my Ph.D.
  • A church that Becky and I feel particularly blessed to be a part of
  • The class I’m teaching at church on Christian apologetics that seems to be going really well
  • My friends, the Hornes, who are moving back to Virginia Beach
  • Our beat-up Geo Metro that still runs
  • Harding students that are a light to the world
  • Hilarious Peyton Manning commercials
  • The Office
  • BBQ chicken pizza at California Pizza Kitchen

Wednesday, November 22, 2006

Search engine API study

I submitted my paper comparing search engine API results with WUI results to the WWW’07 conference on Monday. If you are interested in reading it, feel free to contact me. I wrote a little about it a few weeks ago.

Today I posted the hundreds of graphs that we couldn’t fit into the paper on my website. The graphs were created with R scripts which were a monster to write. I’ll probably be posting those on my website soon for anyone who is interested.

Monday, November 20, 2006

Harding students at HUFS

The other night I was reading the Harding alumni magazine and came across the article The Tour de France and Switzerland to Boot. The article is about the experiences of the first group of Harding students to study at the new international program called HUFS (Harding University in France/Switzerland). Robert McCready, the author and HU professor who chaperoned the group, gave a really encouraging report about the students’ conduct:
As we approached Toulouse, our quiet bus driver asked the guide for the microphone. He proceeded to tell us that in 37 years of chauffeuring, he had never met as fine a group and was impressed by the students’ respect for him and for one another, their wiping their feet before getting on the bus, and their joy in singing devotional songs. He will retire in two years and expressed his wish to do so with Harding students as his last group.
Not only that, but as a result of numerous positive encounters between the students and local Christians in Toulouse, a couple of women decided to enroll at Harding. I’m really proud of those students and the way they were a light to the world.

The magazine also noted that Ward Sandlin, a 1991 Harding alum, was awarded the Air Medal by the Coast Guard for his performance during Hurricane Katrina where he saved 161 lives immediately after the disaster. Congrats!

Saturday, November 18, 2006

We're having a boy!

Yesterday morning Becky and I found out that our little bean was a boy! Becky’s mother came into town on Wednesday evening so she could be with us for the ultrasound. It was a very emotional experience seeing my boy for the first time. When he moved his arms around I felt like this kid was the most incredible of all of God’s creations. There’s just something indescribably awesome about being a father. Next up: picking a name. :)

Thursday, November 16, 2006

Google Archive?

In September Google apparently registered multiple domain names implying that a Google Archive (or what I call Internet Archive Part Deux) is in the works. Garett Rogers of ZDNet was the first to break the story. It wouldn't be surprising if Google decided to quit throwing away their cached copies of the Web and allow users to search the Web through time. This functionality is something the folks at the Internet Archive have been working on for quite some time. Personally I'm glad they don't provide an archive search: I would be very embarrassed if people could see the first website I created back in 1997. Perhaps with Google's deep pockets we'll see a searchable (and more up-to-date) Internet archive before the year is out.

Saturday, November 11, 2006

WIDM 2006

I presented my paper Lazy Preservation: Reconstructing Websites by Crawling the Crawlers today at the Workshop on Web Information and Data Management (WIDM). I was also the session chair for the Web Organization session. Joan was able to fight through her cough and present her mod_oai paper as well.

This was a competitive workshop (only 11 of 51 submitted papers were accepted), but I was a little disappointed with the small number of attendees (only a dozen or so). The presentations, though, were quite good. My favorite was “Coarse-grained Classification of Web Sites by Their Structural Properties,” where the authors looked at website characteristics like the number of slashes in a URL and average URL length to determine whether a website was a blog, a personal site, a commercial site, etc. Who would have thought you could guess which category a website falls into just by looking at URL properties?
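Out of curiosity, here's roughly the kind of thing they mean by structural properties. This is just a toy Python sketch I threw together (not the authors' features or code), assuming you already have a handful of URLs crawled from a site:

```python
# Toy sketch (not the authors' code): a couple of structural features of a
# site's URLs, e.g., average URL length and average path depth (slash count).
from urllib.parse import urlparse

def url_features(urls):
    lengths = [len(u) for u in urls]
    depths = [urlparse(u).path.count("/") for u in urls]
    return {
        "avg_url_length": sum(lengths) / len(urls),
        "avg_path_depth": sum(depths) / len(urls),
    }

print(url_features(["http://example.com/2006/11/widm-2006.html",
                    "http://example.com/2006/11/cikm-2006.html"]))
```

Feed features like these into your favorite classifier and apparently you can make a pretty good guess about what kind of site you're looking at.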

I also really enjoyed the keynote speaker, Sihem Amer-Yahia from Yahoo Research, who talked about a project at Yahoo where they are trying to personalize web search based on the community interests of the searcher.

Next year WIDM is going to be in Portugal along with CIKM. Hmm…

CIKM days 2 and 3

Wednesday

The keynote speaker this morning was Gary Flake of Microsoft Labs, who entitled his talk How I Learned to Stop Worrying and Love the Imminent Internet Singularity. (I guess he really liked the title of my blog.) Some of the topics included power laws, long tails, network effects, and the Innovator’s Dilemma. Essentially the talk was about how human knowledge, the ability to analyze the online world, and the ability to create digital artifacts are all converging to create an Internet singularity which is going to take over the world (or at least seriously change the way we do things). During the Q&A session after the talk, Gary briefly spoke about the “parasitic relationship” between publishers and academia and how we should throw the bums out and publish only in online journals. Stevan Harnad would have been proud.

I sat in on several presentations that mostly focused on database enhancements (not really my thing). One of the few papers that I did find interesting, though, was Xiaoguang Qi’s paper entitled Knowing a Web Page by the Company that it Keeps. Xiaoguang presented an interesting way to learn more about what a web page is about by examining the page’s parents, siblings, and children. They also used the Yahoo web search API to discover parents.

The banquet Wednesday evening was ok. I didn’t know anyone, but I had a decent chat with a fellow from Jordan who worked in the database area. He told me he was somewhat disappointed with the conference and suggested VLDB was much more interesting. I guess I was a little disappointed too since the focus of most of the research was only peripherally related to my own interests, but I probably should have expected that coming into the conference.

Thursday

The keynote speaker this morning was Joseph Kielman from the Department of Homeland Security. Basically DHS would like to model the behavior of the entire world and have the computer say, “Hey, I think Joe Mohammad is about to go jihad on us.” Kielman gave some indication that the bureaucracy at DHS made getting things done very difficult.

The one presentation I really liked today was by a group from Yahoo and Stanford and was entitled Estimating Corpus Size via Queries. They showed a method that could be used to answer questions like: How many pages in Chinese from US-registered servers are indexed by Yahoo? Their method requires several assumptions to hold, such as each query producing fewer than 1,000 results, since search engines do not give access to more than 1,000 results.

I skipped out on the last session of the conference so I could catch a matinee showing of The Prestige. It’s a movie about two magicians who are obsessed with discovering each other’s secrets (excellent movie, by the way). It got me thinking… if CIKM would introduce a couple of magic tricks between presentations, maybe get the session chair to make boring speakers suddenly disappear in a flash of smoke, this might turn into one of the “can’t miss” conferences of the year. As it currently stands, I have to admit that librarians know how to have more fun (see JCDL).

Tuesday, November 07, 2006

CIKM 2006 in Arlington

This week I’m attending CIKM 2006 in Arlington, Virginia, just outside DC. I’ll be presenting a paper on lazy preservation on Friday at WIDM. In the meantime I can just sit back and enjoy the conference and the town. Unfortunately, Becky couldn't come up with me, but at least I got to meet my sister last night for dinner.

This morning Hector Garcia-Molina gave a talk on the research they are doing at Stanford on pair-wise entity resolution. Basically he talked about what entity resolution is and how they are taking an approach that may or may not end up being better than the approaches currently being presented. Hector did a great job of speaking clearly and engaging the audience. It was one of the few talks today during which I didn’t find myself wanting to scream, “Drop the laser pointer!” and “Quit talking to the projection screen!”

I attended the “Mining Reviews and Blogs” session this afternoon which had a number of interesting papers, and the poster presentations/reception this evening. I haven’t seen anything that’s very related to my work, but it’s nice to be exposed to some cutting-edge research in information retrieval.

Friday, November 03, 2006

Do the search engine APIs lie?

OK, the title of this post is a little strong. Search engine APIs don't intend to deceive anyone, but they typically do not give the same results that the rest of the world sees when using the public web interfaces.

Every day for the past 5 months I’ve been sending thousands of queries to Google, MSN, and Yahoo on the Internets, using the web user interface (WUI), the little box that everyone types their queries into, and using the web search APIs that each of the search engines makes available for free to the public. There have been a lot of questions as to whether the APIs give the same results as the WUIs, and I’m going to be the first to provide a strong quantitative analysis showing which APIs are the most synchronized with their WUIs.

In order to process the incredible amount of data I’ve been collecting, I’ve developed an elaborate set of Perl scripts that transform the raw collected data into tables that are then imported into MySQL. The scripts take several days to finish. I’ve also developed numerous R scripts that pull data from MySQL and plot it as an array of graphs.
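To give a flavor of what the processing step does, here's a minimal Python/SQLite sketch of the same idea (the real pipeline is Perl and MySQL, and the tab-separated file layout below is made up): each raw result becomes a row keyed by engine, interface (WUI or API), query, day, and rank.

```python
# Minimal sketch of the processing idea (the real pipeline is Perl + MySQL;
# the tab-separated input layout here is made up).
import csv
import sqlite3

def load_raw_results(db_path, raw_file):
    conn = sqlite3.connect(db_path)
    conn.execute("""CREATE TABLE IF NOT EXISTS results (
                        engine TEXT, interface TEXT, query TEXT,
                        day INTEGER, rank INTEGER, url TEXT)""")
    with open(raw_file, newline="") as f:
        rows = [(r["engine"], r["interface"], r["query"],
                 int(r["day"]), int(r["rank"]), r["url"])
                for r in csv.DictReader(f, delimiter="\t")]
    conn.executemany("INSERT INTO results VALUES (?, ?, ?, ?, ?, ?)", rows)
    conn.commit()
    conn.close()
```

Once everything is in a table like that, computing overlaps and totals per engine, query, and day is just SQL, and the plotting scripts only have to worry about the graphs.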

I’m currently working on writing up my findings for a conference. If you’d like a pre-print of my paper, I’d be happy to share it with you. Here’s a little teaser.


The graph above shows the daily Kendall tau distance between the top 100 search results obtained from Google’s WUI and API for the term carmen electra. The green line shows how the WUI results change every day, and the blue line shows how the API results change every day. If the results are exactly the same (including their ranking), the distance is 1, but if the results have nothing in common, the distance is 0. The red line shows the distance between the WUI and API results each day. You’ll notice that for the most part the WUI and API values don’t move in a synchronized way, and the WUI and API results are very dissimilar. Other popular search terms like stacy keibler, jessica simpson, and lindsay lohan exhibited similar patterns (although the WUI vs. API distance was closer to about 0.8). When we examine search results for terms like nfl football or computational complexity, the WUI and API results are very synchronized, and the WUI vs. API distance is closer to 0.9. Maybe they purposefully discriminate against air-heads?
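If you're wondering how a Kendall tau measure can compare two top-100 lists that don't even contain the same URLs, here's a toy Python sketch in the spirit of Fagin et al.'s top-k version of Kendall tau. It isn't the exact formulation from my paper (and my real analysis is done with Perl and R), but it's scaled the same way as the graph above: 1 means identical results in identical order, and 0 means the two lists have nothing in common.

```python
# Toy sketch in the spirit of Fagin et al.'s top-k Kendall tau (not the exact
# formulation from the paper): 1.0 = identical lists in identical order,
# 0.0 = no results in common.
from itertools import combinations

def kendall_tau_similarity(list_a, list_b):
    rank_a = {item: i for i, item in enumerate(list_a)}
    rank_b = {item: i for i, item in enumerate(list_b)}
    penalty = 0
    for x, y in combinations(set(rank_a) | set(rank_b), 2):
        in_a = (x in rank_a) + (y in rank_a)  # how many of the pair list_a has
        in_b = (x in rank_b) + (y in rank_b)
        if in_a == 2 and in_b == 2:
            # both lists rank both results: penalize a disagreement in order
            if (rank_a[x] - rank_a[y]) * (rank_b[x] - rank_b[y]) < 0:
                penalty += 1
        elif in_a == 2 and in_b == 1:
            # list_b has only one of the pair, so it implicitly outranks the
            # missing one; penalize if list_a says otherwise
            present = x if x in rank_b else y
            missing = y if present == x else x
            if rank_a[missing] < rank_a[present]:
                penalty += 1
        elif in_b == 2 and in_a == 1:
            present = x if x in rank_a else y
            missing = y if present == x else x
            if rank_b[missing] < rank_b[present]:
                penalty += 1
        elif in_a == 1 and in_b == 1:
            # each list has a result the other lacks: always a conflict
            penalty += 1
        # pairs appearing in only one of the two lists cost nothing here
    return 1.0 - penalty / (len(list_a) * len(list_b))

print(kendall_tau_similarity(["a", "b", "c"], ["a", "b", "c"]))  # 1.0
print(kendall_tau_similarity(["a", "b", "c"], ["x", "y", "z"]))  # 0.0
```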



This graph shows the decay of the search results for the term subroutine for all three search engines. To compute decay, I compared the results obtained on each day with the results obtained on every following day using a normalized overlap measure. In other words, I computed the percentage of results that were shared between the results obtained on day 1 and those obtained on days 2, 3, 4, etc. Yahoo shows a strong decay line with a half-life of 30 days (by day 30, half of the original results were gone). Google and MSN show decay lines that actually un-decay (if there is such a word): after several months of the results becoming more and more different, they start to return to their starting point.
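For anyone who wants to play along at home, the overlap calculation is dead simple. Here's a toy Python version (I'm assuming a Jaccard-style normalization here; the exact normalization in the paper may differ):

```python
# Toy sketch (not the exact measure from the paper): fraction of results two
# days have in common, normalized by the size of their union.
def overlap(results_x, results_y):
    x, y = set(results_x), set(results_y)
    return len(x & y) / len(x | y) if (x or y) else 1.0

def decay_curve(daily_results):
    # daily_results: one list of result URLs per day, day 1 first
    baseline = daily_results[0]
    return [overlap(baseline, day) for day in daily_results]
```

Plot one of those curves per engine and you get the decay lines above; a half-life of 30 just means the curve crosses 0.5 around day 30.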



One last graph: how often do the WUI and the API agree when asked for the total number of results for a search term? For all three search engines, the answer is almost never! But if you look at the graph above, you’ll see that the MSN totals used to agree almost every time until day 58 (late July), when they changed something internally. Now about half of the time their WUI gives a larger number, and the other half of the time the API gives a larger number. By the way, the gap around day 107 was due to MSN invalidating our API license key; it took me 17 days to replace it. Moral of the story: keep a close eye on your experiments!
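The bookkeeping behind that last graph is pretty trivial, by the way. Something like this toy Python snippet is all it takes to tally who reports the bigger number (my counts actually come out of MySQL and R, and the (wui, api) pairs below are made up):

```python
# Toy tally of WUI vs. API hit-count estimates; each pair is the total number
# of results reported for one query on one day (made-up numbers below).
from collections import Counter

def tally_hit_counts(pairs):
    tally = Counter()
    for wui_total, api_total in pairs:
        if wui_total == api_total:
            tally["agree"] += 1
        elif wui_total > api_total:
            tally["wui larger"] += 1
        else:
            tally["api larger"] += 1
    return tally

print(tally_hit_counts([(120000, 95000), (88000, 88000), (3000, 4100)]))
# e.g., Counter({'wui larger': 1, 'agree': 1, 'api larger': 1})
```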