Saturday, June 30, 2007

Save your Yahoo! Photos

I just received this email yesterday from Yahoo! Photos, telling me that my photos were going to disappear unless I took action. I couldn't help but wonder: what if my spam filter had eaten the message, or what if I wasn't using this email account anymore?

Considering that just 57% of individuals actually back up their personal data, how many people do you think are depending on Yahoo! to preserve their photos indefinitely? You can imagine the horrible feeling of logging in and realizing all those photos of your newborn child have ended up in the big bit bucket in the sky. Sigh...

Dear Yahoo! Photos user,

We will officially close Yahoo! Photos on Thursday, September 20, 2007, at 9 p.m. PDT. Until then, we are offering you the opportunity to move to another photo sharing service (Flickr, KODAK Gallery, Shutterfly, Snapfish, or Photobucket), download your original-resolution photos back to your computer, or buy an archive CD from our featured partner (for users of the New Yahoo! Photos only). All you need to do is tell us what to do with your photos before we close, after which any photos remaining on Yahoo! Photos will be deleted and no longer accessible.


Please give us your decision by Thursday, September 20, 2007, at 9 p.m. PDT. After that time, any photos remaining in Yahoo! Photos will be deleted. Click here to make your decision, or review a list of our frequently asked questions.

Friday, June 29, 2007


It's been a few weeks since I've posted a Fav5, but now that I'm through with traveling, I'm back at it.
  1. Internet music is about to die. Thursday marked a day of silence in protest of the absurd fees that must soon be paid by Internet music providers. Very sad indeed.

  2. The iPhone is finally here. Being the traditional late adopter that I am, I don't really give a hoot, but it's entertaining to see all the technophiles go crazy over it.

  3. An interesting article in the NY Times discusses social searching and the new searching paradigms that may challenge Google's lead in the Web search field. Matt Cutts says (unofficially) that Google is not opposed to leveraging the wisdom of the crowds in the future to stay in the lead.

  4. Researchers at Northeastern University have demonstrated that any Rubik's Cube configuration can be solved in 26 moves or fewer.

  5. I came across Citizendium this week. It's essentially a Wikipedia where only the educated and pre-approved can edit articles. I submitted my credentials and am waiting to hear back. In theory, Citizendium should be of higher quality than Wikipedia, but I think the barrier to entry is just too high to make it a serious contender. Besides, sometimes anonymous contributions can be useful. ;-)

Saturday, June 23, 2007

IWAW 2007

The International Web Archiving Workshop was held at JCDL this year (normally it's held at ECDL). Our research group at ODU was well represented with 3 presentations. I presented our new Warrick queueing system called Brass (slides are available here). This was the second time I attended, the first time being in Vienna in 2005.

There were some excellent presentations by a number of people including the Internet Archive, UNC, CDL, and others. One that I found really interesting was about Preserving 2008 Presidential Election Videos (work from VidArch) where they have been archiving YouTube videos using their TubeKit. Overall I really enjoyed the workshop and hope to attend again.

I'll be heading home tomorrow!

JCDL 2007 - days 3 & 4

The conference opened on Thursday with a keynote from John Willinsky (Univ. of British Columbia). Amazingly enough, John gave an entertaining talk without the aid of PowerPoint. John focused primarily on issues surrounding the open access (OA) movement, exhorting the digital library community to take the lead. At one point in the talk, John gave a shout-out to Marko's poster, using the metaphor of "fingerprints" that define a variety of ways scientists leave their marks on the scientific world.

I attended several sessions, but one of the more interesting presentations was on a paper called Measuring Conference Quality by Mining Program Committee Characteristics by a group from Penn State. The authors basically divided a large list of conferences into reputable and non-reputable by examining the quality of each conference's PC members. To verify that some of the conferences that were judged non-reputable were indeed non-reputable, they created 3 bogus papers using the SCIgen software which was used in the MIT prank. The 3 papers were submitted to 2 conferences, and 2 of the 3 papers were accepted! There were obviously no reviews returned for the papers.

I also enjoyed Cathy Marshall's presentation on The Gray Lady Gets a New Dress: A Field Study of the Times News Reader. Cathy's presentations are always filled with interesting photos and stories.

On Friday I gave the first talk in the User Studies and User Interfaces session on my paper Agreeing to Disagree: Search Engines and Their Public Interfaces. I got some really positive feedback, and was turned on to some related work by Jaime Teevan.

Right after my talk I caught Marko's, but I had to skip out on a few afternoon talks in order to complete my slides for tomorrow. I did attend the closing ceremony- looks like JCDL is going to be in Pittsburgh next year, and the conference dinner is going to be held on a boat! Should be another great conference.

Wednesday, June 20, 2007

JCDL 2007 - day 2

This morning Daniel Russell (Uber Tech Lead for Search Quality & User Happiness at Google) gave a fantastic keynote address entitled What Are They Thinking? Searching for the Mind of the Searcher. Daniel spoke about the great amount of research Google does to figure out how users use Google to search. Here are just a few points of interest from Daniel's talk:
  • For North America, the average user performs 9.4 searches a week.
  • 60% of users perform one Google query or fewer a day.
  • Only 10% of users go to the next page of results.
  • Half of all clicks to the Advanced Search page result in the user abandoning their search, most likely because they are overwhelmed by the Advanced Search interface.
  • Most users look at 3 results before clicking on a result.
  • Many users "teleport"... instead of using Google to make a search (like searching for a flight), they search Google for a website from which they can then make their desired search.
  • The more query terms used in a search, the more time the user will spend examining the search results.
  • Most people don't have a clue how Google works, and they don't really care. For example, many individuals think Google only examines the first couple of lines of text from a page.
After Daniel's talk and a break, I presented my first paper, Factors Affecting Website Reconstruction from the Web Infrastructure. (You can see my slides here.) Joan presented after me: Generating Best-Effort Preservation Metadata for Web Resources at Time of Dissemination. Both talks seemed to go really well, and we both got some positive feedback.

This evening I attended the Minute Madness where a representative from each of the demonstrations and posters got a minute to advertise their demo/poster. The poster reception was held immediately afterward, and Marko won top poster again (that punk won top poster last month at WWW07)!

I think the most interesting poster I saw though was Blogger Perceptions on Digital Preservation, a study conducted by UNC. The researchers performed a survey asking bloggers questions like, "What would happen if your blog suddenly disappeared?" Their work coincides with mine- I recently coauthored a paper where we asked individuals who actually lost their websites questions like "How did losing your website affect you?"

Tuesday, June 19, 2007

JCDL 2007 - day 1

I'm in Vancouver at JCDL 2007. I flew in yesterday with Martin, and the rest of the ODU posse flew in today. Between the 3 of us Ph.D. students, we'll be making 6 presentations at JCDL and IWAW this week.

This is my first trip to Vancouver, and I must say it's one of the most beautiful cities I've been to. Just wish I had Becky and the Bean here with me to enjoy the beauty. The photo on the right shows what the view looks like from my room at the Westin Bayshore hotel.

This morning I joined Kris Carpenter and Brad Tofel of the Internet Archive in a tutorial about researching the Internet Archive. I shared some of my research with Warrick and a recent study analyzing IA overlap with search engine caches. (You can see my slides here.)

I learned a lot about IA from Kris and Brad. Here are just a few items of note:
  • The IA currently has archived around 96 billion resources (html, pdf, images, etc.), or about 1.9 petabytes of data, 51% of which is unique.

  • Although the Archive's holdings are 6-12 months out-of-date, beginning July 1, the Archive will receive updates on the first of each month and will only be 2 months out-of-date.

  • The Archive is currently working on adding full-text search to its contents from 1996-2000 in a project called 20th Century Find. (No URL yet.)
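
A quick back-of-the-envelope check on those holdings figures, as a rough sketch only. It assumes decimal units (1 PB = 10**15 bytes) and that the 51% figure refers to the share of bytes that is unique, neither of which was spelled out in the tutorial:

```python
# Approximate figures quoted above.
resources = 96e9        # archived resources (html, pdf, images, etc.)
total_bytes = 1.9e15    # 1.9 petabytes; ASSUMES decimal units (1 PB = 10**15 bytes)
unique_fraction = 0.51  # ASSUMES the 51% applies to bytes, not to resource counts

# Average size of one archived resource, in kilobytes (1 KB = 1000 bytes).
avg_kb = total_bytes / resources / 1e3

# Amount of unique data, in petabytes.
unique_pb = total_bytes * unique_fraction / 1e15

print(f"Average size per resource: {avg_kb:.1f} KB")
print(f"Unique data: {unique_pb:.2f} PB")
```

Under those assumptions, the average resource works out to roughly 20 KB, and a little under 1 PB of the holdings is unique.
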
Tomorrow morning I'll be making a presentation about a Warrick experiment, and Joan will be presenting a short paper after me. Looks like I'll be burning the midnight oil getting prepared for tomorrow...

Thursday, June 14, 2007

Ethan 2.0

Wow- what a busy week. I’m getting ready for JCDL 2007 in Vancouver where I’ll be presenting two papers, one on a large website reconstruction experiment I performed several months ago and another on the search engine APIs. I’ll also be talking about Warrick at the Internet Archive tutorial and the web archiving workshop. Plus I’m doing my best to get Warrick’s queueing system on-line so I can demo it at the tutorial. Note to self: always go with PHP over Java servlets when using Apache!

My parents are coming into town tomorrow to see Ethan. They got to spend a little time with Ethan 1.0 the first week he was born, and now they’ll get to see Ethan 2.0, the smiley, attentive, and cute-as-can-be Ethan. Just in case you haven’t seen enough photos of Ethan yet, here are some more from the last month:


With Mom

Now you're in trouble!

Just kidding!

What, me scared?

Catching some TV



Happy boy

Play ball!

Double chin action

Passed out

Saturday, June 09, 2007


  1. Thanks to Mike Baur for sending me a link to this video of some really cool image visualization software being developed at Microsoft... I especially liked the aggregation feature which pulled images from Flickr together to form a complete image. If you thought the video was cool, you may also be interested in these other videos from the TED conference.

  2. Vote for your favorite 1980s arcade game. Although I genuinely loved most of the games on the list, I had to go with Tron. :-)

  3. There’s a good article from the New York Times about Google’s continued focus on search: Google Keeps Tweaking Its Search Engine. Matt Cutts also put in his 2 cents about Google’s search focus.

  4. Researchers are finding that bullying in the virtual world is becoming quite a problem, especially for newbies.

  5. Ask.com has received a makeover. Ask3D is their new approach to integrating various vertical search results together in a single results page. Nice.

Sunday, June 03, 2007

Vote for Ethan!

Ethan's photo has been entered into the Funny Photo Contest for the month of June. He's Photo B2, the baby with the nerdy glasses on. He told me he'd really appreciate your vote. :-)

Update on 6/8/2007

Looks like Ethan has a strong lead. Thanks to all those who have voted!

Update on 6/19/2007

Although Ethan won the popular vote, the panel of judges thought the second place photo was actually funnier (maybe they should borrow Ethan's glasses?). That's alright... Ethan is learning at an early age that winning isn't everything and losing produces character. ;-)

Saturday, June 02, 2007


  1. On Wednesday, Google released Google Gears (beta), a browser plug-in that allows you to run web applications whether you are connected to the Internet or not. The small plug-in contains three components: a local web server, the open-source database SQLite, and browser extensions for running JavaScript code in parallel. More about it here.

  2. On April 27, Estonia relocated a Soviet-era memorial, the Bronze Soldier, which honored an unknown Russian soldier who died fighting the Nazis in World War II. Missionary friends of mine who live in Tallinn, the capital, emailed me about the ensuing riots and unrest the incident caused. (I’ve been to Estonia on three occasions, teaching Bible classes and working as a camp counselor.) Some background: Estonia was once begrudgingly part of the Soviet Union, but since gaining their freedom in 1991, there has been a lot of tension between the resident Russian (30%) and Estonian (70%) populations. Relocating the monument was essentially seen as a slap in the face to the Russians.

    What my friends didn’t tell me was that there was also a large-scale distributed denial-of-service (DDoS) attack on several Estonian websites, including several government sites. At the peak of the attack on May 9, fifty-eight websites were shut down at once. An interview with Jose Nazario, a security researcher, sheds some more light on the DDoS attack.

  3. Congratulations to ACU (a sister school of Harding’s and the alma mater of three of my family members) for being mentioned in a Wall Street Journal article on social networking: At Some Schools, Facebook Evolves From Time Waster to Academic Study
    For the past year, Abilene Christian University in Abilene, Texas has funded two research projects that use social-networking site Facebook to examine student retention trends, in part because the school noticed its students were already spending so much time on the site, said K.B. Massingill, executive director of the division that funded the research. A group of undergraduate students also studied faith-related conversations in Facebook and MySpace and presented their findings to what Mr. Massingill called an unusually well-attended faculty session. "We filled up the room," he said.
  4. All good things must come to an end, including Battlestar Galactica. There are very few TV shows I like, so it’s sad this one will be ending after next season. I’ve really enjoyed the show for the most part. I think they really messed up recasting Starbuck as a woman, especially one that is so annoyingly self-destructive, but the idea of a robot race hunting down their creators lends itself to a ton of creative plots and twists. My hope for the final season: focus more on the sci-fi aspects and less on the nutty Apollo-Starbuck relationship.

  5. I’m planning on teaching a course on search engine development and web mining next spring when I return to Harding, so I spent some time this week examining a book that is in progress called Introduction to Information Retrieval. One of the book’s authors, Prabhakar Raghavan (Yahoo! Research), was one of the main speakers at the WWW2007 conference a few weeks ago.

    Inspired by some of the concepts in chapters 19 and 20, I created a few new Wikipedia articles: adversarial information retrieval, web search queries, and focused crawlers. It’s kind of fun creating new Wikipedia articles- it’s a little like putting a flag in the ground and claiming ownership.
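
A footnote to item 1: the offline idea behind Google Gears (keep the user's work in a local SQLite database, then push it to the server once a connection is available) can be sketched in a few lines. This is only an illustration in Python using the standard sqlite3 module. It is not the Gears JavaScript API, and the table name `pending` and the `send` callback are made up for the example:

```python
import sqlite3


def open_store(path=":memory:"):
    """Open (or create) the local database that holds unsent items."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS pending ("
        "id INTEGER PRIMARY KEY, payload TEXT NOT NULL)"
    )
    return conn


def save_locally(conn, payload):
    """Queue an item while offline; the `with` block commits the insert."""
    with conn:
        conn.execute("INSERT INTO pending (payload) VALUES (?)", (payload,))


def sync(conn, send):
    """Push queued items through `send` (a hypothetical upload callback)
    in insertion order, deleting each one after it has been sent."""
    rows = conn.execute("SELECT id, payload FROM pending ORDER BY id").fetchall()
    sent = 0
    for row_id, payload in rows:
        send(payload)
        with conn:
            conn.execute("DELETE FROM pending WHERE id = ?", (row_id,))
        sent += 1
    return sent


# Example: two drafts written "offline", then uploaded in order.
store = open_store()
save_locally(store, "draft one")
save_locally(store, "draft two")
uploaded = []
print(sync(store, uploaded.append), "items synced:", uploaded)
```

If `send` raises an exception partway through, the items not yet sent stay in the local table, so a later call to `sync` can try them again.
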

Friday, June 01, 2007

Spelling bees and other ways to waste your child's time

Warning: Rant ahead.

Last night I caught a few minutes of the National Spelling Bee contest on TV. Kid after kid tried to spell the most ridiculous words, stalling as they repeatedly asked for the definition, alternate pronunciations, etc. After the contest had ended, thirteen-year-old Evan O'Dorney, the contest winner, had this to say:
"My favorite things to do were math and music, and with the math I really like the way the numbers fit together, and with the music I like to let out ideas by composing notes—and the spelling is just a bunch of memorization."
This wise kid seems to understand the futility of spelling bees- they are nothing more than rote memorization of the long tail (see figure below). What I mean is, spelling bees don’t test your intelligence, only your ability to memorize words that no one, even the memorizer, will ever use in daily conversation or even in their writing in graduate school. (OK- a word or two may appear on your SAT.) Honoring a kid as the best speller in America is like honoring a kid for being able to memorize the most digits of pi.

It’s sad we English speakers have a language that uses so many words that aren’t phonetically spelled and words brought in from every other language on the planet. Consider that there are about 100,000 words in the French language, but 10 times as many in the English language! Countless hours are wasted by children (and those learning English as a second language) learning trivial spellings and rules like “i before e except after c.” Wouldn’t their time be better spent learning to play an instrument, playing a sport, or even just playing with their friends? Or how about spending that time serving their church or community?

Don’t hear me wrong- I acknowledge there are some benefits to spelling bees in general, but our focus should be on the blue circle, not the red. I'd much rather my son put down the Webster and pick up his tennis racket any day. ;-)
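
For anyone who wants to put rough numbers on the long tail, here is a small sketch. It assumes word usage follows an idealized Zipf's law (the word of rank r is used in proportion to 1/r), which real English only approximates, and it takes a vocabulary of 1,000,000 words from the rough "10 times 100,000" figure above. The function names are just for illustration:

```python
def harmonic(n):
    """Return 1 + 1/2 + ... + 1/n, the total weight of the top n ranks."""
    return sum(1.0 / rank for rank in range(1, n + 1))


def zipf_coverage(top_n, vocabulary_size):
    """Fraction of all word usage covered by the top_n most common words,
    ASSUMING an idealized Zipf distribution over vocabulary_size words."""
    return harmonic(top_n) / harmonic(vocabulary_size)


VOCABULARY = 1_000_000  # assumption: the rough English word count cited above

for top_n in (1_000, 10_000, 100_000):
    share = zipf_coverage(top_n, VOCABULARY)
    print(f"Top {top_n:>7,} words cover about {share:.0%} of usage")
```

Under these assumptions the 10,000 most common words already account for roughly two-thirds of everything written or said, and the remaining 990,000 words share the rest. That remainder is the tail spelling-bee contestants spend their time memorizing.
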

On a related note: Kudos to Mozilla Firefox 2.0. They added spell-check functionality into the browser, so now when I'm editing my blog or Wikipedia in a textbox, I can easily correct my spelling mistakes just like I would using MS Word! (Ha! No need to memorize the spelling of seldom used words for me!)