Questio Verum: May 2007

Saturday, May 26, 2007

First Digital Preservation Challenge

DigitalPreservationEurope (DPE) has just announced its first international Digital Preservation Challenge:

The challenge invites participants to overcome the barriers hindering access to six digital objects. Each object is accompanied by a highly abstracted scenario based on real-life situations. These scenarios are intended to make the challenge more accessible to participants from all backgrounds while not trivialising the serious nature of the digital preservation challenges facing society.

The competition is open to undergraduate and graduate students in any major, and the winner will be announced in September at ECDL.

Friday, May 25, 2007

Fav5

I’m back from Archiving 2007. Yesterday Czeslaw Jan Grycz (Internet Archive) gave an entertaining keynote address. He showed this video about copyright and fair use, which has made my favorite 5 for the week: A Fair(y) Use Tale by Eric Faden. My guess is that, since the entire video is made from Disney clips, it probably won't hold up in court as fair use (Disney couldn’t possibly leave this alone). By the way, this video reminded me a lot of the 100 movies video I saw just the other day.

How do you get people to help you digitize books with words that a character recognition scanner can’t recognize? Put the unrecognized words into a captcha! Brilliant.

The IETF gave preliminary approval to a promising antispam/antiphishing technology called DomainKeys Identified Mail. It uses cryptographic digital signatures so that the receiver of an email message can verify the domain that sent it.

According to Nielsen//NetRatings, Google accounted for 55% of all searches in April. Believe it or not though, other search engines besides Google actually do exist. In fact, someone recently took the time to rate 300 of them. (I would have added one more criterion though: Does the search engine provide a link to the cached resource?)

Colorado Christian University was recently told it was “too Christian” for its students to receive equal access to taxpayer money. I added a couple of comments to the story although I normally don’t. I think CCU should just leave it alone... if you receive mammon from Caesar, you may find yourself beholden to his wishes. It's also hard to imagine Jesus wanting his followers to sue for tax money.

Wednesday, May 23, 2007

Michael Nelson and I are at Archiving 2007 in Arlington, Virginia. This is my first time attending the conference, so I’m learning a lot about archiving in general. Although I again had to leave Becky and the Bean at home, I got to see my sister last night since she lives in D.C.

The keynote speaker yesterday, Daniel Rosen from Warner Bros., gave a really interesting presentation on the incredible amount of data the movie industry is producing and needs to preserve. Approximately 780,000 hours of footage has been produced since 1890, and now that most footage is shot digitally, the amount of footage is growing exponentially. Not wanting to “be the guy who was responsible for losing the Wizard of Oz,” Rosen suggested that the responsibility for preserving the movie industry’s output was immense. From what I understood, Rosen suggested the best way to preserve the output was to use an analog medium rather than a digital one.

I was the final speaker on Wednesday. I presented my paper Characterization of Search Engine Caches and got some really positive feedback about lazy preservation. (My slides are here.) I also had an individual in the audience suggest it was not preservation at all; I wish I had pointed out that the Web Infrastructure supports migration and refreshing, but it didn’t occur to me when I was put on the spot. Those archivists can get really testy when you deviate from the traditional model of preservation. ;)

I’m looking forward to Cathy Marshall’s presentation on Thursday, Evaluating Personal Archiving Strategies for Internet-based Information.

Saturday, May 19, 2007

Fav5

My favorite 5 for the week:

This week Google rolled out its “universal search,” a revamp of its traditional web search results that now incorporates images, videos, news, etc. After playing around with it some, I’d say it’s an improvement, but not a substantial one. Microsoft has also launched a test search engine, Imagine Live Search, which attempts to do the same thing.

Congratulations to Natalie Portman, the number one searched-for Star Wars term at Lycos.

pipl - a new deep web search engine that focuses on finding information about people. The first result for Frank McCown? A former criminal.

Scientists and ethicists are starting to give much more attention to rules governing robot-human interaction. The idea is if we can set ethical rules for how humans should treat robots and vice versa, we’ll avoid a world like the Terminator, Matrix, et al.

This week I gave Stanford exams to elementary school kids at the Ivy League Academy. I’ve done this for three years now, and I really enjoy it... they are some of the best behaved kids I’ve seen. In giving the tests, I’ve come to learn more about what children think:
- When asked if either the moon or the sun is a star, most first graders will choose the moon.
- If a first grader were to come home and find their front door open when it shouldn’t be, most of them would either stand outside their door or go inside to investigate instead of seeking help from a neighbor.
- Most second graders do not know what an opinion is.
- And finally, if given the choice of doing an extra day of testing or doing homework that evening, most students will choose testing.

Tuesday, May 15, 2007

What did Ethan do while Dad was away?

Ethan missed his dad quite a bit this past week, even though Dad dressed him up like Urkel the day before he left town. At least he finally found some time to sleep.


"I bet Dad is learning how the Semantic Web is nothing but a pipe dream..."

Monday, May 14, 2007

WWW2007 in Banff, Alberta

The WWW2007 conference is one of the best I’ve attended. The speakers were great, the papers were top-notch, the food was excellent, and you couldn’t beat the location. It was also one of the most expensive conferences I’ve been to, so I made myself attend every session I could.

Tuesday. I split time between the Adversarial Information Retrieval on the Web (AIRWeb) and the Query Log Analysis workshops.

The clearest message I left AIRWeb with was that web spam and splogs can be detected in a number of ways which will likely change over time, but the one thing that won’t change is that they will always be motivated by financial gain. In other words, follow the money. The Query Log Analysis workshop had an interesting panel talking about issues surrounding the use of search engine query logs by academic researchers and the public. I especially liked Bernard Jansen’s proposal of archiving search logs for posterity just as we archive the Web.

Wednesday. Tim Berners-Lee opened the conference with a talk about Web Science, a new initiative by MIT and the University of Southampton. An interesting comment that Tim made was that spam will not make its way into the Semantic Web. We’ll see about that...

The most impressive presentation on Wednesday was on CSurf. The presenter used lots of examples which helped clarify what they had done. One thing that I’ve noticed at every conference I’ve been to is that some of the best researchers are not necessarily the best communicators, so it was nice to actually see an effective presentation along with a good paper.

Another paper, The Discoverability of the Web, was also interesting, but I wondered if widespread adoption of Sitemaps and mod_oai would make much of their work irrelevant.

Thursday. Prabhakar Raghavan of Yahoo gave an interesting talk about emerging technologies to fuel Web N.0. Raghavan pointed out the ESP Game, a creative game developed at Carnegie Mellon that encourages people to label images in an entertaining way.

Later I sat in on a talk by Bradley Horowitz, also from Yahoo, who discussed some initiatives at Yahoo to change how people search the Web. Essentially they want to make everyone creators, contributors, and consumers instead of the current model where 1% creates, 10% contributes, and 100% consume.

In the afternoon I attended Uri Schonfeld’s presentation on DUST, an extension of their work from last year when they had a poster. (Thanks for the citation.) I also sat in on the DevTrack and heard an interesting talk by Marc Hadley (Sun) about WADL, a way of creating Java stubs automatically for web applications.

Fun dinner Thursday night:

Friday. Today's plenary speaker was Bill Buxton (Microsoft Research). Interesting guy with some interesting predictions: MySpace is just a fad, and pixels will be everywhere and cost nothing in 5 years. The crowd seemed to like Bill’s talk a lot.

I attended the DevTrack for two sessions. The biggest highlight was Yahoo Pipes; the Semantic Web Browser presentation by Tim Berners-Lee was so-so.

I spent the breaks manning my poster and talking to interested passersby. Quite a few showed interest, and I even had a few Microsoft and Yahoo guys stop by. The Google guys were unfortunately nowhere to be found. Note to self: next time I have a poster, bring business cards and have some copies of my papers available like Marko did. Also, I need to get a better spot; I’m not sure who Johan and Marko paid off. ;)

Saturday. I was a little burned out by Saturday, but I still managed to attend two sessions and the plenary speaker, Dick Hardt (Sxip Identity). I don’t think I’ve ever seen a talk quite like Dick’s: he probably averaged 5 slides per sentence, and it was orchestrated perfectly.

My favorite talk of the day was by Luca de Alfaro: A Content-Driven Reputation System for the Wikipedia. Basically they propose an elaborate system of measuring the input from each Wikipedia author which would allow users to see which authors are the most reputable. Luca told me after the talk that Wikipedia was showing interest in his system.

At the closing ceremony, Johan and Marko were awarded best poster for Friday, even though Johan can't seem to spell "Scholarly". I accepted the award on their behalf and had a really nice dinner with the award money. :)

Miscellaneous:

Most Interesting Fact: The average WWW'07 paper was submitted 20 times.

Most overused acronym: JSON

Paper I’ll Probably Read Next: Detecting Near-Duplicates for Web Crawling
Poster I’m Most Likely to Cite Soon: A Large-Scale Study of Robots.txt

Paper I’d Most Like to Re-Title: Effort Estimation: How Valuable is it for a Web company to Use a Cross-company Data Set Compared to Using Its Own Single-company Data Set?
(How about “The Web Company’s Use of Data Sets in Effort Estimation” instead?)

Wednesday, May 09, 2007

I'm at WWW2007

It's currently my second day at WWW. I left my laptop at home, so I don't have much time to post about the conference just yet. I'll probably try to write up a summary when I get home. I'm really enjoying Banff so far, and the conference is top-notch. Tim Berners-Lee gave the opening talk this morning, which was really cool. More later.

Friday, May 04, 2007

First Fav5

I'm starting a new weekly installment on my blog entitled "Fav5" where I list 5 items of interest for the week. I'm not putting any restrictions on the list, so anything may appear here. You may notice this week's list is a little heavy on the search engine side.

I recently found out that Ask, MSN, and Yahoo are now supporting Google's Sitemap Protocol. The announcement was made back in Nov. 2006 although Ask apparently joined a little later, and MSN still hasn't implemented it yet. The "official" website for sitemaps is now http://www.sitemaps.org/.

Just last month an autodiscovery method was announced which makes it easy to notify all the search engines of your sitemap by naming it in your robots.txt file. Alternatively you can use the search engine's interface or ping a search engine via an HTTP request to let them know the sitemap file location.
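For the curious, here is a rough Python sketch of the first and third options: naming the sitemap in robots.txt and building a ping request. The `Sitemap:` line is the autodiscovery directive itself; the function names, the example.com addresses, and the search engine hostname are placeholders I made up, and the `/ping?sitemap=` path is the generic form, so check each engine's documentation for its real ping address.

```python
from typing import List
from urllib.parse import quote


def sitemap_directive(sitemap_url: str) -> str:
    """Return the robots.txt line that advertises a sitemap location."""
    return f"Sitemap: {sitemap_url}"


def find_sitemaps(robots_txt: str) -> List[str]:
    """Collect every sitemap URL named in the text of a robots.txt file."""
    urls = []
    for line in robots_txt.splitlines():
        # Ignore anything after a '#', which starts a comment in robots.txt.
        line = line.split("#", 1)[0].strip()
        # Split on the first colon only, so the URL's own "://" stays intact.
        field, sep, value = line.partition(":")
        if sep and field.strip().lower() == "sitemap" and value.strip():
            urls.append(value.strip())
    return urls


def ping_url(engine_base: str, sitemap_url: str) -> str:
    """Build a ping URL; engine_base is a placeholder, not a real endpoint."""
    # The sitemap URL must be fully URL-encoded inside the query string.
    return f"{engine_base.rstrip('/')}/ping?sitemap={quote(sitemap_url, safe='')}"


if __name__ == "__main__":
    robots = "User-agent: *\nDisallow: /private/\n"
    robots += sitemap_directive("http://www.example.com/sitemap.xml") + "\n"
    print(find_sitemaps(robots))
    print(ping_url("http://searchengine.example", "http://www.example.com/sitemap.xml"))
```

Actually sending the ping is just an HTTP GET on the URL that `ping_url` returns; I left that out so the sketch doesn't touch the network.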

If you didn't already know, Google posts their Tech Talks on Google Video. There are some top-notch presentations out there if you have an hour to burn per talk.

The most recent talk is The Internet of Things: What is a Spime and why is it useful? by science fiction writer and futurist Bruce Sterling. Early in the talk, Bruce rightly lauds Vannevar Bush's prediction of a memex device as one of the most "brilliant acts of technological forecasting, ever."

Google announced their new Web History feature a few weeks ago:
"Today, we're pleased to announce the launch of Web History, a new feature for Google Account users that makes it easy to view and search across the pages you've visited. If you remember seeing something online, you'll be able to find it faster and from any computer with Web History. Web History lets you look back in time, revisit the sites you've browsed, and search over the full text of pages you've seen. It's your slice of the web, at your fingertips."
Matt Cutts shows some use cases for Web History and also ties it in to Bush's memex device.

Joel points out a very interesting usability problem posed by the elevators installed in the new World Trade Center highrise.

And on a personal note, I'm going to be presenting a poster Search Engines and their Public Interfaces: Which APIs are the Most Synchronized? at the World Wide Web conference in Banff, Alberta next week. You can see a PDF of my poster here. Please come by my poster and say hello if you're at the event.

I just found out Johan (ex-ODU professor) is going to be at WWW too. It's too bad his poster is kinda lame.