Monday, March 27, 2006

Warrick is gaining traction

On Friday I submitted a paper to HYPERTEXT ’06 about our web-repository crawler Warrick. The paper focuses on the difficulties of crawling web repositories like the Internet Archive and Google, and it presents an evaluation of three crawling policies. I really like the conference’s flexible 8-12 page ACM format. A few weeks ago when I was submitting a paper to SIGIR ’06, it was a real pain to get everything to fit on only 8 pages.

On that note, I have released an updated version of Warrick with the following changes:
  • Warrick now uses the Google API for accessing cached pages.
  • Warrick issues lister queries (queries using the “site:” parameter) to Google via page scraping.
  • Yahoo API libraries were updated due to a March 2006 change.
  • Several minor bugs were corrected.
The biggest reason for integrating the Google API was that Warrick kept getting blacklisted by Google after 150 or so queries. Michael suggested I write up my experiences in a technical report. It is certainly something that will affect any researcher who wants to run experiments against Google from now on.
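
For anyone curious, here is a minimal sketch of what fetching a cached page through the Google SOAP Search API looks like from Perl with SOAP::Lite. This is just a sketch, not Warrick’s actual code: the license key and URL are placeholders, and it assumes the GoogleSearch.wsdl file from Google’s API kit sits in the current directory.

use strict;
use SOAP::Lite;

# Placeholder license key -- a real one comes from registering with Google.
my $key = 'YOUR-GOOGLE-LICENSE-KEY';

# Build a service proxy from the WSDL file that ships with the Google API kit.
my $google = SOAP::Lite->service('file:GoogleSearch.wsdl');

# doGetCachedPage returns the cached page (the wire format is base64-encoded
# HTML; SOAP::Lite normally decodes base64 values for you).
my $page = $google->doGetCachedPage($key, 'http://www.example.com/');
print $page if defined $page;

# Pausing between requests keeps us under the daily query limit and
# (hopefully) off the blacklist.
sleep 2;

The same proxy exposes doGoogleSearch for ordinary queries, but as noted above, the lister (“site:”) queries still go through page scraping.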

I also received several emails from the Internet Archive last week about Warrick. Apparently the guys who handle recoveries for people with missing websites are excited about the tool, and IA will start pointing users to it:
If you are tech-savvy and know how to use command-line utilities, you can also refer to the Warrick tool here: http://www.cs.odu.edu/~fmccown/research/lazy/warrick.html and be sure to email the makers as they track who is using the tool. For this tool, a third party has put it together and we cannot guarantee the results. If you have questions about this tool, please refer your questions to the makers themselves.
One of the IA employees told me she has performed at least 200 recoveries for individuals in the past year. That’s a lot of people using “lazy preservation,” and it sure does support the need for research in this area.

Monday, March 13, 2006

Accoona.com generates buzz

I read an article about Accoona.com this morning which suggested this up-and-coming search engine may give Google a run for its money. Accoona (a play on the Swahili phrase “hakuna matata,” which means “no worries”) has been around for a few years. They threw a huge party in Dec 2004 with guest speaker Bill Clinton in order to generate some buzz, and since then they’ve worked on improving the artificial intelligence that runs their engine.

I did some playing around to see if Accoona could be used as a web repository for Warrick. I did a search for “frank mccown” to see what would appear:

[Screenshot: Accoona search results for “frank mccown”, with arrows pointing to the domain names shown in the browser’s status bar]

When you hover over each search result, the browser shows the domain name in the status bar (indicated by the arrows), but not the full path of the URL. Digging through the HTML, I found this very unsightly URL:

<a href="http://www10.overture.com/d/sr/?xargs=15KPjg1mdS55Xyl%5FruNL
bXU6TFhUBf14Prpo5wWsU8AJsIrSE%2DDqchceOZxIEnDvI%5Fq
lOM24HMv6oXLvD%5FkfKJFgeNRkXRFJWL2I%5FHzo4%2DQNmhWK
A2grY6n%2DfpzrZ%2DEH4Gaxi0eaC97Oe6Noz0N2gQ848VkRCY%
5FfluyNnryL5AS%5F6HhQtupVGCY8patrBxvPzydMgfFb9Uf8eRjS%2
DIcZlF3Zx4wZ2EDhVNdnrNoSxj4gfcS3Mq8OyeJJlItLPn2ZZ0Wd
S6ws8XZkSFpfliyS6m&yargs=www.cs.odu.edu" etc...>


The actual URL for the search result can’t be found in the HTML! Apparently they use Overture, a marketing company now owned by Yahoo, to redirect you to the proper URL. They are probably using this strategy for at least two reasons: 1) scraping their pages for search results won’t get you anything, and 2) they make money by giving Overture information about what you click on when performing certain searches.
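
That said, the domain of each result is recoverable: it rides along in the yargs parameter at the tail of the redirect. A quick hypothetical Perl snippet to pull it out of a scraped link:

use strict;

# An href scraped from Accoona's results page (heavily truncated here).
my $href = 'http://www10.overture.com/d/sr/?xargs=15KPjg1mdS55Xyl...'
         . '&yargs=www.cs.odu.edu';

# xargs appears to be an opaque tracking blob, but yargs holds the domain.
if ($href =~ /[?&]yargs=([^&"]+)/) {
    print "result domain: $1\n";    # prints: result domain: www.cs.odu.edu
}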

Unfortunately, it doesn’t look like Accoona is allowing access to cached versions of pages. Maybe they want to save bandwidth, maybe they don’t want to deal with possible lawsuits that caching might entangle them in.

There doesn’t appear to be any advanced search, and there aren’t any help pages describing how to use parameters in searches. I was able to make queries using the “site:” and “url:” parameters, which are popular on other search engines.

Bottom line: not useful for Warrick just yet.

Thursday, March 09, 2006

Yahoo API has gone 404!

Just today I noticed that my Perl scripts (including Warrick) running the Yahoo Web Search API were consistently returning zero results. I tried running the scripts from different networks but had the same problem. I scanned the Yahoo API web pages to see if they were alerting us to any problems, and found nothing.

So I thought maybe I should download the most recent version of the API just in case something had changed. I tried downloading yws-1.2.tgz and yws-1.2.zip but got the same error page (right).

I also tried sending an email to the Yahoo list (yws-search-web@yahoogroups.com), but as a member of the group I never received my own message back. Something is really fishy…
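
One way to check whether the problem is the service or the Perl library is to hit Yahoo's V1 REST endpoint directly with a plain HTTP GET. If the service itself is healthy, a sketch like this (using Yahoo's demo appid as a placeholder; substitute your own) should print XML results:

use strict;
use LWP::Simple;

# Call the V1 web search REST endpoint directly, bypassing the Perl library.
my $url = 'http://api.search.yahoo.com/WebSearchService/V1/webSearch'
        . '?appid=YahooDemo&query=warrick&results=2';

my $xml = get($url);
print defined $xml ? $xml : "request failed\n";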

Update on 3/13/06:

Yahoo finally admitted the error today. According to Toby Elliott, the Yahoo API guy, the error is due to a "default variable in the Perl library," and a fix should be out soon. Several other developers complained to the Yahoo listserv, but all the emails were held until today. One noted that the error first appeared on Mar 3. That means it took 10 days for the API error to be acknowledged. Ouch! Was it because Toby was on vacation? ;)

Something else strange: a few days ago I did a Google search for "yahoo api" which returned two hits: one pointing to developer.yahoo.net and one to developer.yahoo.com. The sites appeared to be mirrors of each other. One big difference, though: the download page on the .com site resulted in 404s (see above), but the .net site's didn't. Googling "yahoo api" today shows only the .net listing. I guess I caught them in the middle of their migration from one TLD to another.

Update on 3/27/06:

I was able to find a working Perl library for Yahoo's Web Search API on CPAN.
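
For reference, here is a minimal usage sketch with CPAN's Yahoo::Search module (I'm treating Yahoo::Search as the module in question, and the AppId is a placeholder):

use strict;
use Yahoo::Search;

# AppId is a placeholder -- register with Yahoo to get a real one.
my @results = Yahoo::Search->Results(
    Doc   => 'lazy preservation',   # a standard web-document search
    AppId => 'YOUR-APP-ID',
    Count => 10,
);
warn $@ if $@;   # Yahoo::Search reports errors via $@

for my $r (@results) {
    print $r->Url, "\n";
}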

Monday, March 06, 2006

Google Video

I did a lot of playing with Google Video yesterday. It’s a service that came online in 2005 and allows users to upload and view free video as well as commercial video. Their $1.99-per-video price point matches iTunes. Neither service offers “bundling” (yet), which would make it cheaper for users to purchase large collections of video (you have to pay $1.99 for each video in The Office Season 1 instead of, say, $9.99 for all six episodes).

In Feb 2006, it was announced that the National Archives was teaming up with Google Video to allow “researchers and the general public to access a diverse collection of historic movies, documentaries and other films from the National Archives via Google Video as well as the National Archives website.” This is sure to give Google Video even more momentum and legitimacy.

My favorite thing about Google Video is that they tend to have many commercials available. I was able to find Apple’s classic 1984 Super Bowl commercial, Terry Tate Office Linebacker, and the Outpost.com commercial where they tattooed their insignia on toddlers (no actual toddlers were tattooed in the making of the video ;) ). They also have a cool feature which allows you to embed a Google Video directly into your web page and stream the video from Google.

One feature that is noticeably absent is a filtering mechanism. Although Google’s policy is to reject video that is pornographic or obscene, there are quite a few videos that come as close to that line as possible. I would like to be able to filter out the trash, although I realize that would create a lot more work for Google.