Friday, June 23, 2006

Heritrix - An archival quality crawler

This week I’ve been experimenting with Heritrix, the Internet Archive’s web crawler. It has some functionality that Wget doesn’t provide including:
  • limiting the size of each file downloaded
  • allowing a crawl to be paused and the frontier to be examined and modified
  • following links in CSS and Flash
  • crawling multiple sites at the same time without invoking multiple instances of the crawler
  • storing crawls in an ARC file
Since Heritrix was built with Java and was pre-configured to run on a Linux system, I didn’t have to expend much effort to get it to run on Solaris. I untarred the distribution file, set a couple of environment variables, started the web server interface, and boom it was working.

The interface is not exactly intuitive, and a near-complete reading of the manual is required to put together a decent crawl. Of course, if you want to use sophisticated open-source software, you usually have to put in significant effort to get it to work right. Thankfully, several of the developers (Michael Stack, Igor Ranitovic, and Gordon Mohr) have been very helpful in answering some of my newbie questions on the Heritrix mailing list.

In learning about Heritrix, I’ve put together a page on Wikipedia. Hopefully the entry will drum up more general interest in Heritrix as well. I was really surprised no one had created the page before.

Tuesday, June 20, 2006

Integer problems for the Google API

I’m not sure when it first started, but the Google API has been bombing out over the last few months when returning over 2^31 (2,147,483,648) results for a query. The API has bombed out almost every day in June when my script searches for “database” and “list”, which each return several billion results. Apparently Google’s SOAP interface is using a 32-bit integer for the total number of results, but it needs to be using a 64-bit long integer.
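To make the size limit concrete, here is a minimal Python sketch of the arithmetic. It is only an illustration of the 32-bit versus 64-bit range, not Google's code; the helper names `fits_in_int32` and `fits_in_int64` are made up for this example, and the result count used below is the 20.7 billion figure mentioned later in this post.

```python
import struct

INT32_MAX = 2**31 - 1  # largest value a signed 32-bit integer can hold


def fits_in_int32(n):
    """Return True if n can be packed as a signed 32-bit integer."""
    try:
        struct.pack(">i", n)
        return True
    except struct.error:
        return False


def fits_in_int64(n):
    """Return True if n can be packed as a signed 64-bit integer."""
    try:
        struct.pack(">q", n)
        return True
    except struct.error:
        return False


if __name__ == "__main__":
    total_results = 20_700_000_000  # a result count in the billions
    print("max int32:", INT32_MAX)
    print("fits in 32 bits:", fits_in_int32(total_results))
    print("fits in 64 bits:", fits_in_int64(total_results))
```

Any count of 2^31 or more fails the 32-bit check, which matches the threshold at which the API started failing.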

Michael Freidgeim made note of the problem on his blog a few weeks ago. Others have noticed this problem going back to April 2006. Who knows when Google will make a fix. If it's not one thing, it's something else... ;)

When searching to see when Google started using the larger total results, I came across a posting by Danny Sullivan that shows how he was attempting to use a “trick” to reveal how many pages Google has indexed. Danny suggested issuing a query that says, “give me all the pages that don’t have the word asdkjlkjasd.” I just tried -asdkjlkjasd on Google, and it gives me back 20.7 billion results. MSN gives around 5.2 billion results, but Yahoo and Ask won’t accept the query. Interesting…

I recently created my own website using Microsoft Office Live. Since it was free to register the domain name and host the site, I thought I might as well give it a try. Thanks, Microsoft, for giving me a free site. The only problem I have is actually editing it. Microsoft tried to make the interface easy for business users to create a website. Unfortunately, they created an interface that is impossible to use for those of us who want to actually edit HTML. Where is the “edit HTML” button?!

I emailed the Microsoft Office Live folks to see if they could tell me how to edit the HTML, and they replied:
Microsoft Office Live does not support stand alone HTML code designing. HTML code designing can be accomplished using Microsoft Office FrontPage 2003. Currently, only the Microsoft Office Live Essentials subscription supports publishing the website through Microsoft Office FrontPage 2003.
Looks like you have to pay if you want to be able to change the underlying HTML, and then you have to use FrontPage. Blah... Looks like my new site will not be getting much attention in the near future.

Friday, June 16, 2006

End of the Google 502 errors?

Google users have sporadically seen Google 502 (bad gateway) errors over the last several years. The errors appear momentarily and then disappear. I’ve linked to a few postings about it, ordered by date:

Mar 2003
July 2003
June 2005
Sept 2005
Nov 2005
Feb 2006
May 2006

Google API users have seen the 502 errors much more frequently:

Nov 2005, and another
Dec 2005
Jan 2006
Feb 2006, and another
May 2006

From my investigations, it looks like Nov 2005 is when the problems began. I have personally dealt with the problem ever since Mar 2006 when I integrated the Google API into Warrick. I had to add some logic to sleep for 15-20 seconds when encountering the error and then re-try.
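The sleep-and-retry logic described above might look something like the following minimal Python sketch. It is not Warrick's actual code: `request_fn` is a hypothetical callable standing in for a Google API call that returns a `(status, body)` pair, and the limit of five attempts is an assumption. Only the 15-20 second pause comes from the description above.

```python
import random
import time


def call_with_retry(request_fn, max_attempts=5, sleep_fn=time.sleep):
    """Call request_fn() until it stops returning a 502 or attempts run out.

    request_fn is a hypothetical callable returning (status_code, body).
    sleep_fn is injectable so the pause can be skipped in tests.
    """
    status, body = None, None
    for attempt in range(1, max_attempts + 1):
        status, body = request_fn()
        if status != 502:
            return status, body
        if attempt < max_attempts:
            # Pause for 15-20 seconds before trying again.
            sleep_fn(random.uniform(15, 20))
    # Every attempt returned 502; hand the last response back to the caller.
    return status, body
```

Passing the sleep function as a parameter keeps the sketch easy to exercise without actually waiting.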

In late May I started a new experiment which uses the Google API, and I’ve been monitoring it daily to see how many 502 errors I was receiving. From late May to June 6, I consistently received a 502 error for about 30% of my requests. On June 7, the number of 502s went down to zero. Since then I have received only an occasional 502 out of the hundreds of requests I make daily.

Someone at Google finally got sick of the bad press and made some changes, and I’m thankful for it. :)

Thursday, June 15, 2006

JCDL 2006 - day 3

I’m back from JCDL. Overall I really enjoyed the conference, and I’m really hoping to attend next year’s conference in Vancouver, British Columbia. Mon and Tues were packed with activities, but everything was wrapped up on Wed morning. A few highlights from the remainder of the conference:

  • The dinner Tues night at Fearrington Village was fantastic. I haven’t eaten that much pork in a long time. Everyone seemed to have a good time conversing. Johan took the top poster award, and Lagoze et al. took the top paper award.

  • Wed morning I enjoyed Johan’s presentation on aggregating and analyzing scholarly usage data, although I had seen the presentation before at ODU. Some of the other presentations were well done, but I admittedly zoned out for a lot of them and went through my emails. I could tell many others around me were doing the same, just giving partial attention to the presentations. That’s got to be annoying to the presenters.

  • After the conference I was hoping to do a quick run over to see Duke’s campus, but it was pouring down rain. So Michael, Martin, and I headed back to Norfolk right after lunch. It was nice getting back and seeing Becky. Although the Carolina Inn was fantastic, there’s really no substitute for home.

Tuesday, June 13, 2006

JCDL 2006 - day 1 and 2

I’m at JCDL 2006, hosted at UNC in Chapel Hill, NC. Michael brought Joan, Martin, and me down for the conference, and so far I’ve really enjoyed it. UNC has a really nice campus, and the Carolina Inn is deluxe. ;) Joan and I presented our dissertation abstracts at the Doctoral Consortium on Sunday, and I was able to get a few helpful suggestions. Ray Larson and the other faculty treated us to a fantastic dinner at Top of the Hill that evening. Since Michael is a co-chair of JCDL, we were unable to submit a paper, but that just means I can relax and enjoy the conference.

Here are a few highlights so far:
  • In the opening talk Monday morning, Daniel Clancy, Engineering Director of the Google Book Search, talked about Google’s efforts to digitize and index books from the G5, the five libraries that are cooperating with the digitization process. It was a very informative talk, and I certainly applaud Google for taking on such a massive and important project.

  • Andrew McCallum presented a paper about leveraging topic analysis and introduced a website, similar to Google Scholar, that displays published papers. The cool thing is how they also show co-authorship, authors that you cite, and authors that cite you. They had just 2 of my papers indexed, but I guess that isn’t bad for a research project.

  • Carl Lagoze presented a paper that honestly addressed some of the shortcomings of the “low barrier” implementation of the NSDL. Turns out the implementation is rather people-intensive: problems include content providers unwilling to provide quality metadata and improperly implementing OAI-PMH. There was one notable absence from the references. At least one of the audience members publicly admitted being depressed at the current situation. I also wonder about the future of a digital library that can’t scale without an enormous amount of people-intensive work. How do you build a DL that in many ways is competing with Google?

  • Johan gave a very in-your-face poster presentation: “Have any of you wondered about your funky JCDL reviews from last year?” Johan’s poster showed how the reviewers from last year’s JCDL were not reviewing papers based solely on their expertise. So why were non-experts judging papers that weren’t in their domain?

  • Bill Arms introduced me to Andreas Paepcke, a researcher at Stanford who works with WebBase/WebVac. Looks like they are making all their crawls available to other researchers who want them, but they won’t work for my website reconstruction research since it depends on real-time search engine content.

  • I talked some with Alesia Zuccala who presented her work with LexiURL, a piece of software written by Mike Thelwall. LexiURL uses the Yahoo API to report backlinks for a set of URLs. I really enjoy reading Thelwall's papers and hope to meet him at some point.

  • This morning Jonathan Zittrain gave a very entertaining and informative presentation about redaction, restriction, and removal of open information. It was one of the best presentations that I’ve seen, and his PowerPoint was a fantastic example of how to put together a presentation. Even Tufte would have approved. One of the most memorable slides showed the accidental grouping of two books: a children’s book and “American Jihad”.

Tonight we’re being bussed out to Fearrington Village, home of the “oreo cows”, for a pig pickin’. Yum.

Thursday, June 08, 2006

Yahoo - Error 999

Yesterday I finally received the coveted “Error 999” page from Yahoo:
Sorry, Unable to process request at this time -- error 999.
Unfortunately we are unable to process your request at this time. This error is usually temporary. Please try again later.
If you continue to experience this error, it may be caused by one of the following:
  1. You may want to scan your system for spyware and viruses, as they may interfere with your ability to connect to Yahoo!. For detailed information on spyware and virus protection, please visit the Yahoo! Security Center.

  2. This problem may be due to unusual network activity coming from your Internet Service Provider. We recommend that you report this problem to them.
While this error is usually temporary, if it continues and the above solutions don't resolve your problem, please let us know.
Just like Google, Yahoo appears to also be monitoring for high volume traffic/automated requests and denying access for a period of time from infringing IP addresses.

I have a couple of scripts that make 300 queries per day to Yahoo using their web interface. 126 of my queries received the error yesterday, and 125 today. The scripts ran for 11 days before being detected. You’d think 300 queries wouldn’t be enough to trigger the response! I’m also making the same 300 queries using the API to see what the difference is in their responses.

I’ve seen others complain of the 999 error dating back to April 2004, but this is the first time I have personally experienced it. Murray Moffatt shares his experience with the error and some possible fixes. Basically all you can do if you are running a script and encounter the page is to sleep for several minutes and try again.

Update: 6/9/06

Today I increased the wait time between each query to a random number of seconds between 3 and 8. I also ran the script at 8:00 am EST instead of 2:00 am EST to see if blending into the crowd helped at all. Today I received 133 error 999s. Not good. Possibly I'm being punished because I'm making requests at a high-volume time. Next: increase the wait time to 15-20 seconds between each query.
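For anyone wanting to try the same approach, here is a rough Python sketch of the throttling described in this post: a random pause between queries, plus a long sleep and a single retry when the error 999 page comes back. It is not my actual script. `fetch_fn` is a hypothetical function that returns the page text for a query, the five-minute back-off is an assumed value for "several minutes", and detecting the page by looking for the phrase "error 999" is also an assumption based on the error text quoted above.

```python
import random
import time


def is_error_999(page_text):
    """Guess whether a response is Yahoo's error 999 page (assumed marker)."""
    return "error 999" in page_text.lower()


def run_queries(queries, fetch_fn, min_wait=3, max_wait=8,
                backoff_seconds=300, sleep_fn=time.sleep):
    """Run each query with a random pause in between.

    fetch_fn is a hypothetical callable: query string -> page text.
    If the error page comes back, sleep for backoff_seconds and retry once;
    whatever the retry returns (even another error page) is stored.
    """
    results = {}
    for i, query in enumerate(queries):
        response = fetch_fn(query)
        if is_error_999(response):
            sleep_fn(backoff_seconds)
            response = fetch_fn(query)
        results[query] = response
        if i < len(queries) - 1:
            # Random wait so the requests are not evenly spaced.
            sleep_fn(random.uniform(min_wait, max_wait))
    return results
```

Moving to the 15-20 second wait is just a matter of calling `run_queries(..., min_wait=15, max_wait=20)`.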

Windows Live Academic Search

Microsoft launched Windows Live Academic Search (what I call Live Academic for short), a competitor to Google Scholar, a couple of months ago (Apr 11, 2006 to be precise). According to their FAQ, they are harvesting material from open archives (like arXiv) using OAI-PMH. This is a different strategy than Google’s; Google is mainly indexing papers found on the Web. A rather detailed article by Barbara Quint about Live Academic discusses how Microsoft learned from Google’s experiences and how Google is not feeling threatened by this newest entry in the search webosphere. Quint was impressed by the “very polished look” of Live Academic. I gave it a try, and here’s what I have to say about it:

Things I liked:
  • Display of the abstract and other metadata for the article on the right side of the screen.

  • The ability to click on an author’s name to search for other works by that author.

  • Support for BibTeX and EndNote.

  • The ability to sort by author, date, journal, and conference.

Things I did not like:
  • The attempt to produce a “snazzy” interface (using Ajax) which has a scroll bar that jumps around from time to time with no apparent explanation. Also if I used IE, it was almost impossible to highlight and copy text. Surprisingly Firefox on Windows had no such problem.

  • No advanced search. You can’t limit the search results to just computer science or search by author, journal, title, etc.

  • When using IE, the Back button on the browser frequently does not return to the previous page. Firefox on Windows sometimes also exhibited this behavior.

  • Intermittent problems with searching. For example, searching for "mod_oai" results in nothing being found. But if I search for “apache module for metadata harvesting”, the paper with “mod_oai” in the title appears. But if I search for “apache module for”. (Correction: these problems appear to have been fixed overnight.)

  • Searching for authors with a middle initial can be problematic. A search for "michael l. nelson" (with quotes) seems to accurately locate many of Michael's publications. But if you click on "Michael L. Nelson" in one of the results, a search is made for authors matching "Nelson, M" which produces many false-positives.

  • I could not find a single one of my publications even though several of them are in arXiv. (Correction: this morning several of them now appear to have been indexed including my thesis.)

Overall I'd say stick to Google Scholar for now. But as Microsoft appears to be making some major improvements (literally overnight), my list of “didn’t likes” is bound to get much shorter.

Tuesday, June 06, 2006

Graphs in R

The last week or so I’ve been trying to learn the R programming language. The language was named R after the authors’ names, a really poor choice since it makes it almost impossible to search the Web for R-related web pages. One of my colleagues once stated that the R stands for “razor” as in what you feel like using on your wrists when trying to learn R! I have to agree. The intro material they supply is ok for learning a few basics, but I have yet to come across anything that shows all the basics of producing simple line and bar graphs. And I’m amazed by the poor examples in the user-contributed documents that give a little code with no pictures, or pictures so small you have to magnify the image 10x to see anything. Therefore I have created my own intro to producing simple graphs in R. It’s by no means complete, but it’s much better than anything I’ve found.

Monday, June 05, 2006

Getting external backlinks from Google, Yahoo, and MSN

It’s often useful to know how many external backlinks are pointing to a particular URL. This metric can be used to partly determine a page’s popularity on the Web. A good tool for automating this process is the Backlink Analyzer, which queries the Google, MSN, and Yahoo APIs with the "link:" command. The software allows a user to specify sites to ignore in the backlink counts, a useful function since the link: command returns backlinks from both external and internal links for all three search engines.

I don’t know for sure if Backlink Analyzer is doing this or not, but for Yahoo and MSN, it is possible to issue a single query that returns only external backlinks by combining the link: and -site: parameters. For example, the following query will show all the pages pointing to my Warrick page for Yahoo and MSN:


Google will not handle the -site: parameter successfully in this query although it does handle it in other types of queries.
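Here is a small Python sketch of how such a combined query string could be built for any page. Treat it as an assumption-laden illustration: the helper name `external_backlink_query` is made up, the example.com URL is a placeholder rather than my Warrick page, and the exact syntax each search engine accepts may differ from what is shown.

```python
from urllib.parse import urlsplit


def external_backlink_query(url):
    """Build a 'link:' query that excludes pages from the URL's own host.

    Combines link:<url> with -site:<host> so that only backlinks from
    other sites are requested.
    """
    host = urlsplit(url).hostname
    if host is None:
        raise ValueError("URL must include a scheme and host: %r" % url)
    return "link:%s -site:%s" % (url, host)


if __name__ == "__main__":
    # Placeholder URL, used only for illustration.
    print(external_backlink_query("http://www.example.com/tools/page.html"))
```

The resulting string would still need to be URL-encoded before being sent to a search engine's web interface or API.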