Tuesday, October 17, 2006

Back from the Bahamas

Becky and I returned Saturday morning from our cruise to the Bahamas. Our room on board was spacious, and we loved getting breakfast in bed and twice-daily room service. We had a great time snorkeling in Nassau and at the Atlantis hotel, and the beach in Freeport was incredible. I celebrated my birthday onboard and got to blow out candles on a Baked Alaska.



Besides getting to spend a lot of time with Becky, I also had plenty of time to read. I brought along a book called The Language of God: A Scientist Presents Evidence for Belief by Francis S. Collins, the head of the Human Genome Project. Collins does a fantastic job of presenting how science and faith can and should be integrated. He is an excellent communicator, and he especially does a good job of summarizing the current state of scientific knowledge. I highly recommend this book to atheists, agnostics, and believers alike. I found The Language of God and Finding God at Harvard (which I just finished last month) to be very encouraging to my faith.

Now it’s back to the grindstone...

Saturday, October 07, 2006

Storm before the calm

Becky and I are going on our first cruise next week. We depart Norfolk tomorrow for Nassau and then to Freeport. This was supposed to be the "celebrate finishing the dissertation" cruise that we were going to take next summer, but the baby changed all that. Becky just finished her first trimester, so she should be feelin’ good for the cruise.

I’m glad to be getting out of town... yesterday a nor’easter came crawling into town. Combined with the high tide and full moon, my neighborhood was flooded just as badly as it was when Ernesto came through. I took these photos this morning.


Colonial Place mermaid

Apartment building next door

House next door

Our parking lot

Friday, October 06, 2006

Mark Foley websites recovered

Mark Foley, a Congressman from Florida, resigned on Sept 29, 2006 over allegations of inappropriate emails to minors who worked as Congressional pages. It was brought to my attention on Tuesday (thank you Martha!) that his websites

http://www.house.gov/foley/

and

http://www.markfoley.com/

were both shut off after the resignation. I have reconstructed both sites using Warrick and made them available here.

Become.com's web crawler

Today a member of the Heritrix listserv pointed everyone to an article on Sun’s website that discusses Become.com’s web crawler. The article dates back to August 2005, so it’s a little dated. I couldn’t find any updated information on the crawler, but apparently it is proprietary, and the source code will likely never see the light of day.

Become.com actually developed two crawlers in 2004: one written entirely in Java and the other mostly Java with some C++. The article states that the crawlers "may be the most sophisticated, massively scaled Java technology application in existence."

The article doesn’t mention anything about Heritrix, a crawler which is also completely written in Java. Although Heritrix doesn’t currently have a distributed architecture, it could still be deployed in such an environment. It would be really interesting to see the two crawlers compete at the National Java Crawling Championships.

Tuesday, September 19, 2006

Software Engineering: Best Job in America

Money Magazine has listed software engineering as the number one Best Job in America. Even with the threat of off-shoring jobs, a computer science degree isn’t looking so bad after all. In fact, of the top 10 jobs listed, 4 of them could be obtained by someone with a CS degree. Having been both a software engineer and now a college professor (number two on the list), I can attest that both jobs are very rewarding.

Club a la 700

This morning I sat in the audience during the taping of the 700 Club, the daily news show hosted by Pat Robertson, Gordon Robertson, and Terry Meeuwsen. The 700 Club is filmed at the CBN headquarters, which is located on the Regent University campus in Virginia Beach. The show is broadcast daily on several cable stations.

When Becky and I first moved to Virginia, our tourist book mentioned that the 700 Club was filmed in Virginia Beach, and I really wanted to see it being filmed. Becky thought that I was crazy. I’m not a frequent viewer of the 700 Club, but I’ve always wanted to see some TV show being filmed live, and this is the first opportunity I’ve ever had. Living in Searcy, Arkansas doesn’t present many opportunities. Since I’ve got tons of spare time these days, I figured I could take off one morning to see the taping. ;)

I was one of only five audience members (apparently the audience size fluctuates from day to day). Pat came out before the show began and said a quick prayer for us and the show and then joined Terry, who was already positioned at the news counter.

The show began with comments on the recent Pope vs. radical Islam spat. There was a piece about Iranian president Mahmoud Ahmadinejad and then talk about Iraq. The show was mainly scripted, but Pat made several remarks off the cuff. I believe it’s these types of remarks which have gotten him in trouble in the past. After the piece on Ahmadinejad, which focused on his desire to bring about the end times, Pat exclaimed: "Well, that is just weird..." :)

Dr. Kevin Leman, a prominent Christian psychologist and author, was the single guest. My mom would have liked to meet Dr. Leman since she’s a big fan of his Birth Order Book. He was interviewed by Terry and talked about his new book on single parenting, but as soon as the interview was over he bolted. I guess I’ll get that autograph some other time. ;)

After the taping Pat thanked us for coming, and Terry hung around and talked to each of us. We then were given a tour of the CBN building. It was like touring a museum since there were numerous pieces of original art including the largest painting of the Lord’s Supper produced in the 20th century. Our tour concluded with a very earnest prayer (which I really appreciated). After that I met up with Becky for lunch.

I really enjoyed getting a behind-the-scenes look at how a TV show is run. The guys at CBN are real pros. Now it's back to the grind...

Friday, September 15, 2006

Our bean has a heart!

This morning Becky had her second visit to the obstetrician. Since we were going to hear the baby’s heart beat for the first time, I got to tag along. At just 11 weeks, our little bean (ok, he’s probably the size of a walnut now) has got a strong heart rate!
wucca-wucca-wucca-wucca-wucca...
Perfect for a future tennis star...

Tuesday, September 12, 2006

Thinking about next fall

The fall semester has just begun, and I again find myself wishing I was back in front of a classroom. As I contemplate returning to teaching next fall (God willing), there are several things I’d like to do differently. Here are three nuggets I have come across recently that have gotten me thinking about next fall:

1) Pair Programming

Pair programming is a relatively new way of teaching students how to program in CS1 classes. I first learned of pair programming in a recent CACM article entitled "Pair Programming Improves Student Retention, Confidence and Program Quality" by McDowell et al (2006). McDowell has written on the topic since 2002 and even provides a video on how to teach pair programming.

In McDowell's experiments, paired students were more likely to complete the course, were more satisfied with their work, and stuck with the CS major more often than students who had to program independently. The paired students also performed similarly on the final exams, which means they were learning to program just as well as the non-paired students. The article also discusses how pairing could be useful for retaining women in CS.

Like many other universities, Harding has seen a downward trend in the number of CS majors over the last several years. I think pair programming may be one of the tools we could use to help rebuild enrollment. I’m excited to try it in my introductory programming classes soon.

2) Programming with Alice

Becky recently met an instructor who was teaching middle school and high school teachers about Alice, an alternative programming language which teaches programming by manipulating objects in 3D space. When the instructor learned that Becky’s husband taught programming courses, he handed her an intro-to-Alice textbook, which she passed on to me.

Alice was developed at Carnegie Mellon, but it's soon to be overhauled by Electronic Arts, which should give it the crisp look the Xbox generation has learned to expect. Alice allows students to quickly get an animation running without the boring details of variable declarations and semicolons.

I’d really like to teach Alice to a group of students who have no programming background and see how they do. If I can get them excited about Alice, I may be able to get them excited about computer science.

3) Limit wireless Internet access during class

Dennis Adams wrote a fantastic Viewpoint piece entitled "Wireless Laptops in the Classroom (and the Sesame Street Syndrome)" in the September issue of the CACM (2006). Adams, a professor at the University of Houston, opened the article with two examples of “laptops gone wild” in the classroom: 1) a 2002 school brochure which inadvertently contained a photo of a student playing Solitaire while his unaware professor lectured, and 2) a Wall Street Journal reporter who witnessed several students using chat rooms, IM, and the Web while Adams lectured.

Adams rightly points out that professors need an off button: a way to turn off wireless access or at least limit it during class. Professors cannot be expected to compete with Google, IM, Solitaire, Facebook, et al. during their lectures or to provide constant infotainment. No professor, no matter how good, can compete with the infinite number of distractions the Internet places just inches in front of a student. Adams is not alone in his assessment.

I have personally used embarrassment tactics to ensure proper use of laptops while I lectured, with some modest success. I called out one of my students who was using MS Paint to draw a picture while I lectured, and every day after that he sat very attentively. I’ve been fortunate enough to teach small classes where it’s easy for me to roam and see what my students are doing, but such a strategy certainly doesn’t work in large classroom settings.

Someone is going to create some software that allows only limited Internet connectivity during class, and that person is going to make a killing. And of course the guy who invents the software to circumvent the policing software is going to be the next Shawn Fanning.

Monday, September 11, 2006

Remembering 9/11

Today is the fifth anniversary of the 9/11 terrorist attacks. I didn't know anyone personally who was lost in the attacks, but like everyone else, it affected me deeply.

My friends and I were talking over lunch yesterday about how we first found out about the attack. I was listening to Fox and Friends while shaving and getting ready for a day of teaching (it was my fifth year teaching at Harding Univ). I remember one of the hosts saying that they thought a plane had struck one of the Twin Towers. I ran over to the TV and couldn’t believe the sight... what kind of horrible pilot would make such a huge mistake? I watched for a little while and then witnessed the second tower being hit; this was no accident. There wasn't anyone around to talk to about it (my roommate had already left), so I raced over to the office and found Dana Steil (a colleague of mine in the CS department) who hadn't heard about it. Dana turned on the radio, and I just sat at my desk wondering what was going on. A few minutes later we went to chapel, and I think I remember President Burks announcing what was happening, and we prayed for quite a while about the situation. I don’t really remember what happened the rest of the day... it’s a fog.

Although I normally avoid the many 9/11 shows and movies that come out near the anniversary, I decided to see World Trade Center (starring Nicolas Cage) this weekend after several friends had recommended it. Becky and I were genuinely moved by the courage of the men and women who ran into a huge disaster just to help complete strangers. I especially liked the Dave Karnes character who, feeling that God was calling him to action, raced to Ground Zero from his home in Connecticut to search for survivors after everyone else had called it a day.

It’s tough to watch parts of this movie, but you will definitely leave the theatre knowing that courage and goodness are not in short supply in America.

May God bless and comfort those who are still deeply scarred by these attacks.

Saturday, September 09, 2006

Google's cached date = last request date

Vanessa Fox, a member of the crawl team at Google, announced on Tuesday that Google would start posting the last request date on their cached pages. Google used to only indicate the date that the page was last retrieved, so if Google made an If-Modified-Since request and the web server responded with a 304 (not modified) response, the cached date would be left unchanged. Now the cached date will indicate the date of the 200 or 304 response. Matt Cutts also discussed this change and even made a little video for those that needed a visual explanation.

Frankly, I was very surprised to learn that Google’s cache date worked this way. In effect, it was much like Yahoo’s Last-Modified date… it was really just an indication of when they noticed the page changed. I have crawl data from 2005 indicating that Google would periodically issue regular HTTP GET requests, possibly just to verify that the content had indeed not been changing.
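The old-versus-new behavior can be boiled down to a tiny function over a crawl history (a hypothetical illustration in Python; this is my reading of the policy, not Google's actual code):

```python
def displayed_cache_date(responses, policy="new"):
    """Given a crawl history of (date, status) pairs, return the date
    shown on the cached page. Old policy: only a 200 (full retrieval)
    updates the date. New policy: a 304 (not modified) updates it too."""
    shown = None
    for date, status in responses:
        if status == 200 or (status == 304 and policy == "new"):
            shown = date
    return shown
```

For a history of a full retrieval followed by a later If-Modified-Since check answered with a 304, the old policy keeps showing the earlier retrieval date while the new policy shows the date of the 304.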

I’m not totally sure what MSN’s cache date is indicating. From my 2005 crawl data, MSN apparently never issued an If-Modified-Since request. If they are still operating with the same crawl policy, then they are storing the time they last contacted the web server, so their cache date would indicate the same thing as Google’s.

What this means for Warrick: Google will more frequently now have the most recent version of a page. Therefore Google’s overall percentage of contributed resources will likely increase in the reconstructions I’ve been performing the last few weeks.

On a side note, someone asked Matt Cutts why Google does not post the cached date of PDFs, and Matt said he was going to ask the crawl team about it.

Friday, September 01, 2006

Rainy Friday

The Hampton Roads area is currently being doused by tropical storm Ernesto. Regent, ODU, and practically every school in the area has shut down due to the flooding. I attempted to drive over to ODU just for fun, but I was stopped at every turn, as the photos below illustrate. It's just the beginning of a beautiful Labor Day weekend. :)


. . .

View it while you can: Microsoft UK commissioned Ricky Gervais and Stephen Merchant to produce a couple of spoof training videos with Gervais playing David Brent, the boss from the UK’s The Office series. For some reason, Microsoft apparently didn’t want these videos ever to be shown publicly and has started an investigation into how the videos were leaked. The videos still remain on Google Video, so watch them while you can… they are hilarious!

. . .

Agassi- you are a stud. How far can he go on his final tournament stop, the US Open? More importantly, can I get Becky to go along with naming our future son (if it’s a boy) Agassi McCown?

. . .

Thanks, Elrod, for taking some photos of the new chapel at Harding. It looks great!

. . .

Congratulations to Becky on being awarded Regent University’s Employee of the Month. She received a choice parking spot for September, a little spending money, and a letter of recommendation signed by Pat Robertson (you can’t have too many of these sitting around the house). Becky also was awarded the School of Education’s Staff Member of the Year last May.

An employee will go home and ask his neighbor, "Hey, did you get an award?" "No man. I mean I slave all day and no one notices." Next thing you know, he smells something funny from his neighbor's house. Neighbor hanged himself due to lack of recognition. - Michael Scott (The Office)

Monday, August 28, 2006

Bayside League: 2006-2007 season

Saturday night was the Bayside League’s third annual fantasy football draft. Mark Velez was kind enough to host it at his place. There were 14 teams: just 2 shy of having 2 leagues. Chris Deny, last year’s champion, drafted by phone, and I drafted a team for my dad.

I got to draft second, and since Peyton Manning was taken first, I was able to nab the player I hope will be the best running back in the NFL this year: Larry Johnson. I was also able to draft a few Broncos and Cowboys; I just can’t help but play favorites.

Last year I came in second place. Let’s see what I can do this year…

  1. RB – Larry Johnson
  2. RB – Willis McGahee
  3. QB – Jake Plummer
  4. WR – Donald Driver
  5. TE – Chris Cooley
  6. WR – Andre Johnson
  7. DEF – Cowboys
  8. K – Josh Brown
  9. RB – Mike Bell
  10. QB – Mark Brunell
  11. WR – Nate Burleson
  12. K – Lawrence Tynes
  13. DEF – Patriots
  14. TE – Desmond Clark

Thursday, August 24, 2006

Funny Commercials

One of my hobbies is “collecting” commercials. I know that sounds odd. What I mean is I am a big fan of commercials, and I like to download or keep “pointers” to commercials that I think are really hilarious, entertaining, or just plain fantastic. Google Video and YouTube are two of the best sources for commercials. In fact my favorite commercial this year is the Liberty Mutual “pay-it-forward” commercial which is now available on YouTube:



When we got home from church last night, we caught the “World’s Funniest Commercials” on TBS. In general I was disappointed by the crassness and message of many of the commercials, but there were a few gems:

LA County Fair – "Duh, Ashley, all wool comes from a cow..."
1-800-Got-Junk – “Rat Advertising Trial (R.A.T.)”
Avis: Gansta Rap – “Gotta get that money made”
Solo Mobile: Housewarming – “You are a legend”

Honorable mention: Pacman Puppet Show

All the commercials can be seen in high quality at veryfunnyads.com.

Tuesday, August 22, 2006

The ACLU and nonsectarian prayer

Last week, a federal judge dismissed a Fredericksburg City Council member's lawsuit challenging the council's nonsectarian prayer policy. Ironically, the ACLU of Virginia is fighting in this case to limit the liberties of an individual rather than protect them. It’s not quite so surprising when you learn their opponent is a Christian who desires to use the name “Jesus Christ” in an opening prayer for the city council.

A lawyer friend of mine who is intimately familiar with this case offered the ideal ACLU prayer, guaranteed to be as nonsectarian as possible:

"At this point we would like to call on a genderless, nameless higher power than ourselves and invoke that being's (or those beings') intervention (or non-intervention, as the case may be) upon this governmental body."

If it ever does come down to that, most of us would rather have no prayer at all. That’s exactly what the ACLU is hoping as well.

Hypertext 2006

Michael is in Odense, Denmark this week presenting two papers at the Hypertext 2006 conference:

  1. Evaluation of Crawling Policies for a Web-Repository Crawler by McCown and Nelson
  2. Just-In-Time Recovery of Missing Web Pages by Harrison and Nelson

I should be there presenting the crawling policies paper, but Michael graciously went in my place so I could stay here and work. Our papers are 2 of the 12 that will be presented.

Lazy Paper accepted to WIDM'06

I got some really good news yesterday: my Lazy Preservation paper (“Lazy Preservation: Reconstructing Websites by Crawling the Crawlers”) was accepted for publication at the WIDM’06 workshop along with Joan’s mod_oai paper. Only 11 of the 51 submissions (21.5%) were accepted, which means the workshop participants are going to be getting very familiar with the research aims at ODU. ;-)

WIDM is in Arlington, Virginia on November 10 and is held in conjunction with CIKM’06. This will be a good opportunity for me to visit my sister again in D.C. I was up there a few weeks ago giving the good news about Becky’s pregnancy to my parents (they were visiting from St. Louis). Random event: while I was in D.C., my mom and I ran into Karl Rove at a bookstore- we chatted for a while (Mom’s a fan), and he seemed like a nice guy, but I'm sure most politicians do. Anywho, I’m looking forward to the workshop.

Thursday, August 17, 2006

Google Analytics

This morning I installed Google Analytics on my blog and ODU website. It’s a free tool that allows me to track how users enter, leave, and navigate my website. It involved simply posting some JavaScript (below) on the pages I wanted to track:
<script src="http://www.google-analytics.com/urchin.js" type="text/javascript">
</script>
<script type="text/javascript">
_uacct = "UA-######-#";
urchinTracker();
</script>
It’s going to take a few weeks before there’s any data collected, but I’m really curious to see what this will reveal about the popularity of Warrick since I don’t have access to the CS web server logs.

Update on 8/25/06

It's been a week, and I'm now able to see some analysis of my blog and my ODU cs website. The screen below shows a summary of my blog's traffic from Aug 18-24:


Visitors increased from 11 to 26 during this week, and pageviews ranged from 25-49. Three quarters of the traffic are new visitors (Google is using cookies to track this).

The Geo Map Overlay is fascinating. My blog tends to appeal more to Americans and Europeans: I got only 2 visits from Australia, 1 from South America, and 1 from Africa. There were 14 visits from Tampere, Finland and 9 from Nokia (also in Tampere). The Visits by Source graph shows the Finish hits to be from Timo's nothingforsale.com website where I now have a link pointing to my blog.

Most people find my blog through Google. So what are people searching Google for that lands them on my blog?

Apparently my blog entry about Yahoo's error 999 is by far the most popular. Searches 1, 4, 5, and 9 will all return this entry in Google's top 10 results. The "shiri maimon" entry doesn't show up in Google until page 3 (top 30 results).

What pages are referring visitors? Apparently Elrod's blog is the biggest referrer so far. This is due to a comment I left on Aug 18.

What really surprises me is that anyone is visiting frankmccown.com, a website with no content. I did a search for "frank mccown" in Google, and the site came up number one. Come on Google... you guys are supposed to be punishing content-less sites like this, not promoting them. I guess Google's PageRank is far from what they originally published in 1998; there's maybe 1 or 2 links to this page from anywhere on the Web.

My cs website is getting a little more traffic than my blog even though I only have the tracking on a few of the pages. What was most interesting to learn was that Wikipedia was by far the largest referrer, sending me around 60 referrals last week. Most of the referrals are coming from the Internet Archive entry where there's a pointer to Warrick.

The Warrick page received 125 visits and 175 pageviews last week (18 and 25 per day, respectively). Here are some search terms people are using to find Warrick:
  • google api convert documents to HTML
  • Warrick
  • Warrick website download
  • Warrick archive.org
  • warrick perl
  • cached archive website
  • recover website from google cache
  • google cache website recovery

Wednesday, August 16, 2006

Little McCown on the way

I just can’t keep it a secret anymore – Becky is seven weeks pregnant! The appointment with the OBGYN went really well this morning, and, God willing, we are expecting our first little McCown on April 5. That’s when I’ll be in the thick of writing my dissertation, so it will be a really exciting and challenging time! :)

Nothing for sale

When my friend Timo Kosonen emailed me back in January 2006 about his website http://www.nothingforsalesite.com/, I emailed him back saying that he was crazy- who would pay something for nothing? His idea roughly mirrored the http://www.milliondollarhomepage.com/ idea where people spent one dollar per pixel for on-screen real estate. Although it worked out well for the million dollar guy, I wasn’t so sure it would work out the same for Timo.

Well, Timo got some press out of it and was able to make a few hundred dollars (which he applied to his wedding). Not bad for a Harding grad. ;)
Way to go, Timo!

Tuesday, August 15, 2006

Torrance Daniels in Philly


Torrance Daniels (“Tank”) may become the first player from Harding University to play in the NFL. Right now he’s in training camp with the Philadelphia Eagles. Although I’m no Eagles fan (go Cowboys!), I’ll be pulling for him.

Update on 9/8/06:

It looks like Tank will be on the Eagles' practice squad. Not bad.

Update on 11/21/06:

Tank is now a starter, thanks to the season-ending injury to McNabb.

Update on 11/30/06:

Tank started last Sunday evening against the Colts and made the first tackle of the night on the opening kick-off. There's an article about it on the Harding website.

Friday, August 11, 2006

AOL releases search queries

In late July, AOL released the search histories of more than 650,000 of its users (21 million queries) on its new research website. Although the data was stripped of personal identifiers, it still made privacy advocates extremely upset. AOL issued an apology 10 days later and yanked the data from their site, but it had already been replicated.

For a researcher involved in information retrieval, this data is a gold mine. Most researchers don’t have access to data like this. Unless you work for a search engine, you have to rely on search data from your institution or beg for it from other locations.

On the other hand, some search data could be linked to specific individuals, and I can see why that would be alarming to some. Perhaps there’s a middle ground? What if location data could be randomly swapped? For example, a search for “boston hair cut” could be changed to “denver hair cut”. Although this would make the location information worthless, all the other important information (query length, word length, subject matter) would still be present. Other heuristics could be applied to muddle the location. Of course this doesn’t address all the privacy issues, but it’s a start.
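The swapping idea above can be sketched in a few lines (a hypothetical illustration; the city list and tokenization are made up, and a real scrubber would need a proper gazetteer of place names):

```python
import random

# Toy gazetteer of recognized city names (an assumption for illustration)
CITIES = {"boston", "denver", "seattle", "atlanta"}

def swap_location(query, rng=None):
    """Replace any recognized city name in a query with a randomly chosen
    different city. Location becomes worthless, but query length, word
    lengths, and subject matter are preserved."""
    rng = rng or random.Random()
    words = []
    for w in query.split():
        if w.lower() in CITIES:
            w = rng.choice(sorted(CITIES - {w.lower()}))
        words.append(w)
    return " ".join(words)
```

For example, "boston hair cut" would come back with some city other than Boston while the rest of the query survives intact.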

Many of the queries are very disturbing, dealing with pornography, grief, and revenge. They read like the random private thoughts of their owners. Although people would likely never mutter this stuff to a friend, they have no problem entering it into a search box. One thing is very clear: there are a lot of hurting people out there.

What I also found very interesting was the way people construct their queries. Many queries are quite long; users are apparently adding more words to get better precision. As the Web has grown much larger, it has become necessary to use more words. Just six years ago a long query would return very few hits, but not anymore. People also sometimes use slang or misspellings that would likely match fewer results. For example, one user entered “u” in several queries where “you” would obviously be more appropriate. Search engines may need to adapt to the use of slang and make automatic substitutions when possible.

It’s really too bad that AOL has received so much heat for what has happened, especially since other companies like Excite and AltaVista have done the same thing in the past. The difference today is that we are much more aware of privacy issues, and the queries are becoming much more tuned to individuals. I would still like to see Google, MSN, Yahoo, and others also give up some detailed search data like this in the future.

Wednesday, August 09, 2006

Crawling the Web is very, very hard…

I’ve spent the past couple of weeks trying to randomly select 300 websites from dmoz.org. There were only a few restrictions I placed on the selection:
  1. The website’s root URL must not redirect the crawler to a URL that is on a different host. If it does, the new URL should replace the old website URL.
  2. The website’s root URL should not appear to be a splash page for a website on a different host or indicate that the website has moved to a different host. If it does, the new URL should replace the old website URL.
  3. The website should not have all of its contents blocked by robots.txt. If some directories are blocked, that’s ok.
  4. The website’s root URL should not have a noindex/nofollow meta tag which would prevent a crawler from grabbing anything else on the website.
  5. The website should not have any more than 10K resources.

The restrictions seem very straightforward, but in practice they are very time consuming to enforce. Requirement 2 requires me to manually visit the site. Did I mention not all the sites are in English? That makes it even more difficult. Requirement 3 means I have to manually examine the robots.txt. Requirement 4 requires manual examination of the root page, and requirement 5 means I have to waste several days crawling a site before I know whether it is too large.

I guess I could build a tool for requirements 3 and 4, but I’m not in the mood.
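If I ever do get in the mood, a rough sketch of such a tool might look like this (hypothetical Python; the robots check only tests the site root against a generic user agent, and the meta-tag regex is simplified and would miss some attribute orderings):

```python
import re
import urllib.robotparser

def site_fully_blocked(host):
    """Requirement 3: true if robots.txt disallows the site root for
    all crawlers (fetches http://host/robots.txt over the network)."""
    rp = urllib.robotparser.RobotFileParser("http://%s/robots.txt" % host)
    rp.read()
    return not rp.can_fetch("*", "http://%s/" % host)

def root_has_noindex_nofollow(html):
    """Requirement 4: true if the page carries a robots meta tag
    containing noindex or nofollow."""
    tag = re.search(r'<meta[^>]*name=["\']?robots["\']?[^>]*>', html, re.I)
    return bool(tag and re.search(r'noindex|nofollow', tag.group(0), re.I))
```

The meta-tag check could run on the root page already fetched during the crawl, so only requirement 3 costs an extra request per site.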

Anyway, I ended up making about 50 replacements (at least) and starting my crawls over again. Now I finally have 300 websites that meet my requirements.

In the past I’ve used Wget to do my crawling, but I’ve decided to use Heritrix since it has quite a few useful features missing from Wget. But Heritrix isn’t perfect. I made a suggestion that Heritrix show the number of URLs from each host remaining in the frontier when examining a crawl report:

http://sourceforge.net/tracker/index.php?func=detail&aid=1533116&group_id=73833&atid=539102

Right now it is very difficult to tell if a host has been completely crawled or not. I would love to work on this myself, but I just don’t have the time right now. Maybe I'll get a student to work on this next time I'm teaching. ;)

The other difficulty with Heritrix is in extracting what you have crawled. I will need to write a script that builds an index into the ARC files so I can quickly extract data for a website. Since all the crawl data is merged into a series of ARC files, it is really difficult to throw away crawl data for a website you aren’t interested in. I could write a script to do it, but at this point it’s not worth my time.
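The index script would essentially be a single pass over each ARC file recording byte offsets. A minimal sketch, assuming the version-1 ARC record header of five space-separated fields (URL, IP address, archive date, content type, record length) and glossing over wrinkles like the filedesc header record and compressed ARCs:

```python
def build_arc_index(path):
    """Scan an ARC file once, mapping each URL to the (offset, length)
    of its record so the content can later be pulled out with a seek."""
    index = {}
    with open(path, "rb") as f:
        while True:
            offset = f.tell()
            header = f.readline()
            if not header:
                break  # end of file
            fields = header.decode("utf-8", "replace").split()
            if len(fields) != 5:
                continue  # blank separator line between records
            url, _, _, _, length = fields
            index[url] = (offset, int(length))
            f.seek(int(length), 1)  # skip over the record body
    return index
```

With the index in hand, extracting one website's pages is a matter of seeking to each recorded offset, skipping the header line, and reading the stated number of bytes: no need to stream through gigabytes of unrelated crawl data.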

Anyway, crawling is a very error-prone and difficult process, but I can’t wait to teach about it when I return to Harding! (I'm really getting the itch to get back in the classroom.)

Tuesday, August 08, 2006

CiteULike - a second brain for researchers

Today I discovered a new free tool for researchers to manage their references. It’s called CiteULike, and it’s been around since 2004. Richard Cameron created the site and intends to keep it free.

I was able to upload all my BibTeX entries without any difficulties. I can now tag each entry so I can quickly see what papers are relevant to a particular subject. For example, here are all the papers I have tagged for web-archiving:

http://www.citeulike.org/user/fmccown/tag/web-archiving

CiteULike is useful for finding out about new research in your area. For example, this user apparently has many of the same interests as me, and I found several new papers by browsing his library:

http://www.citeulike.org/user/ChaTo/

It’s cool because I can even see comments that users have made about specific papers.

Now when I come across a new paper, I can add it to my CiteULike library and jot a quick note about it and not worry months later when I need to find the paper. And now instead of emailing Michael my BibTeX file, he can download the whole thing directly from the Web.

Thursday, August 03, 2006

Yahoo transforming FRAME tags

The past several months I’ve been ramping up for a huge experiment where I’ll be reconstructing several hundred websites. I’ve been learning to use Heritrix and process ARC files, and I’ve been periodically tweaking Warrick. Today I found out that Yahoo has changed the way it caches HTML pages that contain frames.

For example, the page at http://www.harding.edu/comp/ contains the following HTML:

<FRAMESET COLS="195,*" FRAMEBORDER=no FRAMESPACING=0>
<FRAME SRC=menu.html NAME="MENU" MARGINWIDTH=0 MARGINHEIGHT=0>
<FRAME SRC=welcome.html NAME="MAIN">
</FRAMESET>

In Yahoo’s cached page for this URL, the FRAME tags are converted to the following (I’ve added some white space for readability):

<frameset rows="200,*"><frame scrolling="no" noresize="" frameborder="0" marginwidth="0" marginheight="0" src="http://216.109.125.130/search/cache?.intl=us&u=www.harding.edu%2fcomp%2f&
w=%22harding+.edu%22&d=XS7fRmP9NNYx&origargs=p%3durl%253Ahttp%253A%252F%252F
www.harding.edu%252Fcomp%252F%26toggle%3d1%26ei%3dUTF-8%26_intl%3dus&frameid=-1">


<FRAMESET COLS="195,*" FRAMEBORDER=no FRAMESPACING=0>

<frame security="restricted" MARGINHEIGHT="0" MARGINWIDTH="0" NAME="MENU" SRC="http://216.109.125.130/search/cache?.intl=us&u=www.harding.edu%2fcomp%2f&
w=%22harding+.edu%22&d=XS7fRmP9NNYx&origargs=p%3durl%253Ahttp%253A%252F%252F
www.harding.edu%252Fcomp%252F%26toggle%3d1%26ei%3dUTF-8%26_intl%3dus&frameid=1" >


<frame security="restricted" NAME="MAIN" SRC="http://216.109.125.130/search/cache?.intl=us&u=www.harding.edu%2fcomp%2f&
w=%22harding+.edu%22&d=XS7fRmP9NNYx&origargs=p%3durl%253Ahttp%253A%252F%252F
www.harding.edu%252Fcomp%252F%26toggle%3d1%26ei%3dUTF-8%26_intl%3dus&frameid=2" >


</frameset></FRAMESET>

Yahoo is placing their own FRAMESET tags around mine and loading the two column frames with pages directly from their cache. Notice the use of security="restricted" within the FRAME tag which tells the browser to place security constraints on the frame sources; this disables any JavaScript in my pages.

While this conversion of FRAME tags makes the page easier to view from their cache, it completely destroys the original HTML. There’s no way I can even parse through the arguments to tell what URL used to be in the SRC attribute. ARG! Now I’m going to have to add a rule to Warrick that tells it to ignore Yahoo cached pages that contain FRAME tags. Google and MSN have yet to implement this “trick”, and hopefully they never do.
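The rule itself should be straightforward. A sketch of the check (illustrative only, not Warrick’s actual code):

```python
import re

# A page pulled from Yahoo's cache is unrecoverable if it contains a
# FRAMESET, since the original SRC URLs have been rewritten beyond
# recognition.
FRAMESET_RE = re.compile(r"<\s*frameset\b", re.IGNORECASE)

def skip_yahoo_cached_page(html):
    """Return True if this Yahoo-cached page should be ignored."""
    return bool(FRAMESET_RE.search(html))
```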

Wednesday, July 19, 2006

That’s not a cache... that’s an archive!

This morning I stumbled across a March 2006 blog posting by Danny Sullivan entitled “25 Things I Hate About Google”. Sullivan’s opinions carry a lot of weight in the search engine world, and so I started to sweat when I saw number 9 on his list:

9. Stop caching pages: I was all for opt-out with cached pages until a court gave you far more right to reprint anything than anyone could have expected. Now you've got to make it opt-in. You helped create the caching mess by just assuming it was legal to reprint web pages online without asking, using opt-out as your cover. Now you've had that backed up legally, but that doesn't make it less evil.

Sullivan doesn’t agree with the January 2006 Nevada federal court ruling that declared Google’s cached pages did not constitute copyright infringement, thereby okaying the opt-out policy used by search engines using the noarchive meta-tag. Sullivan and others make some good points in the forum discussing the ruling, showing where the ruling may have some flaws.

One of the arguments opponents of the ruling make is that a search engine cache is hardly a cache in the traditional sense because pages are cached long after they are changed or deleted from a web server. One of the posts by mcanerin gives an example of a web page that had been cached for almost 2 years (the example is no longer accessible). In mcanerin’s words: “That's not a cache, it's an archive.”

The fact that the cache is more like an archive is exactly what makes it beneficial to most Web users, and that’s why I think the court’s judgment was fair. Search engine caches are a huge public good. Caching is not evil. Yes, there may be a few scenarios where caching may not work to everyone’s benefit, but in most cases the good far outweighs the bad. As long as search engines provide a mechanism to keep crawled content from being cached and to remove cached content immediately if needed, then there is no really compelling reason to force search engines to use an opt-in policy. (Yes, I know it can be a real pain to manually remove entries from many search engines, but how often does anyone really need to do that?)

My research on digital preservation of websites relies heavily on the wide-spread use of search engine caching, and if caching turns from an opt-out to opt-in, I am going to be in serious trouble, and so are users of Warrick. I’ll be keeping my eye on this…

Immediate action required now

I was just reminded this week of the dangers of pornography when a friend of mine shared his personal struggles with it. It is ripping his life apart. There’s no doubt about it… this stuff is poison. It will poison your relationship with your spouse, your friendships with members of the opposite sex, and your soul. Don’t mess with it.

There’s a really good article about Steve Holladay and his struggles with pornography addiction in the Christian Chronicle (April 2006). Steve talked to our church one night about struggles with pornography and his ministry to reach out to youth who struggle with pornography addiction. (Steve was finishing his Ph.D. here in town at Regent University.)

One point the article made that I will share here is that pornography today is a much more dangerous beast than it was just fifteen years ago, and for that reason, it deserves your attention now. The three A’s (accessibility, affordability, and anonymity) illustrate this change. Fifteen years ago you had to physically visit a store selling pornography, pay for it, and reveal your identity. Today you can access pornography for free in your home or office and remain totally anonymous. And even if you don’t have any intention of viewing pornography, it is emailed to you daily and appears in search engine results. You just can’t get away from it.

If you are concerned about your spiritual health and know that pornography is a strong temptation for you, there is absolutely no reason why you should not be using a filtering service like the American Family Filter. Of course filtering software isn’t going to block everything, and I’d recommend you go one step further and use accountability software like Covenant Eyes.

We owe it to ourselves and to our spouses to keep our conscience and minds clear of sexual immorality. God doesn’t ask anything less of us, and He has promised to give us strength to overcome it.
"He gives strength to the weary and increases the power of the weak."
- Isaiah 40:29

"And God is faithful; he will not let you be tempted beyond what you can bear. But when you are tempted, he will also provide a way out so that you can stand up under it."
- I Corinthians 10:13

Monday, July 03, 2006

Thinking Differently...

This past Thursday, Michael, Joan, and I gave a talk entitled “Thinking Differently about Web Page Preservation” at the National Digital Library Center (NDIIP briefing at Library of Congress in D.C.). Butch Lazorchak was our liaison. It was a great experience to give a talk in D.C. about my research. It also gave me an excuse to visit my sister, see some of the sights, and watch a Nationals game.

Update on 7/20/06:

The webcast is available from the Library of Congress Webcast page. My part runs from 16:50 - 47:15.

Friday, June 23, 2006

Heritrix - An archival quality crawler

This week I’ve been experimenting with Heritrix, the Internet Archive’s web crawler. It has some functionality that Wget doesn’t provide including:
  • limiting the size of each file downloaded
  • allowing a crawl to be paused and the frontier to be examined and modified
  • following links in CSS and Flash
  • crawling multiple sites at the same time without invoking multiple instances of the crawler
  • storing crawls in ARC files
Since Heritrix was built with Java and was pre-configured to run on a Linux system, I didn’t have to expend much effort to get it to run on Solaris. I untarred the distribution file, set a couple of environment variables, started the web server interface, and boom it was working.

The interface is not exactly intuitive, and a nearly complete reading of the manual is required to put together a decent crawl. Of course if you want to use sophisticated open-source software, you usually have to put in some significant effort to get it to work right. Thankfully, several of the developers (Michael Stack, Igor Ranitovic, and Gordon Mohr) have been very helpful in answering some of my newbie questions on the Heritrix listserv.

In learning about Heritrix, I’ve put together a page on Wikipedia. Hopefully the entry will drum up more general interest in Heritrix as well. I was really surprised no one had created the page before.

Tuesday, June 20, 2006

Integer problems for the Google API

I’m not sure when it first started, but the Google API has been bombing out over the last few months whenever a query returns more than 2^31 (2,147,483,648) results. The API has bombed out almost every day in June when my script searches for “database” and “list”, which each return several billion results. Apparently Google’s SOAP interface uses a 32-bit integer for the total result count when it needs to be using a 64-bit long integer.
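The overflow is easy to demonstrate. Here’s a quick illustration (my own code, obviously, not Google’s) of what happens when a multi-billion result count is squeezed into a signed 32-bit integer:

```python
def to_int32(n):
    """Wrap an arbitrary integer into a signed 32-bit value, C-style."""
    n &= 0xFFFFFFFF
    return n - 0x100000000 if n >= 0x80000000 else n

# A 20.7-billion result count comes back as nonsense once truncated
print(to_int32(20_700_000_000))
```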

Michael Freidgeim made note of the problem on his blog a few weeks ago. Others have noticed this problem going back to April 2006. Who knows when Google will make a fix. If it's not one thing, it's something else... ;)

When searching to see when Google started using the larger total results, I came across a posting by Danny Sullivan that shows how he was attempting to use a “trick” to reveal how many pages Google has indexed. Danny suggested issuing a query that says, “give me all the pages that don’t have the word asdkjlkjasd.” I just tried -asdkjlkjasd on Google, and it gives me back 20.7 billion results. MSN gives around 5.2 billion results, but Yahoo and Ask won’t accept the query. Interesting…

frankmccown.com

I recently created my own website frankmccown.com using Microsoft Office Live. Since it was free to register the domain name and host the site, I thought I might as well give it a try. Thanks, Microsoft, for giving me a free site. The only problem I have is actually editing it. Microsoft tried to make the interface easy for business users to create a website. Unfortunately, they created an interface that is impossible to use for those of us who want to actually edit HTML. Where is the “edit HTML” button?!

I emailed the Microsoft Office Live folks to see if they could tell me how to edit the HTML, and they replied:
Microsoft Office Live does not support stand alone HTML code designing. HTML code designing can be accomplished using Microsoft Office FrontPage 2003. Currently, only the Microsoft Office Live Essentials subscription supports publishing the website through Microsoft Office FrontPage 2003.
Looks like you have to pay if you want to be able to change the underlying HTML, and then you have to use FrontPage. Blah... Looks like frankmccown.com will not be getting much attention in the near future.

Friday, June 16, 2006

End of the Google 502 errors?

Google users have sporadically seen Google 502 (bad gateway) errors the last several years. The errors appear momentarily and then disappear. I’ve linked to a few postings about it according to date:

Mar 2003
July 2003
June 2005
Sept 2005
Nov 2005
Feb 2006
May 2006

Google API users have seen the 502 errors much more frequently:

Nov 2005, and another
Dec 2005
Jan 2006
Feb 2006, and another
May 2006

From my investigations, it looks like Nov 2005 is when the problems began. I have personally dealt with the problem ever since Mar 2006 when I integrated the Google API into Warrick. I had to add some logic to sleep for 15-20 seconds when encountering the error and then re-try.
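The retry logic amounts to only a few lines. Roughly (a sketch with made-up names, not the actual Warrick code):

```python
import random
import time

def query_with_retry(do_query, max_attempts=5):
    """Run a query, sleeping 15-20 seconds and retrying on a 502 error."""
    for _ in range(max_attempts):
        status, result = do_query()
        if status != 502:
            return result
        time.sleep(random.uniform(15, 20))   # back off, then try again
    raise RuntimeError("gave up after repeated 502 errors")
```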

In late May I started a new experiment which uses the Google API, and I’ve been monitoring it daily to see how many 502 errors I was receiving. From late May to June 6, I consistently received a 502 error for about 30% of my requests. On June 7, the number of 502s went down to zero. Since then I have received only an occasional 502 out of the hundreds of requests made daily.

Someone at Google finally got sick of the bad press and made some changes, and I’m thankful for it. :)

Thursday, June 15, 2006

JCDL 2006 - day 3

I’m back from JCDL. Overall I really enjoyed the conference, and I’m really hoping to attend next year’s conference in Vancouver, British Columbia. Mon and Tues were packed with activities, but everything was wrapped up on Wed morning. A few highlights from the remainder of the conference:

  • The dinner Tues night at Fearrington Village was fantastic; I haven’t eaten that much pork in a long time. Everyone seemed to have a good time conversing. Johan took the top poster award, and Lagoze et al. took the top paper award.

  • Wed morning I enjoyed Johan’s presentation on aggregating and analyzing scholarly usage data, although I had seen the presentation before at ODU. Some of the other presentations were well done, but I admittedly zoned out during a lot of them and went through my email. I could tell many others around me were doing the same, giving only partial attention to the presentations. That’s got to be annoying for the presenters.

  • After the conference I was hoping to do a quick run over to see Duke’s campus, but it was pouring down rain. So Michael, Martin, and I headed back to Norfolk right after lunch. It was nice getting back and seeing Becky. Although the Carolina Inn was fantastic, there’s really no substitute for home.


Technorati Tag:

Tuesday, June 13, 2006

JCDL 2006 - day 1 and 2

I’m at JCDL 2006, hosted at UNC in Chapel Hill, NC. Michael brought Joan, Martin, and me down for the conference, and so far I’ve really enjoyed it. UNC has a really nice campus, and the Carolina Inn is deluxe. ;) Joan and I presented our dissertation abstracts at the Doctoral Consortium on Sunday, and I was able to get a few helpful suggestions. Ray Larson and the other faculty treated us to a fantastic dinner at Top of the Hill that evening. Since Michael is a co-chair of JCDL, we were unable to submit a paper, but that just means I can relax and enjoy the conference.

Here are a few highlights so far:
  • In the opening talk Monday morning, Daniel Clancy, Engineering Director of Google Book Search, talked about Google’s efforts to digitize and index books from the G5, the five libraries that are cooperating with the digitization process. It was a very informative talk, and I certainly applaud Google for taking on such a massive and important project.

  • Andrew McCallum presented a paper about leveraging topic analysis and introduced rexa.info, a website like Google Scholar that displays published papers. The cool thing is how they also show co-authorship, authors that you cite, and authors that cite you. They just had 2 of my papers indexed, but I guess that isn’t bad for a research project.

  • Carl Lagoze presented a paper that honestly addressed some of the shortcomings of the “low barrier” implementation of the NSDL. Turns out the implementation is rather people-intensive: problems include content providers unwilling to provide quality metadata and improperly implementing OAI-PMH. There was one notable absence from the references. At least one of the audience members publicly admitted being depressed at the current situation. I also do wonder about the future of a digital library that can’t scale without an enormous amount of people-intensive work. How do you build a DL that in many ways is competing with Google?

  • Johan gave a very in-your-face poster presentation: “Have any of you wondered about your funky JCDL reviews from last year?” Johan’s poster showed how the reviewers from last year’s JCDL were not reviewing papers based solely on their expertise. So why were non-experts judging papers that weren’t in their domain?

  • Bill Arms introduced me to Andreas Paepcke, a researcher at Stanford who works with WebBase/WebVac. Looks like they are making all their crawls available to other researchers who want them, but they won’t work for my website reconstruction research since it depends on real-time search engine content.

  • I talked some with Alesia Zuccala who presented her work with LexiURL, a piece of software written by Mike Thelwall. LexiURL uses the Yahoo API to report backlinks for a set of URLs. I really enjoy reading Thelwall's papers and hope to meet him at some point.

  • This morning Jonathan Zittrain gave a very entertaining and informative presentation about redaction, restriction, and removal of open information. It was one of the best presentations that I’ve seen, and his PowerPoint presentation was a fantastic example of how to put together a presentation. Even Tufte would have approved. One of the most memorable slides showed the accidental grouping of two books on Amazon.com: a children’s book with “American Jihad”.
Tonight we’re being bussed out to Fearrington Village, home of the “oreo cows” for a pig pickin’. Yum.

Technorati Tag:

Thursday, June 08, 2006

Yahoo - Error 999

Yesterday I finally received the coveted “Error 999” page from Yahoo:
Sorry, Unable to process request at this time -- error 999.
Unfortunately we are unable to process your request at this time. This error is usually temporary. Please try again later.
If you continue to experience this error, it may be caused by one of the following:
  1. You may want to scan your system for spyware and viruses, as they may interfere with your ability to connect to Yahoo!. For detailed information on spyware and virus protection, please visit the Yahoo! Security Center.

  2. This problem may be due to unusual network activity coming from your Internet Service Provider. We recommend that you report this problem to them.
While this error is usually temporary, if it continues and the above solutions don't resolve your problem, please let us know.
Just like Google, Yahoo appears to also be monitoring for high volume traffic/automated requests and denying access for a period of time from infringing IP addresses.

I have a couple of scripts that make 300 queries per day to Yahoo using their web interface. 126 of my queries received the error yesterday, and 125 today. The scripts ran for 11 days before being detected. You’d think 300 queries wouldn’t be enough to trigger the response! I’m also making the same 300 queries using the API to see what the difference is in their responses.

I’ve seen others complain of the 999 error dating back to April 2004, but this is the first time I have personally experienced it. Murray Moffatt shares his experience with the error and some possible fixes. Basically all you can do if you are running a script and encounter the page is to sleep for several minutes and try again.

Update: 6/9/06

Today I increased the wait time between each query to a random number of seconds between 3 and 8. I also ran the script at 8:00 am EST instead of 2:00 am EST to see if blending into the crowd helped at all. Today I received 133 error 999s. Not good. Possibly I'm being punished because I'm making requests at a high-volume time. Next: increase the wait time to 15-20 seconds between each query.
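The pacing change itself is trivial; something like this sketch (hypothetical names):

```python
import random
import time

def run_queries(queries, run_one, min_wait=15, max_wait=20):
    """Issue queries one at a time with a random pause between each,
    hopefully staying under the radar of Yahoo's rate limiter."""
    results = []
    for q in queries:
        results.append(run_one(q))
        time.sleep(random.uniform(min_wait, max_wait))
    return results
```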

Windows Live Academic Search

Microsoft launched Windows Live Academic Search (what I call Live Academic for short), a competitor to Google Scholar, a couple of months ago (Apr 11, 2006 to be precise). According to their FAQ, they are harvesting material from open archives (like arXiv) using OAI-PMH. This is a different strategy than Google’s; Google mainly indexes papers found on the Web. A rather detailed article by Barbara Quint about Live Academic discusses how Microsoft learned from Google’s experiences and how Google is not feeling threatened by this newest entry in the search webosphere. Quint was impressed by the “very polished look” of Live Academic. I gave it a try, and here’s what I have to say about it:

Things I liked:
  • Display of the abstract and other metadata for the article on the right side of the screen.

  • The ability to click on an author’s name to search for other works by that author.

  • Support for BibTeX and EndNote.

  • The ability to sort by author, date, journal, and conference.

Things I did not like:
  • The attempt to produce a “snazzy” interface (using Ajax) which has a scroll bar that jumps around from time to time with no apparent explanation. Also if I used IE, it was almost impossible to highlight and copy text. Surprisingly Firefox on Windows had no such problem.

  • No advanced search. You can’t limit the search results to just computer science or search by author, journal, title, etc.

  • When using IE, the Back button on the browser frequently does not return to the previous page. Firefox on Windows sometimes also exhibited this behavior.

  • Intermittent problems with searching. For example, searching for "mod_oai" results in nothing being found. But if I search for “apache module for metadata harvesting”, the paper with “mod_oai” in its title appears. Yet a search for just “apache module for” again finds nothing. (Correction: these problems appear to have been fixed overnight.)

  • Searching for authors with a middle initial can be problematic. A search for "michael l. nelson" (with quotes) seems to accurately locate many of Michael's publications. But if you click on "Michael L. Nelson" in one of the results, a search is made for authors matching "Nelson, M" which produces many false-positives.

  • I could not find a single one of my publications even though several of them are in arXiv. (Correction: this morning several of them now appear to have been indexed including my thesis.)

Overall I'd say stick to Google Scholar for now. But as Microsoft appears to be making some major improvements (literally overnight), my list of “didn’t likes” is bound to get much shorter.

Tuesday, June 06, 2006

Graphs in R

The last week or so I’ve been trying to learn the R programming language. The language was named R after the authors’ names, a really poor choice since it makes it almost impossible to search the Web for R-related web pages. One of my colleagues once stated that the R stands for “razor” as in what you feel like using on your wrists when trying to learn R! I have to agree: the intro material they supply is OK for learning a few basics, but I have yet to come across anything that shows all the basics of producing simple line and bar graphs. And I’m amazed by the poor examples in the user-contributed documents that give a little code with no pictures, or pictures so small you have to magnify the image x10 to see anything. Therefore I have created my own intro to producing simple graphs in R. It’s by no means complete, but it’s much better than anything I’ve found.

Monday, June 05, 2006

Getting external backlinks from Google, Yahoo, and MSN

It’s often useful to know how many external backlinks are pointing to a particular URL. This metric can be used to partly determine a page’s popularity on the Web. A good tool for automating this process is Backlink Analyzer, which queries the Google, MSN, and Yahoo APIs using the “link:” command. The software allows a user to specify sites to ignore in the backlink counts, a useful function since the link: command returns both external and internal backlinks for all three search engines.

I don’t know for sure if Backlink Analyzer is doing this or not, but for Yahoo and MSN, it is possible to issue a single query that returns only external backlinks by combining the link: and -site: parameters. For example, the following query will show all the pages pointing to my Warrick page for Yahoo and MSN:

link:http://www.cs.odu.edu/~fmccown/research/lazy/warrick.html -site:www.cs.odu.edu

Google will not handle the -site: parameter successfully in this query, although it does handle it in other types of queries.
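This kind of query is easy to generate programmatically. A sketch (the helper name is mine):

```python
from urllib.parse import urlparse

def external_backlink_query(url):
    """Build a Yahoo/MSN query for backlinks to url, excluding its own site."""
    host = urlparse(url).netloc
    return f"link:{url} -site:{host}"
```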

Tuesday, May 30, 2006

Sacred Marriage

This weekend Becky took a trip out to Memphis to visit her family, and Sara came down to visit from DC, bringing along a friend. We hit the beach and did some camping, i.e., I got absolutely no work done. It was really nice to spend some time with my sister (although I can't wait to see my wife!).

What I did have time to do was read about half of Gary Thomas’ Sacred Marriage. Becky had read it recently and recommended I also read it. So far I have to say it’s a really great book for someone who is considering marriage or has been married for some time. Instead of looking at marriage as a “what can I get out of it?” point-of-view, Thomas tells his readers that marriage is an opportunity to improve who we are and love the way God loves us. Rather than make us happy, marriage is designed by God to make us more holy.

Here are a couple of passages I underlined this weekend:
Everything I am to say and do in my life is to be supportive of this gospel ministry of reconciliation, and that commitment begins by displaying reconciliation in my personal relationships, especially in my marriage.
Christians can command attention simply by staying married.
We can never love somebody “too much.” Our problem is that typically we love God too little.
As Betsy and Gary Riucci point out, “Honor isn’t passive; it’s active. We honor our wives by demonstrating our esteem and respect: complimenting them in public, affirming their gifts, abilities, and accomplishments; and declaring our appreciation for all they do. Honor not expressed is not honor.”
It is guaranteed that your spouse will sin against you, disappoint you, and have physical limitations that will frustrate and sadden you… This is a fallen world… You will never find a spouse who is not affected in some way by the reality of the fall.
I wouldn’t be surprised if many marriages end in divorce largely because one or both partners are running from their own revealed weaknesses as much as they are running from something they can’t tolerate in their spouse.
I seriously recommend this book to men and women who want to see more clearly what their marriage is all about. If we took these lessons to heart, it would make a huge difference in the lives of many.

OA debate - Eysenbach and Harnad

I’ve been following a rather lively debate on the American Scientist Open Access Forum between Gunther Eysenbach (a professor at the University of Toronto and editor-in-chief of JMIR), and Stevan Harnad (a professor at the University of Southampton and Open Archives "archivangelist"). Eysenbach published an article that showed the citation benefits of OA publishing: OA articles (articles which are freely accessible to the public) in Proceedings of the National Academy of Sciences (PNAS) were more than twice as likely to be cited one year later than non-OA articles (articles that must be paid for to access) published in PNAS.

Although Eysenbach and Harnad are both OA proponents, what appears to have stirred up the trouble was that Eysenbach’s article criticized several of the studies that Harnad was involved in (and failed to point to two recent studies), pointing out that they lacked a certain amount of statistical rigor and had some inherent fallacies. Eysenbach gives a detailed account on his website about the methodology of his paper which used multivariate analysis to account for known confounders (variables which are strongly associated with the outcome of interest) like the number of co-authors of a paper. Eysenbach argues that if a paper has multiple authors, it is more likely to be self-archived (green OA- see below). This is intuitively true (my paper on search engine coverage of the OAI-PMH corpus was self-archived by Xiaoming before I even gave it a second thought). But a paper is also more likely to be cited if it has more authors since each author is vested in citing their work. It’s also possible papers with multiple authors are of higher caliber (and hence will get cited more often) since there were more heads looking at the problem. Other factors like this one definitely need to be considered when trying to determine if OA is causing the increase in citations or not.

A big part of the argument stems around what is OA. There are two different flavors:
1. green OA - articles (including dissertations and preprints) are published in closed-access journals but are self-archived in an OA repository/archive or personal website. Green journals explicitly allow authors to self-archive their work.
2. gold OA – articles are published in OA journals where they are immediately accessible to the public for free. A gold journal may make all articles freely accessible or make only certain articles freely accessible by charging a fee to the author (which is usually paid by the author's institution or research foundation).

Although green OA is currently the most popular form of OA (5% gold, 90% green), it is sometimes difficult to test for since it’s possible an author will make their article publicly accessible the day it is accepted for publication or months after it’s been published. Gold OA is easier to test since the status is determined the first day it is published. Eysenbach tested for gold vs. green to see if papers that were self-archived but had closed access were any more likely to be cited than articles that were gold OA (it’s not clear how he discovered if a paper was self-archived; maybe he searched Google or maybe there was a way for an author to indicate if the paper was self-archived). He found that “self-archiving OA status did not remain a significant predictor for being cited.” This point appears to have also really bothered Harnad about the study.

I’ve learned a lot about OA from this debate. I just wish there was a little less animosity (zealousness?) from both sides. It’s a he-said/I-didn’t-say exchange which is now well documented on a public email forum which is archived on the Web, a blog, and in a letter to the editor: a prime example of how scientists air their differences today.

By the way, I just came across a really cool slide illustrating the access-impact problem between the Harvards and the have-nots (nice pun!) on page 4 of Leslie Chin’s slides.

Thursday, May 25, 2006

Google limiting researchers to 1000 queries

I recently read a poster from ISSI 2005 entitled “Google Web APIs - an Instrument for Webometric Analyses?” The poster was written by Philipp Mayr and Fabio Tosques to introduce the Google API to webometric researchers. They ran several experiments to demonstrate that the API was useful. One experiment queried Google’s web interface and API with the term “webometrics” over 240 days. Their results showed a huge difference between the web interface and the API which made me wonder how you can consider an API useful if it gives you far different responses from what the rest of the world is seeing.

In their conclusion, Mayr and Tosques reported a limit of 10,000 requests per day. Google only allows 1000, so I emailed Mayr to see why they reported 10,000. He replied that Google would give researchers more queries, but when I emailed api-support@google.com requesting a bump up, they replied with this:
Due to overwhelming demand, we are no longer accepting requests for additional queries or for commercial use permission.
So researchers are in a quandary: use Google’s public web interface to perform searches, which frequently (in my experience) leads to being blacklisted for hours at a time (even when fewer than 1000 daily queries are being made), or use the buggy API (502 errors are common) with its 1000-query daily limit, which returns very different results than those obtained through the web interface.

Inspired by this dilemma, I have decided to put the APIs from Google, MSN, and Yahoo to the test. I am running a series of experiments comparing what the APIs return to what the web interfaces return. I’m hoping this will result in something that will give researchers a little more information on how to go about using search engines in their experiments and what to expect when using the APIs. Now if I can just find a free server that I can use to make requests for a few months…
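When the comparison data starts coming in, one simple way to summarize how far the API drifts from the web interface is an average relative difference in reported hit counts. Here is a sketch; the fetching code is omitted, and the sample counts below are made-up numbers, not measured values:

```python
def relative_difference(web_count, api_count):
    """Relative difference between two reported hit counts,
    using the web interface's count as the baseline."""
    if web_count == 0:
        return 0.0 if api_count == 0 else 1.0
    return abs(web_count - api_count) / web_count

def mean_divergence(samples):
    """Average relative difference over (web, api) count pairs,
    e.g. one pair per day of a long-running comparison."""
    return sum(relative_difference(w, a) for w, a in samples) / len(samples)

# Hypothetical daily hit-count pairs for a single query:
samples = [(44100, 9200), (45000, 9800), (43800, 8700)]
print(round(mean_divergence(samples), 3))
```

A divergence near 0 would mean the API tracks the public interface closely; a value near 1 would mean its counts are a small fraction of what everyone else sees.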

Wednesday, May 24, 2006

MSN malware error screen


Looks like MSN is being targeted by hackers. I got the message below when I tried searching MSN for link:http://forums.absoft.com/viewtopic.php?pid=1932
We are seeing an increased volume of traffic by some malware software. In order to protect our customers from damage from that malware, we are blocking your query. A few legitimate queries may get flagged, and for that we apologize. Please be assured that we are hard at work on this problem and hope to get it resolved even better as soon as possible.
If you are using phpBB, please check out the phpBB downloads site http://www.phpbb.com/downloads.php and make sure you are not vulnerable.
- MSN Search Team

I did a search on Google to find out more, and apparently this has been seen by others:

Jan 2006:
http://www.emailbattles.com/archive/battles/vuln_aacgfbgdcb_jd/
http://forums.digitalpoint.com/showthread.php?t=47620
http://www.webmasterworld.com/forum97/716-3-10.htm

Feb 2006:
http://forum.abestweb.com/showthread.php?t=69268

May 2006:
http://www.webproworld.com/viewtopic.php?t=63478

I previously reported on this problem with Google. Hopefully MSN is not going to get as aggressive as Google about denying service to automated queries.

Tuesday, May 09, 2006

Server encoding caching experiment

To determine if my server-side component encodings could be inserted into indexable/cacheable HTML files, I ran a little experiment. I created 3 HTML files, each containing encoded chunks in HTML comments at the bottom of the file:

html_encoded1.html - 2 KB
html_encoded2.html - 45 KB
html_encoded3.html - 99 KB

If you view the source of the pages, you’ll see something like this at the end:

<!-- BEGIN_FILERECOVERY
chunks = 4
filename = xor.o
recover = 2
orig_size = 1105
block_size = 554
block_num = 3

fY/xaGQn0V5MOOpLnM1WIsIUMirrVBQ2XNhidvc5yjL9tEyKTmNjNPjcrJzcPWvs INxxHl1Gt5lKQAYoNi1DXOhFI5ExBm15Nxx1T/hFCwVvsyaHsQQdd3lcqWJl+WTw BTlkiI8yWcPPoy38dqgTVnc4aSNd+0YQWW0bDl67/6XTnych3rSXn5YEYhVMU2eS LCR/0N4pAhKgeMb7SXtdJNQ6WykqDXYJAjtTOIrT2CLaPNRdKbU/ydsvUSDenSt+

Etc…
END_FILERECOVERY -->
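For the curious, a chunk block in this format could be produced with something like the following sketch. Plain base64 stands in here for the actual server-side erasure encoding, and the block_size split is illustrative, not the real scheme:

```python
import base64
import textwrap

def make_recovery_comment(filename, data, block_num, chunks=4, recover=2):
    """Wrap one encoded chunk of data in a FILERECOVERY HTML comment.
    base64 stands in for the real server-side encoding."""
    block_size = -(-len(data) // recover)  # ceiling division (illustrative)
    payload = "\n".join(
        textwrap.wrap(base64.b64encode(data).decode("ascii"), 64))
    return (f"<!-- BEGIN_FILERECOVERY\n"
            f"chunks = {chunks}\n"
            f"filename = {filename}\n"
            f"recover = {recover}\n"
            f"orig_size = {len(data)}\n"
            f"block_size = {block_size}\n"
            f"block_num = {block_num}\n\n"
            f"{payload}\n"
            f"END_FILERECOVERY -->")

print(make_recovery_comment("xor.o", b"some object file bytes", 3))
```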
I placed these files in my public_html folder on April 19 and linked to them from my index.html page. Today I checked Google, MSN, Yahoo, and Ask to see if any of them were cached. Here are the results:

Google – cached all three
MSN – cached 1 and 2
Yahoo – indexed 2 only (not available in their cache)
Ask – nada

To see if Google can handle any more, I have created 4 new files of 150, 200, 250, and 300 KB. Looks like 99 KB is too large for MSN. Yahoo’s cache is really inconsistent: maybe 2 is in there, maybe it’s not. Why didn’t they grab 1?

I’ll check back in a couple of weeks and see if anything else has been cached.
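Verifying that an engine preserved a comment intact amounts to pulling the block back out of the cached copy and checking its header and payload. A sketch of that extraction step:

```python
import re

def extract_recovery_block(html):
    """Return (header_fields, base64_payload) for the first
    FILERECOVERY comment in html, or None if it didn't survive."""
    m = re.search(r"<!-- BEGIN_FILERECOVERY\n(.*?)\nEND_FILERECOVERY -->",
                  html, re.DOTALL)
    if m is None:
        return None
    # Header and payload are separated by a blank line.
    head, _, body = m.group(1).partition("\n\n")
    fields = dict(line.split(" = ", 1) for line in head.splitlines())
    return fields, body.replace("\n", "")
```

Run against the HTML each engine serves from its cache, this tells you whether the chunk came through whole or was truncated.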

Update: 6/20/06

Google and MSN have cached all files that range up to 300 KB. Yahoo has only indexed the first 3 (none are cached), and Ask has nothing.

Now I'm going to create a 400 KB, 500 KB, and 1 MB file and see what happens.

Update: 2/21/07

The cache limits for the search engines appear to be the following: Google - 977 KB, Yahoo - 214 KB, and MSN - 1 MB. I still cannot tell for sure what Ask's limit is, but in one experiment I found 984 KB cached for a document that was 1.6 MB. Google's limit has been confirmed by others.

Yahoo Site Explorer

I just discovered Yahoo’s Site Explorer, which was apparently released in September 2005. The tool lets you see which pages of a site are currently indexed by Yahoo, along with the inlinks to a particular page. For example, I can see that Yahoo currently has around 1400 URLs indexed from my ODU website and that there are 19 inlinks pointing to the Warrick page. There is an API for accessing the service, so page scraping is unnecessary. Now if only we could get Google to provide a similar service!
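Using the API is just a matter of building a REST request. A sketch, with the caveat that the endpoint and parameter names here are my recollection of Yahoo's documentation and should be checked against it; YOUR_APP_ID is a placeholder for a registered application ID:

```python
from urllib.parse import urlencode

# Endpoint name is recalled from Yahoo's docs; verify before relying on it.
BASE = "http://search.yahooapis.com/SiteExplorerService/V1/inlinkData"

def inlink_request_url(page_url, appid="YOUR_APP_ID", results=50):
    """Build the REST request URL for inlinks to page_url."""
    return BASE + "?" + urlencode({"appid": appid,
                                   "query": page_url,
                                   "results": results})

print(inlink_request_url("http://example.org/warrick/"))
```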

Monday, May 01, 2006

shiri-maimon.org is hacked

This weekend I was contacted by a fan of Shiri Maimon, an Israeli pop singer, who wanted to reconstruct shiri-maimon.org. This site was hacked recently, and all the files were deleted. The webmaster placed this explanation on the website:
April 20, 2006

This website has been hacked by someone. They have deleted everything, and I have decided that the website will NOT be back online. The reason for this is, that the people who did this, they hacked in just to delete everything. Which means, if I got everything back up and running - they could delete it the day after again. And I don't want to waste my time on that. Besides, the most important stuff, such as the forum, gallery and news is lost - and can't be restored. My host refuse to help me - even though they have the back-up files. So tomorrow I will cancel the domain. They say that they have the back-up files, but they can only re-upload everything if the files were lost during a server crash :-s They're practically writing to me, as if I deleted everything myself. They don't seem to get, that someone freakin' hacked the site! Like I would delete everything myself anyway :-s

You must all know by now, that I have spent endless hours - even weeks and months on this website. I'm very sorry to end the website I loved the most this way. It honestly breaks my heart. I feel really bad for both Shiri and the fans. I only tried to show my appreciation and wanted to spread the word about her. Apparently someone couldn't take that, and decided to ruin it for all of us. And they call themselves fans. Hah! Thanks a lot, whoever you are. I would like to thank all of you who kept visiting and coming back. It really meant a lot to me. Keep supporting Shiri out there ~ don't let the silence remain!

~ Camilla
I’m really surprised the hosting company would not recover the files for her; I’d let everyone know of my disappointment with the company. Luckily, many of the pages are still in Google’s cache, and I am glad Warrick will help get the site back.

It's becoming very apparent to me that third-party reconstruction is one of Warrick's primary uses: if you don't personally have a backup, it's the only way you're going to get a site back.
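The core idea behind this kind of reconstruction is simple: for each URL of the lost site, ask a search engine's cache for a copy. A minimal sketch of just the request-building step, using Google's 2006-era cache: query syntax (the real Warrick also handles multiple engines, rate limits, and link extraction):

```python
from urllib.parse import quote

def google_cache_url(page_url):
    """Query URL for Google's cached copy of page_url
    (2006-era cache: syntax)."""
    return "http://www.google.com/search?q=cache:" + quote(page_url, safe=":/")

# One such request per lost page of the site being rebuilt:
print(google_cache_url("http://shiri-maimon.org/index.html"))
```

Fetching each cached copy, saving it, and following its links to discover more of the site's URLs is then a crawl over the cache rather than over the (now empty) site itself.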