Tuesday, May 30, 2006

Sacred Marriage

This weekend Becky took a trip out to Memphis to visit her family, and Sara came down to visit from DC, bringing along a friend. We hit the beach and did some camping, i.e., I got absolutely no work done. It was really nice to spend some time with my sister (although I can't wait to see my wife!).

What I did have time to do was read about half of Gary Thomas’ Sacred Marriage. Becky had read it recently and recommended I read it as well. So far I have to say it’s a really great book for anyone who is considering marriage or has been married for some time. Instead of looking at marriage from a “what can I get out of it?” point of view, Thomas tells his readers that marriage is an opportunity to improve who we are and to love the way God loves us. Rather than make us happy, marriage is designed by God to make us more holy.

Here are a couple of passages I underlined this weekend:
Everything I am to say and do in my life is to be supportive of this gospel ministry of reconciliation, and that commitment begins by displaying reconciliation in my personal relationships, especially in my marriage.
Christians can command attention simply by staying married.
We can never love somebody “too much.” Our problem is that typically we love God too little.
As Betsy and Gary Ricucci point out, “Honor isn’t passive; it’s active. We honor our wives by demonstrating our esteem and respect: complimenting them in public; affirming their gifts, abilities, and accomplishments; and declaring our appreciation for all they do. Honor not expressed is not honor.”
It is guaranteed that your spouse will sin against you, disappoint you, and have physical limitations that will frustrate and sadden you… This is a fallen world… You will never find a spouse who is not affected in some way by the reality of the fall.
I wouldn’t be surprised if many marriages end in divorce largely because one or both partners are running from their own revealed weaknesses as much as they are running from something they can’t tolerate in their spouse.
I seriously recommend this book to men and women who want to see more clearly what their marriage is all about. If we took these lessons to heart, it would make a huge difference in the lives of many.

OA debate - Eysenbach and Harnad

I’ve been following a rather lively debate on the American Scientist Open Access Forum between Gunther Eysenbach (a professor at the University of Toronto and editor-in-chief of JMIR), and Stevan Harnad (a professor at the University of Southampton and Open Archives "archivangelist"). Eysenbach published an article that showed the citation benefits of OA publishing: OA articles (articles which are freely accessible to the public) in Proceedings of the National Academy of Sciences (PNAS) were more than twice as likely to be cited one year later than non-OA articles (articles that must be paid for to access) published in PNAS.

Although Eysenbach and Harnad are both OA proponents, what appears to have stirred up the trouble was that Eysenbach’s article criticized several of the studies Harnad was involved in (and failed to point to two recent ones), arguing that they lacked statistical rigor and contained some inherent fallacies. Eysenbach gives a detailed account on his website of his paper’s methodology, which used multivariate analysis to account for known confounders (variables strongly associated with the outcome of interest) like the number of co-authors of a paper. Eysenbach argues that if a paper has multiple authors, it is more likely to be self-archived (green OA; see below). This is intuitively true (my paper on search engine coverage of the OAI-PMH corpus was self-archived by Xiaoming before I even gave it a second thought). But a paper with more authors is also more likely to be cited, since each author is invested in citing their own work. It’s also possible that papers with multiple authors are of higher caliber (and hence get cited more often) since more heads were looking at the problem. Factors like these definitely need to be considered when trying to determine whether OA is really what is causing the increase in citations.
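To make the confounder point a bit more concrete, here is a tiny sketch (in Python with made-up numbers; this is only an illustration, not Eysenbach’s actual analysis) of regressing citation counts on OA status with and without adjusting for the number of authors. If the OA coefficient shrinks a lot once num_authors is added, then some of the apparent “OA advantage” was really a co-authorship effect.

# Toy illustration of adjusting for a confounder (number of authors) when
# estimating the OA citation advantage.  All data values are fabricated.
import pandas as pd
import statsmodels.formula.api as smf

articles = pd.DataFrame({
    "citations":   [3, 0, 7, 2, 12, 1, 5, 4],   # citations one year later
    "is_oa":       [1, 0, 1, 0, 1,  0, 1, 0],   # 1 = gold OA, 0 = non-OA
    "num_authors": [4, 2, 6, 3, 8,  2, 5, 3],   # potential confounder
})

# Naive model: ignores the fact that OA papers may simply have more authors.
naive = smf.poisson("citations ~ is_oa", data=articles).fit()

# Adjusted model: the is_oa coefficient now reflects the OA effect after
# accounting for the number of co-authors.
adjusted = smf.poisson("citations ~ is_oa + num_authors", data=articles).fit()

print("naive OA effect:   ", naive.params["is_oa"])
print("adjusted OA effect:", adjusted.params["is_oa"])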

A big part of the argument centers on what counts as OA. There are two different flavors:
1. green OA - articles (including dissertations and preprints) are published in closed-access journals but are self-archived in an OA repository/archive or personal website. Green journals explicitly allow authors to self-archive their work.
2. gold OA – articles are published in OA journals where they are immediately accessible to the public for free. A gold journal may make all articles freely accessible or make only certain articles freely accessible by charging a fee to the author (which is usually paid by the author's institution or research foundation).

Although green OA is currently the most popular form of OA (5% gold, 90% green), it is sometimes difficult to test for, since an author might make an article publicly accessible the day it is accepted for publication or months after it’s been published. Gold OA is easier to test since its status is determined the day the article is published. Eysenbach compared gold vs. green to see if papers that were self-archived but otherwise closed access were any more likely to be cited than gold OA articles (it’s not clear how he determined whether a paper was self-archived; maybe he searched Google, or maybe authors could indicate it themselves). He found that “self-archiving OA status did not remain a significant predictor for being cited.” This point also appears to have really bothered Harnad about the study.

I’ve learned a lot about OA from this debate. I just wish there were a little less animosity (zealousness?) on both sides. It’s a he-said/I-didn’t-say exchange that is now well documented on a public email forum archived on the Web, on a blog, and in a letter to the editor: a prime example of how scientists air their differences today.

By the way, I just came across a really cool slide illustrating the access-impact problem between the Harvards and the have-nots (nice pun!) on page 4 of Leslie Chan’s slides.

Thursday, May 25, 2006

Google limiting researchers to 1000 queries

I recently read a poster from ISSI 2005 entitled “Google Web APIs - an Instrument for Webometric Analyses?” The poster, by Philipp Mayr and Fabio Tosques, introduces the Google API to webometric researchers. They ran several experiments to demonstrate that the API is useful. One experiment queried Google’s web interface and API with the term “webometrics” over 240 days. Their results showed a huge difference between the web interface and the API, which made me wonder how an API can be considered useful if it gives responses far different from what the rest of the world is seeing.

In their conclusion, Mayr and Tosques reported a limit of 10,000 requests per day. Google only allows 1000, so I emailed Mayr to see why they reported 10,000. He replied that Google would give researchers more queries, but when I emailed api-support@google.com requesting a bump up, they replied with this:
Due to overwhelming demand, we are no longer accepting requests for additional queries or for commercial use permission.
So researchers are in a quandary: use Google’s public web interface to perform searches, which frequently (in my experience) leads to being blacklisted for hours at a time (even when fewer than 1000 daily queries are being made), or use the buggy API (502 errors are common), which has only a 1000-query daily limit and returns very different results than those obtained through the web interface.

Inspired by this dilemma, I have decided to put the APIs from Google, MSN, and Yahoo to the test. I am running a series of experiments comparing what the APIs return to what the web interfaces return. I’m hoping this will result in something that will give researchers a little more information on how to go about using search engines in their experiments and what to expect when using the APIs. Now if I can just find a free server that I can use to make requests for a few months…
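Here is roughly the shape of the daily comparison I have in mind. This is just a sketch: api_hit_count and web_hit_count are hypothetical placeholders for each engine’s real API call and a carefully throttled scrape of its public results page.

# Sketch of the daily comparison: log the estimated hit count reported by each
# engine's API next to the count shown on its public web interface.
# api_hit_count() and web_hit_count() are placeholders to be filled in with the
# real API calls and (politely throttled) page scrapes.
import csv
import time
from datetime import date

QUERIES = ["webometrics", "OAI-PMH", "digital preservation"]
ENGINES = ["google", "msn", "yahoo"]

def api_hit_count(engine: str, query: str) -> int:
    raise NotImplementedError("call the engine's search API here")

def web_hit_count(engine: str, query: str) -> int:
    raise NotImplementedError("scrape the engine's public results page here")

with open("hit_counts.csv", "a", newline="") as f:
    writer = csv.writer(f)
    for engine in ENGINES:
        for query in QUERIES:
            writer.writerow([date.today().isoformat(), engine, query,
                             api_hit_count(engine, query),
                             web_hit_count(engine, query)])
            time.sleep(10)   # stay well under the daily limits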

Wednesday, May 24, 2006

MSN malware error screen


Looks like MSN is being targeted by hackers. I got the message below when I tried searching MSN for link:http://forums.absoft.com/viewtopic.php?pid=1932
We are seeing an increased volume of traffic by some malware software. In order to protect our customers from damage from that malware, we are blocking your query. A few legitimate queries may get flagged, and for that we apologize. Please be assured that we are hard at work on this problem and hope to get it resolved even better as soon as possible.
If you are using phpBB, please check out the phpBB downloads site http://www.phpbb.com/downloads.php and make sure you are not vulnerable.
- MSN Search Team

I did a search on Google to find out more, and apparently this has been seen by others:

Jan 2006:
http://www.emailbattles.com/archive/battles/vuln_aacgfbgdcb_jd/
http://forums.digitalpoint.com/showthread.php?t=47620
http://www.webmasterworld.com/forum97/716-3-10.htm

Feb 2006:
http://forum.abestweb.com/showthread.php?t=69268

May 2006:
http://www.webproworld.com/viewtopic.php?t=63478

I previously reported on this kind of problem with Google. Hopefully MSN will not get as aggressive as Google about denying service to automated queries.

Tuesday, May 09, 2006

Server encoding caching experiment

To determine if my server-side component encodings could be inserted into indexable/cacheable HTML files, I ran a little experiment. I created 3 HTML files that contained encoded chunks in HTML comments at the base of each file:

html_encoded1.html - 2 KB
html_encoded2.html - 45 KB
html_encoded3.html - 99 KB

If you view the source of the pages, you’ll see something like this at the end:

<!-- BEGIN_FILERECOVERY
chunks = 4
filename = xor.o
recover = 2
orig_size = 1105
block_size = 554
block_num = 3

fY/xaGQn0V5MOOpLnM1WIsIUMirrVBQ2XNhidvc5yjL9tEyKTmNjNPjcrJzcPWvs INxxHl1Gt5lKQAYoNi1DXOhFI5ExBm15Nxx1T/hFCwVvsyaHsQQdd3lcqWJl+WTw BTlkiI8yWcPPoy38dqgTVnc4aSNd+0YQWW0bDl67/6XTnych3rSXn5YEYhVMU2eS LCR/0N4pAhKgeMb7SXtdJNQ6WykqDXYJAjtTOIrT2CLaPNRdKbU/ydsvUSDenSt+

Etc…
END_FILERECOVERY -->
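In case anyone is curious how such a block might be produced, here is a rough sketch of appending one to a page. The header values mirror the example above, but the “encoded block” here is just base64 of raw bytes standing in for the output of my actual server-side encoding component.

# Toy sketch: append a FILERECOVERY comment block to the bottom of an HTML page.
# The real encodings come from my server-side component; this only shows the format.
import base64
import textwrap

def recovery_comment(filename, orig_size, chunks, recover, block_num, block):
    encoded = "\n".join(textwrap.wrap(base64.b64encode(block).decode("ascii"), 64))
    return ("<!-- BEGIN_FILERECOVERY\n"
            f"chunks = {chunks}\n"
            f"filename = {filename}\n"
            f"recover = {recover}\n"
            f"orig_size = {orig_size}\n"
            f"block_size = {len(block)}\n"
            f"block_num = {block_num}\n\n"
            f"{encoded}\n"
            "END_FILERECOVERY -->\n")

with open("xor.o", "rb") as src:
    block = src.read(554)            # one 554-byte block from the original file

with open("html_encoded1.html", "a") as page:
    page.write(recovery_comment("xor.o", orig_size=1105, chunks=4,
                                recover=2, block_num=3, block=block))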
I placed these files in my public_html folder on April 19 and linked to them from my index.html page. Today I checked Google, MSN, Yahoo, and Ask to see if any of them had been cached. Here are the results:

Google – cached all three
MSN – cached 1 and 2
Yahoo – indexed 2 only (not available in their cache)
Ask – nada
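The checking itself is mostly done by hand (search for each URL and click the engine’s cached link), but a throttled script along these lines can at least do the index lookups. The result-page URLs and the test-page locations below are assumptions, and matching the URL in the returned HTML is only a crude proxy for “indexed”; verifying an actual cached copy still takes a click.

# Crude sketch: see whether each test page's URL shows up on the engine's
# results page.  Engines may block scripted requests, so query sparingly.
import time
import urllib.parse
import urllib.request

TEST_PAGES = [                                   # placeholder locations
    "http://example.com/html_encoded1.html",
    "http://example.com/html_encoded2.html",
    "http://example.com/html_encoded3.html",
]

SEARCH_URLS = {                                  # result-page URLs may change
    "google": "http://www.google.com/search?q=",
    "msn":    "http://search.msn.com/results.aspx?q=",
    "yahoo":  "http://search.yahoo.com/search?p=",
}

for engine, base in SEARCH_URLS.items():
    for url in TEST_PAGES:
        html = urllib.request.urlopen(base + urllib.parse.quote(url)).read()
        found = url.encode("ascii") in html
        print(engine, url, "indexed" if found else "not found")
        time.sleep(30)                           # be polite; avoid blacklisting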

Looks like 99 KB is too large for MSN. Yahoo’s cache is really inconsistent: maybe file 2 is in there, maybe it’s not. And why didn’t they grab file 1? To see if Google can handle any more, I have created 4 new files of 150, 200, 250, and 300 KB.

I’ll check back in a couple of weeks and see if anything else has been cached.

Update: 6/20/06

Google and MSN have cached all files that range up to 300 KB. Yahoo has only indexed the first 3 (none are cached), and Ask has nothing.

Now I'm going to create a 400 KB, 500 KB, and 1 MB file and see what happens.

Update: 2/21/07

The cache limits for the search engines appear to be the following: Google - 977 KB, Yahoo - 214 KB, and MSN - 1 MB. I still cannot tell for sure what Ask's limit is, but I ran an experiment where I found 984 KB cached for a document that was 1.6 MB. Google's limit has been confirmed by others.

Yahoo Site Explorer

I just discovered Yahoo’s Site Explorer, which was apparently released in September 2005. The tool lets you see which pages of a site are currently indexed by Yahoo and the inlinks pointing to a particular page. For example, I can see that Yahoo currently has around 1400 URLs indexed from my ODU website and 19 inlinks pointing to the Warrick page. There is an API for accessing the service, so page scraping is unnecessary. Now if only we could get Google to provide a similar service!
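Out of curiosity, here is roughly what calling the inlink service might look like. The endpoint and parameter names are from memory, and the application ID and target URL are placeholders, so check Yahoo’s developer documentation before relying on this.

# Sketch of asking the Site Explorer web service for the inlinks to a page.
# YOUR_APP_ID and the query URL are placeholders.
import urllib.parse
import urllib.request
import xml.etree.ElementTree as ET

BASE = "http://search.yahooapis.com/SiteExplorerService/V1/inlinkData"
params = urllib.parse.urlencode({
    "appid": "YOUR_APP_ID",
    "query": "http://example.com/warrick/",   # page you want inlinks for
    "results": 50,
})

xml_doc = urllib.request.urlopen(BASE + "?" + params).read()
root = ET.fromstring(xml_doc)

# The top-level ResultSet element reports how many inlinks Yahoo knows about.
print("Total inlinks reported:", root.attrib.get("totalResultsAvailable"))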

Monday, May 01, 2006

shiri-maimon.org is hacked

This weekend I was contacted by a fan of Shiri Maimon, an Israeli pop singer, who wanted to reconstruct shiri-maimon.org. This site was hacked recently, and all the files were deleted. The webmaster placed this explanation on the website:
April 20, 2006

This website has been hacked by someone. They have deleted everything, and I have decided that the website will NOT be back online. The reason for this is, that the people who did this, they hacked in just to delete everything. Which means, if I got everything back up and running - they could delete it the day after again. And I don't want to waste my time on that. Besides, the most important stuff, such as the forum, gallery and news is lost - and can't be restored. My host refuse to help me - even though they have the back-up files. So tomorrow I will cancel the domain. They say that they have the back-up files, but they can only re-upload everything if the files were lost during a server crash :-s They're practically writing to me, as if I deleted everything myself. They don't seem to get, that someone freakin' hacked the site! Like I would delete everything myself anyway :-s

You must all know by now, that I have spent endless hours - even weeks and months on this website. I'm very sorry to end the website I loved the most this way. It honestly breaks my heart. I feel really bad for both Shiri and the fans. I only tried to show my appreciation and wanted to spread the word about her. Apparently someone couldn't take that, and decided to ruin it for all of us. And they call themselves fans. Hah! Thanks a lot, whoever you are. I would like to thank all of you who kept visiting and coming back. It really meant a lot to me. Keep supporting Shiri out there ~ don't let the silence remain!

~ Camilla
I’m really surprised the hosting company would not recover the files for her. If I were her, I’d let everyone know of my disappointment with the company. Looks like many of the pages are still in Google’s cache, so I’m glad Warrick will help get the site back.

It's becoming very apparent to me that third-party reconstruction is one of the primary things Warrick is useful for. If you don't personally own a backup, it's the only way you are going to get a site back.