Monday, August 04, 2008

ORE and preservation

Our network is down, so I thought this would be a good time to share exactly what I've been doing this summer at LANL. I'm working with the Digital Library Research and Prototyping team, led by Herbert Van de Sompel, which is located in the back of the Research Library. My work focuses on the preservation of ORE Resource Maps. What are ORE Resource Maps? First, some background.

The Open Archives Initiative (OAI) has created the Object Reuse and Exchange (ORE) project, which provides standards for defining and discovering aggregations of web resources. The ORE specs are currently in beta, and the 1.0 spec will be released at the end of September. You can read the Primer online, but I'll attempt to give the gist of it below.

An aggregation is a collection of web resources that make up a single, conceptual resource. For example, a scholarly publication may consist of several web resources: an HTML "splash page", a PDF version, a slideshow, and the raw data used to perform the research. An aggregation documenting a special event like 9-11 could be composed of images, video footage, news stories, and blog posts. Aggregated resources may reside on the same website, or they may be distributed across a number of websites.

While it's relatively easy for humans to determine the boundaries of an aggregation, it's extremely difficult for a computer. So the ORE Model introduces the concept of a Resource Map (ReM), a web resource that acts as an organizational unit, defining the boundaries of an aggregation and indicating the relationships between the aggregated resources. A computer can read a ReM and know for certain which resources belong together.

The figure below (taken from the Primer) shows a ReM which describes an aggregation (A-1) composed of three aggregated resources (AR-1, AR-2, AR-3). The relationships between the ReM, aggregation, and aggregated resources are indicated with RDF triples.
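Those relationships can also be written out as plain triples. The sketch below is only an illustration: the example.org URIs (including the name ReM-1) are hypothetical stand-ins for the figure's labels, and the helper function is mine, not part of ORE. The two predicates, ore:describes and ore:aggregates, come from the ORE vocabulary.

```python
# Namespace for ORE vocabulary terms.
ORE = "http://www.openarchives.org/ore/terms/"

# Hypothetical URIs standing in for the labels in the figure.
REM = "http://example.org/ReM-1"
AGG = "http://example.org/A-1"
RESOURCES = [f"http://example.org/AR-{i}" for i in (1, 2, 3)]

# Each triple is (subject, predicate, object).
# The ReM describes the aggregation; the aggregation aggregates each resource.
triples = [(REM, ORE + "describes", AGG)]
triples += [(AGG, ORE + "aggregates", ar) for ar in RESOURCES]


def aggregated_by(aggregation, graph):
    """Return the object of every ore:aggregates triple for this aggregation."""
    return [o for s, p, o in graph
            if s == aggregation and p == ORE + "aggregates"]


for s, p, o in triples:
    print(s, p, o)
```

Asking which resources belong to A-1 is then just a filter over the triples, which is the "know for certain" part that is so hard for a computer without a ReM.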



ReMs can be represented with RDF/XML, RDFa, and (most simply) with the Atom Syndication Format. An example ReM (borrowed from here, using Atom) for a D-Lib Magazine article is shown below. The ReM lists four aggregated resources: the atom:link elements whose rel attribute is http://www.openarchives.org/ore/terms/aggregates.

<?xml version="1.0" encoding="UTF-8" ?>

<atom:entry xmlns:atom="http://www.w3.org/2005/Atom">
<atom:title>Observed Web Robot Behavior on Decaying Web Subsites</atom:title>
<atom:updated>2007-09-22T07:11:09Z</atom:updated>
<atom:author>
<atom:name>Michael Nelson</atom:name>
<atom:uri>http://www.cs.odu.edu/~mln/</atom:uri>
</atom:author>
<atom:author>
<atom:name>Joan Smith</atom:name>
<atom:uri>http://www.joanasmith.com/</atom:uri>
</atom:author>
<atom:author>
<atom:name>Frank McCown</atom:name>
<atom:uri>http://www.cs.odu.edu/~fmccown/</atom:uri>
</atom:author>

<atom:link rel="alternate" type="text/html"
href="http://www.dlib.org/dlib/february06/smith/02smith.html" />
<atom:id>http://www.dlib.org/dlib/february06/smith/aggregation</atom:id>
<atom:link rel="self"
type="application/atom+xml"
href="http://www.dlib.org/dlib/february06/smith/aggregation.atom" />

<atom:category scheme="http://www.openarchives.org/ore/terms/"
term="http://www.openarchives.org/ore/terms/Aggregation"
label="Aggregation" />

<atom:link rel="http://www.openarchives.org/ore/terms/aggregates" type="text/html"
href="http://www.dlib.org/dlib/february06/smith/02smith.html" />
<atom:link rel="http://www.openarchives.org/ore/terms/aggregates" type="text/html"
href="http://www.dlib.org/dlib/february06/smith/pg1-13.html" />
<atom:link rel="http://www.openarchives.org/ore/terms/aggregates" type="application/pdf"
href="http://www.dlib.org/dlib/february06/smith/pg1-13.pdf" />
<atom:link rel="http://www.openarchives.org/ore/terms/aggregates" type="image/png"
href="http://www.dlib.org/dlib/february06/smith/MLN_Google.png" />


<atom:source>
<atom:author>
<atom:name>Dlib-Magazine</atom:name>
<atom:uri>http://www.dlib.org</atom:uri>
</atom:author>
</atom:source>
</atom:entry>
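To show how a program might read this ReM, here is a minimal Python sketch that pulls out the aggregated resources using only the standard library. The function name aggregated_resources is my own, and the embedded XML is a trimmed copy of the entry above (authors, category, and source left out), so treat it as an illustration rather than a full ORE parser.

```python
import xml.etree.ElementTree as ET

ATOM_NS = "{http://www.w3.org/2005/Atom}"
AGGREGATES = "http://www.openarchives.org/ore/terms/aggregates"

# Trimmed copy of the Atom entry above (authors, category, and source omitted).
REM_XML = """<atom:entry xmlns:atom="http://www.w3.org/2005/Atom">
  <atom:title>Observed Web Robot Behavior on Decaying Web Subsites</atom:title>
  <atom:id>http://www.dlib.org/dlib/february06/smith/aggregation</atom:id>
  <atom:link rel="alternate" type="text/html"
    href="http://www.dlib.org/dlib/february06/smith/02smith.html" />
  <atom:link rel="http://www.openarchives.org/ore/terms/aggregates" type="text/html"
    href="http://www.dlib.org/dlib/february06/smith/02smith.html" />
  <atom:link rel="http://www.openarchives.org/ore/terms/aggregates" type="text/html"
    href="http://www.dlib.org/dlib/february06/smith/pg1-13.html" />
  <atom:link rel="http://www.openarchives.org/ore/terms/aggregates" type="application/pdf"
    href="http://www.dlib.org/dlib/february06/smith/pg1-13.pdf" />
  <atom:link rel="http://www.openarchives.org/ore/terms/aggregates" type="image/png"
    href="http://www.dlib.org/dlib/february06/smith/MLN_Google.png" />
</atom:entry>"""


def aggregated_resources(rem_xml):
    """Return (href, type) for each atom:link whose rel is ore:aggregates."""
    entry = ET.fromstring(rem_xml)
    return [(link.get("href"), link.get("type"))
            for link in entry.findall(ATOM_NS + "link")
            if link.get("rel") == AGGREGATES]


for href, media_type in aggregated_resources(REM_XML):
    print(media_type, href)
```

Note that the rel="alternate" link is skipped even though it points at the same URL as the first aggregated resource; only the rel value says whether a link is part of the aggregation.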


So what does this have to do with preservation? You are probably well aware that web pages and entire websites disappear from the Web on a regular basis. Because of this, search engines like Google make pages available from their cache, and web archives like the Internet Archive regularly take snapshots of the Web. I like to call these combined preservation efforts the Web Infrastructure (WI).

If you're familiar with my recent work, you know that I created a service called Warrick which uses the WI to reconstruct lost websites. Warrick is used to reconstruct over 100 websites a month.

So what happens if you create a ReM and the resources it points to later disappear or change? How can we ensure the ReM stays accurate at various points over its lifetime?

Michael Nelson, my former PhD adviser, suggested we leverage the intelligence of web users to preserve ReMs. A large community of users already puts effort into creating and caring for Wikipedia articles; would a community of users also care for ReMs? And could we use the WI in conjunction with the small, distributed actions of this community to curate ReMs?

So this summer I've been building a prototype system called ReMember, which demonstrates how we could get general web users to preserve ReMs in the WI. I've written a paper about it (still under review) that I'll make available soon. I'm not sure if I'll be able to make the prototype available to the public, but you can see a screenshot below which shows several aggregated resources from a ReM.


Users are asked to click on resources that return a 404 or have undergone significant change. They can use a search engine like Google to find the new location of a missing resource, and they can push copies of resources into the WI. ReMember also allows a user to view resources as they existed at various times throughout the lifetime of the ReM.
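ReMember relies on people to spot these problems, but the 404 case is easy to picture in code. The sketch below is not how ReMember works; it is a hypothetical helper (the names http_status and find_missing are mine) that issues a HEAD request for each aggregated resource and reports the ones that come back 404. Deciding whether a page has undergone "significant change" is a much harder call, which is one reason to ask humans.

```python
import urllib.error
import urllib.request


def http_status(url, timeout=10):
    """Return the HTTP status code for a HEAD request, or None if unreachable."""
    request = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        # The server answered, but with an error status such as 404.
        return err.code
    except (urllib.error.URLError, OSError):
        # DNS failure, refused connection, timeout, and so on.
        return None


def find_missing(urls, get_status=http_status):
    """Return the URLs whose status is 404, preserving input order."""
    return [url for url in urls if get_status(url) == 404]
```

Calling find_missing with the hrefs from a ReM would hit the live Web, so get_status can be swapped for a stub when trying it out offline.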

If you find any of this stuff interesting and want to know more, send me an email.