Using SWORD for Deposit

I’ve been looking for suitable front-ends for Fedora. The default installation comes with both a Java administrator client and a newer (though less full-featured) web administration tool. Developers are currently transitioning to the web client, adding functionality with each point release. Neither of these clients is suitable as a comprehensive front-end for Fedora, however.

I identified a deposit interface as the foremost component for the front-end, and quickly looked to SWORD for a solution. SWORD (Simple Web-service Offering Repository Deposit) is a profile of the Atom Publishing Protocol (APP), a successful specification for publishing items on the web. SWORD retools the protocol slightly, emphasizing the request and deposit functions over functions like delete or update. APP is an HTTP-based protocol, so of course SWORD functions over the web. The idea is that a simple protocol like this will find widespread use and acceptance, raising the visibility of repositories and the bulk of their holdings by simplifying the deposit process, all while keeping it a remote function. For instance, there’s a SWORD Facebook app and a (sample) SWORD plug-in for Microsoft Office. Ideally, from either of these platforms you could send your work right away to any number of receiving repositories.
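
To make that concrete: before depositing, a SWORD client retrieves a service document listing the collections that accept deposits. Here’s a minimal sketch of what such a document might look like; the URL and titles are hypothetical placeholders, not our actual configuration.

<?xml version="1.0" encoding="UTF-8"?>
<service xmlns="http://www.w3.org/2007/app"
         xmlns:atom="http://www.w3.org/2005/Atom"
         xmlns:sword="http://purl.org/net/sword/">
  <sword:version>1.3</sword:version>
  <workspace>
    <atom:title>GCM Repository</atom:title>
    <!-- One collection element per deposit target; the href is made up -->
    <collection href="http://localhost:8080/sword/deposit/gcm:collection">
      <atom:title>Museum holdings</atom:title>
      <accept>application/zip</accept>
      <sword:mediation>false</sword:mediation>
    </collection>
  </workspace>
</service>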

I have set up SWORD with Fedora. It runs as a web application inside Tomcat. After ironing out a few kinks, everything seems to be working, and it’s a little clearer how SWORD could fit into the repository as a whole. SWORD is best at depositing content, naturally, and it’s best if that content is pre-processed. The out-of-the-box demonstration client isn’t going to provide a way to add metadata to your deposit.
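
For the record, a successful deposit is just an HTTP POST of the content to a collection URL, answered with an Atom entry describing the new object. A hedged sketch of such a receipt (the PID, date, and URLs are all made up):

<!-- Hypothetical deposit receipt from the Fedora SWORD web application -->
<entry xmlns="http://www.w3.org/2005/Atom"
       xmlns:sword="http://purl.org/net/sword/">
  <title>deposit.zip</title>
  <id>info:fedora/gcm:42</id>
  <updated>2010-01-01T12:00:00Z</updated>
  <!-- Where the deposited content now lives in the repository -->
  <content type="application/zip"
           src="http://localhost:8080/fedora/get/gcm:42/DS1"/>
  <sword:treatment>Ingested as a new digital object</sword:treatment>
</entry>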

EasyDeposit, written by Stuart Lewis, looks very promising in this area. EasyDeposit is a PHP front-end to SWORD that walks the user through several steps prior to delivering the deposit. It allows the administrator to adjust, add, delete, and create steps, tailoring the process to the organization. For our purposes, we would add steps to collect metadata about hardware, bibliographic materials, software, and so on. I’ve implemented EasyDeposit on our test instance, and it does safely deposit content along with some default metadata fields. Configuring the existing steps is straightforward, requiring modification of .php files. It should be easy to add in collections, or content models that define metadata and behavior requirements, and present them in these steps. Unfortunately (for me), EasyDeposit is presently at version 0.1 (although it just got support for the CrossRef API), and the documentation is not all there. This doesn’t put EasyDeposit out of the running, as the interface looks great and the idea of customizing a template is pretty attractive.

More broadly, SWORD will not serve as an entire front-end solution, and the question becomes whether one wants to join a SWORD interface like EasyDeposit with other Fedora-compliant components for searching and disseminating. Alternatives to this approach are projects like Muradora and Islandora, which attempt to provide a more full-featured front-end. I plan to explore these options and get a better idea of the possibilities for full implementation.

Ingesting a Content Model

I wanted to briefly post a Content Model for the repository. This is the “Common Metadata” Content Model, adapted from the Hydra Project. A digital object can be associated with a content model before, during, or after ingestion. The content model in turn can point to a Service Definition (sDef) object, which defines certain services for the digital object. Those services are in turn concretely defined in a Service Deployment (SDep) object. The SDep object defines these services with the Web Services Description Language (WSDL), an XML format for describing a web service. Documentation for Fedora 3 states that “Notably, Fedora currently only supports performing disseminations via HTTP GET.” This should be fine for our purposes, and it should make our .wsdl file pretty straightforward.
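
To give a flavor of the WSDL layer, here is a minimal sketch of the kind of .wsdl an SDep might carry, binding a single hypothetical getMetadata operation to HTTP GET. All of the names, the namespace, and the service address are placeholders, not our actual SDep.

<definitions name="CommonMetadata" targetNamespace="urn:gcm"
    xmlns="http://schemas.xmlsoap.org/wsdl/"
    xmlns:this="urn:gcm"
    xmlns:http="http://schemas.xmlsoap.org/wsdl/http/"
    xmlns:mime="http://schemas.xmlsoap.org/wsdl/mime/">
  <message name="getMetadataRequest"/>
  <message name="getMetadataResponse"/>
  <portType name="CommonMetadataPortType">
    <operation name="getMetadata">
      <input message="this:getMetadataRequest"/>
      <output message="this:getMetadataResponse"/>
    </operation>
  </portType>
  <binding name="CommonMetadataHTTP" type="this:CommonMetadataPortType">
    <!-- Fedora 3 only performs disseminations via HTTP GET -->
    <http:binding verb="GET"/>
    <operation name="getMetadata">
      <http:operation location="getMetadata"/>
      <input><http:urlEncoded/></input>
      <output><mime:content type="text/xml"/></output>
    </operation>
  </binding>
  <service name="CommonMetadataService">
    <port name="CommonMetadataPort" binding="this:CommonMetadataHTTP">
      <!-- The one line to edit if the server's host or port changes -->
      <http:address location="http://localhost:8080/metadata-service/"/>
    </port>
  </service>
</definitions>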

This is certainly a lot of layers involved in making Fedora really do something, but the modularity is key when you need to make changes on your server. If the port designations for your server have changed, you only need to change a few .wsdl files. If you need to assign new services to a new type of digital object, just add those service definitions to your content model. At least, I hope it’s that simple.

<?xml version="1.0" encoding="UTF-8"?>
<foxml:digitalObject VERSION="1.1" PID="gcm-cModel:commonMetadata"
xmlns:foxml="info:fedora/fedora-system:def/foxml#"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="info:fedora/fedora-system:def/foxml# http://www.fedora.info/definitions/1/0/foxml1-1.xsd">

<foxml:objectProperties>
<foxml:property NAME="info:fedora/fedora-system:def/model#state" VALUE="Active"/>
<foxml:property NAME="info:fedora/fedora-system:def/model#label" VALUE="Common metadata model"/>
<foxml:property NAME="info:fedora/fedora-system:def/model#ownerId" VALUE="fedoraAdmin"/>
</foxml:objectProperties>

<foxml:datastream ID="RELS-EXT" STATE="A" CONTROL_GROUP="X" VERSIONABLE="true">
<foxml:datastreamVersion ID="RELS-EXT.0" LABEL="External relations" MIMETYPE="text/xml" SIZE="448">
<foxml:xmlContent>
<rdf:RDF xmlns:fedora-model="info:fedora/fedora-system:def/model#" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
  <rdf:Description rdf:about="info:fedora/gcm-cModel:commonMetadata">
    <fedora-model:hasModel rdf:resource="info:fedora/fedora-system:ContentModel-3.0"></fedora-model:hasModel>

    <!--
    A key line here. This says that objects associated with this Content Model will have this service. In this case, since this is a "Common Metadata" Content Model, the service it points to is "gcm-sDef:commonMetadata."
    -->

    <fedora-model:hasService rdf:resource="info:fedora/gcm-sDef:commonMetadata"></fedora-model:hasService>
  </rdf:Description>
</rdf:RDF>
</foxml:xmlContent>
</foxml:datastreamVersion>
</foxml:datastream>

<foxml:datastream ID="DS-COMPOSITE-MODEL" STATE="A" CONTROL_GROUP="X" VERSIONABLE="true">
<foxml:datastreamVersion ID="DS-COMPOSITE-MODEL.0" LABEL="DS composite model" MIMETYPE="text/xml" SIZE="780">
<foxml:xmlContent>
<dsCompositeModel xmlns="info:fedora/fedora-system:def/dsCompositeModel#">
  <dsTypeModel ID="DC">
    <form MIME="text/xml"></form>
  </dsTypeModel>
  <dsTypeModel ID="RELS-EXT">
    <form MIME="text/xml"></form>
  </dsTypeModel>
  <dsTypeModel ID="descMetadata">
    <form MIME="text/xml"></form>
  </dsTypeModel>
</dsCompositeModel>
</foxml:xmlContent>
</foxml:datastreamVersion>
</foxml:datastream>

</foxml:digitalObject>

This content model points to one service definition (info:fedora/gcm-sDef:commonMetadata). This service definition object will (through an SDep) deliver metadata common to all the objects in the repository: title, type (from the DCMI vocabulary), type (from our own categories), creation date, and a general description.
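
For reference, the service definition object itself carries a METHODMAP datastream naming the abstract methods it offers; the matching SDep then binds those methods to concrete URLs via its WSDL. A hedged sketch of what ours might contain (the operation name is my placeholder):

<fmm:MethodMap name="Common metadata methods"
    xmlns:fmm="http://fedora.comm.nsdlib.org/service/methodmap">
  <!-- One abstract method; the SDep's WSDL binds it to HTTP GET -->
  <fmm:Method operationName="getCommonMetadata"/>
</fmm:MethodMap>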

Finally, this content model is missing two compulsory datastreams: a basic Dublin Core datastream and an audit trail. Fedora adds these automatically when the object is ingested.
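
For illustration, the auto-generated Dublin Core datastream is just an oai_dc record seeded from the object’s label and PID, something like this (the exact output is hedged, but the seeding behavior is documented):

<oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
           xmlns:dc="http://purl.org/dc/elements/1.1/">
  <!-- Fedora fills these in from the object's label and PID -->
  <dc:title>Common metadata model</dc:title>
  <dc:identifier>gcm-cModel:commonMetadata</dc:identifier>
</oai_dc:dc>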

Instance of Fedora 3.3 Up

Whew. That was intense.

A bare-bones installation of Fedora 3.3 is up and running on the GCM server. The address is http://10.10.24.35:8080/fedora if you’re on Goodwill’s network; authentication required. Installation was actually relatively painless, taking about 1/30th of the time I spent setting up a default install of DSpace last year. I’m not sure why this is; DSpace is supposed to be the more plug-and-play repository software. A key difference here: there is absolutely nothing you can do with Fedora right now, whereas with DSpace you could start building communities and collections from the start.

Fedora is using our regular old MySQL instance to store its information. This will work fine. The next step will be to set up some basic services and become familiar with FOXML, the XML format Fedora uses to describe its digital objects. From there we can begin creating our schema and doing some simple test runs. When this is going smoothly enough, we can think more about adding hooks into the Semantic Web (such as it stands) by pointing to elements in Dublin Core, Friend of a Friend, or other ontologies.

One final prerequisite that needs to be in place as soon as possible is a regular backup of the server. Our present MySQL database is mailed weekly to a couple of different computers, so that data is fairly secure. But the drive running our server has no redundancy, so we’re not as protected as we should be.

Notes on the Open Archival Information System (OAIS)

Back in 2002 the Consultative Committee for Space Data Systems made a recommendation to the ISO for an Open Archival Information System. The recommendation has found broad acceptance, and digital repository software packages like DSpace usually elaborate on their level of compliance with it. Since we want our archive to have a future as a federated or cooperating (OAIS terms) archive, and since the terminology and concepts created in this document are widespread, I decided to take some notes on the recommendation as they relate to potential metadata elements we’ll employ.

The recommendation mostly concerns itself with the long-term preservation of digital objects, although the framework incorporates metadata for physical objects as well. Broadly, OAIS defines an Information Object as a Data Object coupled with its Representation Information. The Representation Information allows a person to understand how the bits in the Data Object are to be interpreted. An example would be a TIFF file (Data Object) coupled with an ASCII document (Representation Information) detailing the headers, the compression method, etc., like the TIFF description at Digital Preservation (The Library of Congress). Of course, one might also want Representation Information for the ASCII file, to explain how characters are interpreted in that format. OAIS terms this phenomenon recursive Representation Information, and one might eventually accrue a Representation Network of such digital objects. One stops when the Knowledge Base of the Designated Community has the requisite knowledge to understand the top-most piece of Representation Information.
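
A purely illustrative sketch of such a Representation Network, in ad hoc XML of my own devising (OAIS does not define a format for this):

<informationObject>
  <dataObject href="apple-ii-manual-scan.tif"/>
  <representationInformation>
    <!-- Explains how to interpret the TIFF bits -->
    <dataObject href="tiff-6.0-specification.txt"/>
    <representationInformation>
      <!-- Explains how to interpret the ASCII bits; the recursion stops
           here because the Designated Community's Knowledge Base is
           assumed to cover plain ASCII text -->
      <dataObject href="us-ascii-specification.txt"/>
    </representationInformation>
  </representationInformation>
</informationObject>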

OAIS defines two types of Representation Information: Structure and Semantic. Structure Information describes the data format applied to the bit sequence to derive more meaningful values like characters, pixels, numbers, etc. Semantic Information describes the social meaning behind these higher values (for example that the text characters are English).

OAIS discourages using software that can access and use Data Objects as a replacement for comprehensive Representation Information. Although that would serve the end user well enough for a time, the software itself naturally poses its own obsolescence problem. Of course, the digital media we would like to preserve is mostly software itself. We may have datasets, images, scans, etc., but the majority of digital assets we hold are complete software packages. This includes operating systems, office suites, computer games, console games (on cartridges), and so on. Retrieving Representation Information for all these types of software will be a considerable and ongoing task, as most software will consist of multiple file types.


Interview Notes

Future use of GCM’s digital archive can be divided into internal and external realms. Internally the archive will serve as a database of holdings. Inventory reports, audits, and storage organization are likely to be the primary purposes for internal reference. Since the museum is seeking accreditation by the American Association of Museums, the archive’s metadata should accommodate this goal. One accreditation requirement that falls within the scope of the database aspect of the archive is specified in the AAM Commission’s Expectations Regarding Collections Stewardship [.pdf] document, which requires:

a system of documentation, records management, and inventory is in effect to describe each object and its acquisition (permanent or temporary), current condition and location, and movement into, out of, and within the museum

Note that an “object” is here defined as “materials used to communicate and motivate learning and instruments for carrying out the museum’s stated purpose,” so the database should track accessioned and non-accessioned materials. The latter is termed “ephemera” in our present database.

In the external realm, one foresees museum members and researchers availing themselves of an online GCM digital archive. This might be for general interest or for a specific project or research topic. Possible documents of interest would be detailed photos of machines and software, manufacturing information such as the number of machines produced in a certain model line and original price points, design documents (both institutional and individual) for machines and software, and manual texts. Where possible one will want to link to official specifications of software and hardware (filesystem specifications, manufacturer’s notes, etc.).

Extensive metadata for the internal components of hardware is not foreseen as a priority. In this category are motherboards, sound cards, network cards, hard drives, disk drives, etc. In the case of complete computer systems, information on the internal components is readily available elsewhere, which leaves little reason to tear down the system and verify those components unless doing so would yield particularly valuable information. In the case of incomplete or non-standard computer systems (systems custom-built, upgraded, or downgraded), documentation of internal components might occur where those components deviate from the manufacturer’s norm.

Russ discussed some of the difficulty in labeling a manufacturer’s different model lines of hardware. For instance, the silicon rails on chips can differ by micrometers or nanometers. When a manufacturer is able to tighten the distance between rails, a new model or version of a chip may go into production. Such a change is recorded by the manufacturer, but it may be quite difficult for the museum to measure it and thus record that piece of manufacturing data. Which manufacturing specifications should be recorded in the museum’s database is a key issue that will need to be resolved.

PAWN and Producers

The Producer-Archive Workflow Network (PAWN) is a platform for handling the ingestion of artifacts into a long-term digital repository like Fedora or DSpace. As such it focuses on the Producer-Archive interaction described in the Open Archival Information System recommendation [pdf]. It strives for flexibility in accommodating different producer-archive relationships, most likely found in a distributed system. For example, an archivist or repository manager may use PAWN to handle disparate types of data or producers (manufacturers, individual scholars, students, etc.), each of whom may have different metadata to fill out before submitting a package for ingestion, and whose packages may be processed differently with regard to individual metadata elements. PAWN is part of a larger tool set being developed by ADAPT (An Approach to Digital Archiving and Preservation Technology).

PAWN seems most applicable in a distributed repository that sees submissions from a variety of different producers. It’s unlikely the Goodwill Computer Museum would need such flexibility for its own repository. That repository will be fairly centralized, maintaining only multiple clients in the building (and perhaps, in time, a few remote ones). Our producers will always be staff or trained volunteers. But the project does highlight that the museum will be seeing submissions from at least two different ‘producers’: the recycling department and individual donors, and perhaps institutional donors as well.

The recycling department ‘producer’ effectively makes the real producer anonymous. The only exception would be provenance information found on or in the artifact itself (stickers, names in books, data on disks, etc.). Even when such information is present, I can’t imagine the museum would be able to use it, as it is very likely outside the recycling department’s right to disclose.