Excerpt: No-nonsense Guide to Born-digital Content

I thought I’d put up an excerpt from my book published earlier this year with Heather Ryan, The No-nonsense Guide to Born-Digital Content. This is from Chapter 4, “Acquisition, accessioning and ingest.” It’s probably the chapter I most enjoyed drafting as it forced me to break down the rationale for various means of acquiring born-digital content. In the case of disk imaging, that meant going back to “first principles” – such as it is – for forensic processing, which was pretty fun to work out. This excerpt covers that, but the chapter goes on to discuss capturing web sites, email and social media.


Acquisition, accessioning and ingest

Now that we have discussed the major appraisal and collecting concerns, let’s get into the specific work of acquiring, accessioning and ingesting born-digital content into an archives or library collection. Before we do so, bear with us as we define these terms and cover the high-level principles that guide born-digital acquisitions. For our purposes here, acquisition refers to the physical retrieval of digital content. This could describe acquiring files from a floppy drive, selecting files from a donor’s hard drive or receiving files as an e-mail attachment from a donor. In all cases, you now have physical control of the content. Accessioning refers to integrating the content into your archives or collections: assigning an identifier to the accession, associating the accession with a collection and adding this administrative information into your inventory or collection management system. Ingest is the process of placing your content into whatever repository system you have for digital content. This could be any solution from a simple but consistent file and folder structure to a full stack of storage and content management software. While each of these processes can sometimes be done in a single step, we will cover each in turn, and by the end of the chapter you should have a comprehensive overview of the initial steps in any born-digital workflow.
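
To make these bookkeeping steps a little more concrete, here is a minimal sketch of the kind of administrative record an accession might receive, written in Python purely for illustration – the field names are hypothetical rather than a prescribed schema, and a real collection management system (ArchivesSpace, for instance) will define its own.

```python
# A hypothetical accession record – illustrative field names only,
# not a prescribed schema.
accession = {
    "accession_id": "2019-042",           # identifier assigned at accessioning
    "collection": "Jane Doe Papers",      # the collection this accession joins
    "date_received": "2019-03-14",
    "extent": "12 3.5-inch floppy disks",
    "source": "Donor transfer",
    "notes": "Received with signed deed of gift.",
}

# Ingest, at its simplest, can be a consistent file and folder structure, e.g.:
#   born_digital/<collection>/<accession_id>/{disk_images,metadata}
```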

Principles of acquisition

The principles of acquisition of born-digital content flow from the general guidelines for acquiring all archival material. Yes, that’s right: respect des fonds is still the guiding light for handling born-digital content. The principle, issuing from Natalis de Wailly at the Royal Archives of France in 1841, advises grouping records by the body under which they were created and used – each such grouping constituting, roughly, a ‘fonds’. The two natural objectives flowing from respect des fonds are the retention of both provenance and original order. These objectives really get to the heart of our concerns in acquiring born-digital content – we want to avoid tampering with original order or losing sight of provenance at each step of the way.

It can be surprisingly tricky to do this, and a multitude of tools and techniques has arisen in the digital space to facilitate best practices. The whole field of digital forensics, as it has been adopted by the archives and cultural heritage communities, is designed to maintain original order and provenance. In general, the tools and techniques to do this come from the law enforcement and criminal investigation fields’ overriding injunction to never alter evidence. Digital forensics comes into play for some types of born-digital acquisition, but by no means all, so we will cover other relevant acquisition techniques too.

Acquisition of born-digital material on a physical carrier

Forensics as practised by libraries and archives is most often centred on the acquisition of digital content from physical digital media items – floppy disks, data tapes, spinning-disk hard drives, USB drives (and other solid-state drives) and optical media. The number of collections containing this material is growing. To take a recent example, the University of Melbourne began working with the floppy disks and computers of the Australian writer Germaine Greer. These materials hold voluminous amounts of her work and life, much of it unpublished and unseen, including floppy disks containing an unpublished book, two Mac computers and a hefty born-digital love letter to a fellow writer. That is a significant store of information located on digital media.

Getting material off these types of devices presents us with a few concerns. By virtue of the file system technology residing on these media, we have in our hands not just the donor’s content, but also the context and environment of that content – the material’s original order as it came to the library or archives, along with traces of the user’s activity, potential remnants of past data, and system files and features not revealed to the regular user. All of this information can constitute aspects of provenance as well, giving us information on the origin and ownership of the material. Simple copying of data from these devices therefore disregards this contextual information. This is the first reason for forensic technology in libraries and archives. Our second concern is that merely attaching this media to a computer – the moment the computer recognises the media – typically initiates a series of unprompted adjustments to the media by the computer. We will cover these adjustments shortly, but the takeaway here is that simply connecting a media item to a computer for copying can destroy that contextual information, not just benignly disregard it. This is the second reason for forensics in the field.

The file system

Because nearly all the contextual data that an archivist would be interested in is present in the file system of a media item, we will take a moment to describe the role of the file system. The file system is the main organizing system that an operating system uses to arrange the files and folders, both physically and through name assignment, on a disk. Similarly to a physical filing system, which provides organization for paper files, records and forms, a digital file system is a prescribed way for an operating system to organize the data on a computer. It keeps track of where data is on a disk, along with all the names you give to your files and folders. Along with these basic organizing tasks, the file system usually provides ways for the operating system to record dates and times associated with the files, such as the last time a file was modified, when it was created, the last time a file was opened and, in some cases, when the file was removed.
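
To make this concrete, here is a minimal Python sketch that prints the timestamps a file system reports for a single file; the file name is hypothetical, and exactly which timestamps are available depends on the file system and operating system.

```python
import datetime
import os

def file_times(path):
    """Print the timestamps the file system keeps for one file."""
    st = os.stat(path)
    fmt = lambda ts: datetime.datetime.fromtimestamp(ts).isoformat()
    print("modified:", fmt(st.st_mtime))    # last change to the file's content
    print("accessed:", fmt(st.st_atime))    # last read, if the mount records it
    # On Unix st_ctime is the last metadata change; on Windows it is creation time.
    print("changed/created:", fmt(st.st_ctime))
    if hasattr(st, "st_birthtime"):         # true creation time on some systems
        print("born:", fmt(st.st_birthtime))

file_times("letter_draft.doc")  # hypothetical file
```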

A file system is a piece of software that is subject to additions, refinements, reworking and other innovations, so any given file system will have different capabilities from the next. Prominent file systems, past and present, include:

  • the FAT file systems, in use in most older MS-DOS and Windows machines, and frequently used as a lingua franca for portable drives – nearly every operating system understands FAT. We discussed FAT in the first chapter to give you an idea of how information can be organized on magnetic media.
  • the New Technology File System (NTFS), a successor file system to FAT developed by Microsoft and present in most Windows versions post-Windows NT (1993).
  • the Hierarchical File System (HFS), used in early Apple machines and media; its successors HFS+ and HFSX, in use in modern Apple systems; and the Apple File System (APFS), a very recent successor (2017) that runs across Apple’s iOS and macOS machines.
  • the first, second, third and fourth extended file systems (ext, ext2, ext3 and ext4) commonly run in Linux distributions.

You will probably run across one or more of these file systems in your own work, and may learn more about them as you require. For now, just demystifying all those initialisms should be enough.
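
If you are curious which file system a given disk or disk image carries, The Sleuth Kit’s fsstat tool will report it along with layout details. The sketch below simply wraps that call in Python; it assumes The Sleuth Kit is installed, and the image name is hypothetical.

```python
import subprocess

def identify_file_system(image_path):
    """Return fsstat's report on a disk image (file system type, layout, metadata)."""
    # For a full-disk image with a partition table you may need to point fsstat
    # at a partition offset (see The Sleuth Kit's mmls tool).
    result = subprocess.run(["fsstat", image_path],
                            capture_output=True, text=True, check=True)
    return result.stdout

print(identify_file_system("floppy001.img"))  # hypothetical image
```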

Write blocking

Now that you have a sense of the important contextual data that the file system has knocking around on a piece of media, let’s cover how that data can be tampered with if write blocking is not used – in other words, what can a computer write to an attached piece of media? We can’t give a definitive list, because these processes run in the background, unprompted by the user, and are subject to change with the host computer’s operating system and program updates. Nevertheless, here are a few examples of the tasks that software on a host computer may perform upon detecting attached media.

  • An operating system may read data from the files on the media, adjusting the access times to these files. It may do this to scan for viruses or to create a searchable index of the attached media’s content for your convenience.
  • An operating system may create temporary or hidden system folders and files on the attached device in order to help it manage the media.
  • If the file system on both the host computer and the attached media is a journaling system – a file system that writes down what it is going to do before it does it – the host machine may scan for incomplete journal entries and complete or remove those entries on the attached media, again adjusting the received state of the device.

This is a short list, but ultimately we don’t know, nor can we stay fully informed about, what our work machines may do with the media we attach to them. Operating systems, the software on them and file systems themselves change and are updated over time. For example, an antivirus suite may make adjustments to an attached media device. This is all the more reason to use write blocking whenever we reasonably can.
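
As a small illustration of the first bullet above, the following Python sketch shows how merely reading a file can silently update its access time. Whether the change actually sticks depends on the host’s mount options (many Linux systems use relatime or noatime) – which is exactly the point: you cannot rely on the host machine to leave attached media alone.

```python
import os

path = "letter.txt"  # hypothetical file on attached, writable media

before = os.stat(path).st_atime
with open(path, "rb") as f:
    f.read()         # an indexer or virus scanner does much the same thing
after = os.stat(path).st_atime

print("access time changed:", before != after)
```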

So, what is write blocking? Write blocking is the process of blocking all write commands to any partition or device. Older floppy disk media actually have this functionality built into the disks themselves, in the form of either an adjustable write-protect tab on 3.5” disks or a physical notch on 5.25” and 8” floppy disks. Floppy disk drives check for these physical attributes and block writes accordingly. If you encounter CD-Rs or CD-ROMs in your collection, these newer media are effectively write blocked as well.

However, much media remains open to writes, including hard drives, USB and other solid-state drives and tape media. The primary way to avoid writes to this type of media is the use of a write blocker, also termed a forensic bridge. These devices sit in between the target media (the media you are attempting to acquire data from) and the host machine (the computer you are using to acquire this data). They detect any write commands issuing from the host machine and either directly block these write commands, or bank the write commands in their own store while preventing the commands from applying to the target media. While the two strategies differ only subtly, in either case writes from the host machine never reach the device. Write blockers can provide other useful features, such as detection of hidden partitions, block counts and a tally of bad sectors. Perhaps foremost among the extra functionality is the simple ability to connect a hard drive to the host machine, as write blockers come with an array of ports designed to attach to both the host machine and the target media. Write blocking and facilitating a physical connection to your host machine make these devices a critical piece of your acquisition workflow if you are encountering any number of writable media.

Hardware write blocking devices can also be replaced with a software write blocker. This software runs on the host machine and attempts to intercept and block write commands issuing from the operating system or other software. In some cases this may be your only option, due to either budget constraints or unique physical circumstances. Nevertheless, software write blockers should be considered a secondary and less desirable solution to the problem. Such software is less reliable than a write blocking device because it must anticipate and detect every process or program on the host machine that may attempt a write to the target media. By contrast, a write blocking device need only react to any write commands issued, regardless of origin, and block them. The software write blocker’s task is more difficult, as operating systems are not always fully documented or open source, nor are they static systems. Along with that, a user could install any number of programs, such as virus detection software, of which the software write blocker may not be properly aware.
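
As a rough illustration of the software approach, on Linux you can ask the kernel to treat a block device as read-only before anything mounts it. The sketch below wraps the util-linux blockdev command; the device path is hypothetical, the commands require root, and this is emphatically not a substitute for a hardware write blocker.

```python
import subprocess

device = "/dev/sdb"  # hypothetical path of the attached target media

# Mark the device read-only at the kernel level, then confirm the flag.
subprocess.run(["blockdev", "--setro", device], check=True)
flag = subprocess.run(["blockdev", "--getro", device],
                      capture_output=True, text=True, check=True)
print("read-only flag:", flag.stdout.strip())  # "1" means writes are refused
```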

Whether you are encountering media that is already write protected or media that remains writable, it is important to take the best steps you can to ensure write protection: check the tabs and notches on your floppy disks, verify that your CDs are in fact CD-Rs or CD-ROMs (although burning data to optical media is usually a multistep process and is less likely to occur without your knowledge) and use write blocking devices for other media – or, if you must, a software write blocker instead. Taking this precaution will give you the best chance of preserving all the information on that media for future use!

The right drive for the right media

The second aspect of correctly attaching your target media to the host machine is having the right equipment in place. It goes without saying that if you have a 5.25” floppy disk in hand, but no 5.25” drive, you will have an extraordinarily difficult time accessing data from that disk! In some cases, such as CDs and DVDs, acquiring a disc reader will not be too much of an imposition on either your budget or your time, as such readers are still commonly available from vendors and relatively inexpensive. As we move further back into computing’s early years, locating and purchasing older drives becomes a little more cumbersome – but by no means out of reach. In the case of floppy disks, you will need to search for used drives. At the time of writing, eBay is still a good resource for locating these drives, and they are still within most budgets: 5.25” drives may cost as little as US$100 (c.£81.00), and 8” drives are often not too much more. If you can find a seller who indicates that the drive has been tested and calibrated, so much the better. There are also vintage drive vendors to be found online.

In the case of 3.5” disks, the market still offers many USB-connected 3.5” drives. These may be appropriate if the 3.5” floppies are of the more modern, IBM PC-formatted variety, but be aware that USB-connected drives can only read IBM PC-formatted disks – floppy disks featuring older formats, such as Apple DOS, cannot be read by them. For this reason, you will probably want to acquire an actual vintage 3.5” drive, which can be used with a floppy disk controller to read a range of floppy disks. We will cover floppy disk controllers in the next section.

There are many more media formats – from (for their time, of course) high-capacity Zip disks to mini optical discs and numerous data tape cartridges – with countless others produced across the US, European and other markets around the world. It is beyond the scope of this book to cover every possible media format and the drives available to read from them, but be aware that there are many resources online to get started in this area. Both vintage computer enthusiasts and practitioners in the information science field will be of great help here.

Floppy disk controllers

Computers that were released with floppy disk drives, or that did not ship with drives themselves but were released during the era of floppy disks, all contained floppy disk controllers – a chip board with the circuitry to read from and write to a floppy. Modern machines, of course, do not ship with floppy disk controllers. You may anticipate, then, a significant problem after acquiring your vintage floppy disk drive: you will have no way to connect it to the host machine – and even if you did, the host machine may not be able to operate the attached drive.

For vintage computing enthusiasts, for practitioners in the archives and library field, or for individuals who simply want to rescue their personal data from old floppies, there has emerged a small market for modern floppy disk controllers – chip boards designed to connect to a modern machine via USB and facilitate a connection to an old floppy disk drive. Two prominent entries in this field are the FC5025 by Device Side Data and the KryoFlux, put together by the Software Preservation Society. The former is strictly for connection with 5.25” drives, while the latter can operate 5.25”, 3.5” and even 8” drives (8” drives do require additional equipment, however). Again, it is beyond the scope of our overview here to delve into the use cases and specific procedures for these floppy disk controllers, but suffice it to say that you will want to acquire one of these devices after you have secured your floppy disk drives. The particular board you choose depends on the drives and disks you need to process, though the KryoFlux is the most flexible single device in this space. An added benefit is that these floppy disk controllers also serve as write blocking devices – it never hurts to be extra cautious!

A breather!

We have covered a lot and we have not yet reached the point where we are actually copying data from the target media! So, before we touch on that last step, here is a review of what we have established so far.

  1. In born-digital acquisition, respect des fonds is still in play, so we want to capture original order, provenance and other contextual information.
  2. Merely attaching target media to a host machine can alter the target media and remove the desired contextual data.
  3. In addition, simply copying the files off the target media will not capture these important pieces of information.
  4. Because of 2, we want to practise write blocking with our target media.
  5. Because of 3, we want to take full disk images of the target media instead of the more common operation of file copying.

Hopefully, points 1–4 make sense now. To close out our section on born-digital materials on physical carriers, let’s talk about disk images.
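
The rest of the chapter takes up disk imaging properly, but as a preview of point 5, here is a minimal Python sketch of what a disk image is in principle: a complete, byte-for-byte copy of the device, fixed with a checksum so you can later verify it. In practice you would use dedicated imaging tools (Guymager, FTK Imager, ddrescue and the like) behind a write blocker; the paths below are hypothetical.

```python
import hashlib

def image_device(device_path, image_path, chunk_size=1024 * 1024):
    """Copy every byte of a device into an image file and return its SHA-256."""
    sha256 = hashlib.sha256()
    with open(device_path, "rb") as src, open(image_path, "wb") as dst:
        while True:
            chunk = src.read(chunk_size)
            if not chunk:
                break
            dst.write(chunk)
            sha256.update(chunk)
    return sha256.hexdigest()

# Hypothetical paths; the target media sits behind a write blocker,
# and reading a raw device typically requires elevated privileges.
digest = image_device("/dev/sdb", "accession-2019-042/floppy001.img")
print("SHA-256:", digest)  # record the checksum alongside the image
```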


The No-nonsense Guide to Born-digital Content is out now from Facet Publishing.

 

First set of NSIDC glacier photos up

Field, William Osgood. 1941. Muir Glacier: From the Glacier Photograph Collection. Boulder, Colorado USA: National Snow and Ice Data Center. Digital Media.

The first set of glacier photos from the Roger G. Barry Archives is up now at CU Boulder. There are 950 photos here, and that is a fraction of the 30,000 in the collection. More will be added over the year. This is a great resource for those interested in glaciology and climate change – and many are stunning images regardless. Again, thank you to CLIR and everyone at CU Boulder who have been so critical to the work.

Book Out: The No-nonsense Guide to Born-Digital Content


I have a new book out with my colleague Heather Ryan, The No-nonsense Guide to Born-Digital Content

I started drafting chapters for this book in late 2016 when Heather, then the head of the Archives here and now director of the department, approached me about co-authoring the title. I had never written in chapter form before, nor for a more general audience. Approaching my usual stomping ground of born-digital collection material from this vantage was really intriguing, so I jumped at the chance.

To back up a little, our subject here is collecting, receiving, processing, describing and otherwise taking care of born-digital content for cultural heritage institutions. With that scope, we have oriented this book to students and instructors, as well as current practitioners who are aiming to begin or improve their born-digital strategy. We’ve included lots of real-world examples to demonstrate points, and the whole of the book is designed to cover all aspects of managing born-digital content. We really discuss everything from collecting policy and forensic acquisition to grabbing social media content and designing workflows. In other words, I’m hoping this provides a fantastic overview of the current field of practice.

Our title is part of Facet Publishing’s No-nonsense series, which provides an ongoing run of books on topics in information science. Facet in general is a great publisher in this space (if you haven’t checked out Adrian Brown’s Archiving Websites, I recommend it), and I’m happy to be a part of it. I thank them for their interest in the book and their immense help in getting it published!

Update: The book is now available stateside in the ALA store.

Rearview: Wolfenstein 3D

Wolfenstein 3D Postcard
Greetings from Castle Wolfenstein – wish you were here!

Wolfenstein 3D is a gaming divide. Id’s first-person shooter introduced an experience unlike most console or computer games before it: it produced shock, suspense, disorientation and novelty. Super Mario Bros., Elevator Action, Life Force, Contra: a host of NES games felt like oddities and diversions while id’s shooter stood alone as a tense horror gauntlet – when it wasn’t an action gala.

Reflections on the title describe its unrivaled immersive play. Id tipped the nascent action first-person genre in its favor with this title, and then basically set it on its course for the next 20 or 30 years with Doom. I submit that first-person shooters are still in the shadow of that title (caveat: I haven’t played many recent FPSs). First-person shooters that are praised for their characterization, story, atmosphere, nuance and art still never fail to deliver the genre staples: adversaries to gun down, the armory itself, and the gameplay to make this very straightforward action fun and interesting. Modern games have succeeded, fitfully and partially I would argue, in subverting this genre expectation, but it’s never absent. I suppose such is the nature of genre – they are shooters, after all – but still, on a very real level playing a seminal shooter like Bioshock still feels like playing Doom. And for all the expertly and artfully realized world that a game like Bioshock offers – among the very best, I think – it still handles like (a slower) Doom, through and through. One begins to wonder how much artfully realized world-building and narrative the FPS genre can bear.

To widen the scope a bit, I would like to point out a review of Masters of Doom (2003) by James Wagner Au. Au considers what the first-person genre – and, more broadly, computer games in general – would have been like had id not so dominated the landscape with its games, which, however polished their gameplay, are steeped in frenetic action and general mayhem.

Corridor Jitters

During the first few weeks of playing I had trouble sleeping, and I can recall a nightmare where brown-shirted guards hunted me down long, open corridors, and where each of my evasions revealed a new nest of Nazis screaming out some declaration of annihilation in German. For a time, the game was genuinely frightening and horrific.

As I became habituated to Wolfenstein’s atmosphere, a lot of that fear receded. I got familiar with the milieu of enemies and occasionally anticipated the designers’ placement of them. I became accustomed to their zig-zag approaches and to the keyboard controls themselves.

But the tension of play remained. To this day it is still an intense game. Enemies’ screams and cries are over the top but striking and memorable. Level design is often punishing, forcing you down corridors lined with tiny recesses where an enemy may or may not stand waiting, or sticking you in a maze populated entirely by silent opponents – your first indication of their presence is being fired upon, usually at point-blank range.

The damage model is much closer to Counter-Strike than Doom or “Doom clones.” It isn’t uncommon to die from just a handful of enemy hits. A single shot from the lowliest guard can take you from 100% to 0%. And there isn’t any armor. You get into a lot of survival situations, long and short moments alike where you know a single false step will end it. On top of this, Wolfenstein’s engine only allowed 90-degree angles, so peeking around hiding spots was quite difficult, and as often as not you would end up exposing your back to another stuffed-away guard.

It’s known that Tom Hall, one of the designers of the title, pushed for a more realistic game, one that attempted to responsibly portray Nazi prisons, offices and barracks as far as the game engine allowed. Some levels do flirt with that realism, but most, and the most memorable, do not. The characteristic Wolfenstein 3D level has a few reasonable rooms, but a whole swath of spaces that no architect would ever have constructed for any reason, transparently set to rattle you. A player can nearly hear the level designer rejoining, “Sure, and how about this?”

Legacy of Brutality

On the game’s violence: it’s true that it’s unremarkable by modern standards, but this has more to do with graphic fidelity than anything else. It’s also true that the gore would be recycled and greatly amplified in Doom, itself quite outdone by the realism of subsequent shooters. But Wolfenstein 3D started it.

I can’t forget the audaciousness of this gore at the time. Id’s previous big seller was Commander Keen, a title that delivered the kid-oriented, all-ages console platformer to the PC. The Catacomb series, while first-person shooters, were strictly fantasy, with the player aiming fireballs at neon-colored demons and skeletons. We weren’t talking about knifing guard dogs and gunning down Nazi officers. The realism, and moreover the evident desire for realism (if not in level architecture, then in character design), was not there.

I agree with Au when he detects something personally cathartic for the creators within the gore of Doom (and, I extrapolate, to a lesser extent within Wolfenstein 3D). One can certainly chalk up the morbidness to adolescence, but there’s a seriousness and realness to the depiction (if not the treatment) of bodily harm that suggests, as Au states, that Adrian Carmack and the others were invested in this production far beyond the flippant dismemberment and ludicrous body-splosions of successor games (Rise of the Triad, for example).

A couple of bits to know about the art in Wolfenstein 3D and Doom. For Wolfenstein, Adrian Carmack drew all sprites by hand. That means that each pixel was placed by him, and considered. And, as my interview with Bobby Prince suggests, space restrictions may have forced Adrian to return to his pixels and decide what could be cut or altered. Each pixel you see in Wolfenstein was considered; the dramatic descent of the officer character, say, was really worked over and decided upon.

 

Imp Pixels
The Doom Imp: Pixel by pixel

For Doom, the characters were first modeled by hand in clay, then photographed from multiple angles. Those photos were scanned, and Adrian could then begin his pixel renditions. So again, the creation of these sprites was a manual process. The computer was the tool, but it took a personal hand and labor to make the images.

Quake I - fighting knights
Physics?

Compare that to id’s own Quake. No doubt there’s labor and consideration here too, but the violence has less impact in this title. When an opponent spills blood, it comes out in obvious arcs of red pixels, more the outcome of an algorithm than any person’s hand. When they explode, the parts fly off along mathematical arcs. Bodily harm and violence become more a computational system than an animation articulated by hand, frame by frame and pixel by pixel. The graphic violence in Wolfenstein 3D and Doom was affecting and disturbing in a way subsequent efforts were not.

The Castle

Shooters were and still are narrow in conception and in play, and Wolfenstein 3D relies on the quick succession of tiny victories over a near-constant crisis to engage you. No matter how enjoyable this is, the six episodes in the full release are far too much game for the play mechanics at hand. Naturally, the current franchise entries feature the same commitment to action, visceral combat and slaughter, but have added story, extensive world building, legitimate characters and high production values.

The original shooter, though, remains an odd beast: a sparse dungeon, remarkably isolating, where the infrequent glimpses outside reveal a total night. Moving through the game’s endless and often identical halls gives way to a repetitiveness that moves from seductive, to tedious, to fatal. It’s a place of cheap, cartoonish horrors that turn real if you let your guard down. Welcome to Castle Wolfenstein.

“Revealing our melting past: Rescuing historical snow and ice data”

For the last year I have served as Co-PI for a fantastic project, supported by CLIR’s Digitizing Hidden Special Collections and Archives grant program, which centers on metadata gathering and digitization for the National Snow and Ice Data Center’s (NSIDC) expansive collection of glacier and polar exploration prints within the Roger G. Barry Archives here in Boulder. We have a stellar project archivist leading the work, and we expect to begin posting images on our own site over the course of the year. Stay tuned for that.

The linked article, published in the final (ever, actually) issue of GeoResJ, is a good summary of the project’s scope and value from everyone on the team, including our initial PI, now at the University of Denver. We’re really excited to be contributing, along with NSIDC, to glaciology and earth history through this collection, and we are planning further promotion as processing continues.

Revealing our melting past: Rescuing historical snow and ice data
(ScienceDirect) (CU Scholar)

“Aggregating Temporal Forensic Data Across Archival Digital Media”

Last year I attended the Digital Heritage 2015 conference and presented a paper on digital forensics in the archive. The paper centers on collecting file timestamps across floppy disks into a single timeline to increase intellectual control over the material and to explore the utility of such a timeline for a researcher using the collection.

As I state in the paper, temporal forensic data likely constitutes the majority of forensic information acquired in archival settings, and in most cases this information is gathered inherently through the generation of a disk image. While we may expect further use of this data as disk images make their way to researchers as archival objects (and the community’s software, institutional policies and user expectations grow to support it), it is not too soon to explore how temporal forensic data can be used to support discovery and description, particularly in the case of collections with a significant number of digital media.
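
The paper itself works from forensically generated metadata, but purely as an illustration of the idea, here is a small Python sketch that merges file modification times from several (hypothetical, read-only) mounts of disk images into one sorted timeline. A real workflow would draw on richer sources, such as the DFXML produced during imaging.

```python
import csv
import datetime
import os

def build_timeline(mount_points, out_csv):
    """Merge file modification times from several mounted images into one timeline."""
    events = []
    for mount in mount_points:
        for root, _dirs, files in os.walk(mount):
            for name in files:
                path = os.path.join(root, name)
                events.append((os.stat(path).st_mtime, mount, path))
    events.sort()  # one chronological view across all media
    with open(out_csv, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["timestamp", "media", "path"])
        for ts, mount, path in events:
            writer.writerow(
                [datetime.datetime.fromtimestamp(ts).isoformat(), mount, path])

# Hypothetical read-only mounts of imaged floppies from one collection.
build_timeline(["/mnt/floppy001", "/mnt/floppy002"], "timeline.csv")
```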

Many thanks to the organizers of Digital Heritage 2015 for the support and feedback; it was a wonderful and very wide-reaching conference.

Aggregating Temporal Forensic Data Across Archival Digital Media (IEEE Xplore) (CU Scholar)

KryoFlux Webinar Up

In February, I took part in the first Advanced Topics webinar for the BitCurator Consortium, centered on using the KryoFlux in an archival workflow. My co-participants, Farrell at Duke University and Dorothy Waugh at Emory University, both contributed wonderful insights into the how and why of using the floppy disk controller for investigation, capture and processing. Many thanks to Cal Lee and Kam Woods for their contributions, and to Sam Meister for his help in getting this all together.

If you are interested in using the KryoFlux (or do so already) I recommend checking the webinar out, if only to see how other folks are using the board and the software.

An addendum to the webinar for setting up in Linux

If you are trying to set up the KryoFlux in a Linux installation (e.g. BitCurator), take a close look at the instructions found in the README.linux text file located in the top directory of the package downloaded from the KryoFlux site. It contains instructions on the dependencies needed and the process for allowing a non-root user (such as bcadmin) to access floppy devices through the KryoFlux. This setup will avoid many permissions problems down the line, as you will not be forced to use the device as root, and I have found it critical to correctly setting up the software in Linux.