I thought I’d put up an excerpt from my book published earlier this year with Heather Ryan, The No-nonsense Guide to Born-Digital Content. This is from Chapter 4, “Acquisition, accessioning and ingest.” It’s probably the chapter I most enjoyed drafting, as it forced me to break down the rationale for various means of acquiring born-digital content. In the case of disk imaging, that meant going back to “first principles” – such as they are – for forensic processing, which was pretty fun to work out. This excerpt covers that, but the chapter goes on to discuss capturing web sites, email and social media.
Acquisition, accessioning and ingest
Now that we have discussed major appraising and collecting concerns, let’s get into the specific work of acquiring, accessioning and ingesting born-digital content into an archives or library collection. Before we do so, bear with us as we define these terms and cover the high-level principles that guide born-digital acquisitions. For our purposes here, acquisition refers to the physical retrieval of digital content. This could describe acquiring files from a floppy drive, selecting files from a donor’s hard drive or receiving files as an e-mail attachment from a donor. In all cases, you now have physical control of the content. Accessioning refers to integrating the content into your archives or collections: assigning an identifier to the accession, associating the accession with a collection and adding this administrative information into your inventory or collection management system. Ingest is the process of placing your content into whatever repository system you have for digital content. This could be any solution from a simple but consistent file and folder structure to a full stack of storage and content management software. While each of these processes can sometimes be done in a single step, we will cover each in turn, and by the end of the chapter you should have a comprehensive overview of the initial steps in any born-digital workflow.
Principles of acquisition
The principles of acquisition of born-digital content flow from the general guidelines for acquiring all archival material. Yes, that’s right: respect des fonds is still the guiding light for handling born-digital content. The principle, issuing from Natalis de Wailly at the Royal Archives of France in 1841, advises the grouping of collections by the body (roughly, the ‘fonds’) under which they were created and purposed. The two natural objectives flowing from respect des fonds are the retention of both provenance and original order. These objectives really get to the heart of our concerns in acquiring born-digital content – we want to avoid tampering with original order or losing sight of provenance each step of the way.
It can be surprisingly tricky to do this, and there have arisen a multitude of tools and techniques in the digital space to facilitate best practices. The whole field of digital forensics, as it has been adopted by the archives and cultural heritage communities, is designed to maintain original order and provenance. In general, the tools and techniques to do this come from the law enforcement and criminal investigation fields’ overriding injunction to never alter evidence. Digital forensics comes into play for some types of born-digital acquisition, but by no means all, so we will cover other relevant acquisition techniques too.
Acquisition of born-digital material on a physical carrier
Forensics as practised by libraries and archives is most often centred on the acquisition of digital content from physical digital media items – floppy disks, data tapes, spinning-disk hard drives, USB drives (and other solid-state drives) and optical media. The number of collections containing this material is growing. To take a recent example, the University of Melbourne began working with the floppy disks and computers of the Australian writer Germaine Greer. This media contains voluminous amounts of her work and life, much unpublished and unseen, including floppy disks containing an unpublished book and two Mac computers – along with a hefty born-digital love letter to a fellow writer. That is a significant store of information located on digital media.
Getting material off these types of devices presents us with a few concerns. By virtue of the file system technology residing on these media, we have in our hands not just the donor’s content, but also the context and environment of that content – the material’s original order as it came to the library or archives, along with traces of the user’s activity, potential remnants of past data, and system files and features not revealed to the regular user. All of this information can also constitute aspects of provenance, giving us information on the origin and ownership of the material. Therefore, simply copying data from these devices is equivalent to disregarding this contextual information. This is the first cause for forensic technology in libraries and archives. Our second concern is that merely attaching this media to a computer, so that the computer recognises it, typically initiates a series of unprompted adjustments to the media by the computer. We will cover these adjustments shortly, but our takeaway here is that simply connecting a media item to a computer for copying can destroy the contextual information, not just benignly disregard it. This is the second cause for forensics in the field.
The file system
Because nearly all the contextual data that an archivist would be interested in is present in the file system of a media item, we will take a moment to describe the role of the file system. The file system is the main organizing system that an operating system uses to arrange the files and folders, both physically and through name assignment, on a disk. Similarly to a physical filing system, which provides organization for paper files, records and forms, a digital file system is a prescribed way for an operating system to organize the data on a computer. It keeps track of where data is on a disk, along with all the names you give to your files and folders. Along with these basic organizing tasks, the file system usually provides ways for the operating system to record dates and times associated with the files, such as the last time a file was modified, when it was created, the last time a file was opened and, in some cases, when the file was removed.
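To make those dates and times a little more concrete, here is a minimal sketch in Python of asking the operating system what the file system has recorded for a single file. The function name file_times and the labels in the returned dictionary are our own inventions for illustration. Note the assumption flagged in the comment: the meaning of the third timestamp depends on the platform, and not every file system records every date.

```python
import os
from datetime import datetime, timezone


def file_times(path):
    """Return the timestamps the file system reports for one file.

    Hypothetical helper for illustration only. os.stat reads metadata
    and does not open the file's content.
    """
    info = os.stat(path)

    def as_utc(seconds):
        # Convert seconds since the epoch to an ISO 8601 string in UTC.
        return datetime.fromtimestamp(seconds, tz=timezone.utc).isoformat()

    return {
        "modified": as_utc(info.st_mtime),
        "accessed": as_utc(info.st_atime),
        # On Unix-like systems st_ctime is the last metadata change;
        # on Windows it has historically been reported as creation time.
        "changed_or_created": as_utc(info.st_ctime),
    }
```

Calling file_times on a file in a scratch folder returns three date strings. These are exactly the kind of values that can shift when media is attached to a host machine without protection.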
A file system is a piece of software that is subject to additions, refinements, reworking and other innovations, so any given file system will have different capabilities from another file system. Prominent file systems are and have been:
- the FAT file systems, in use in most older MS-DOS and Windows machines, and frequently used as a common lingua franca for portable drives – nearly every operating system understands FAT. We discussed FAT in the first chapter to give you an idea of how information can be organized on magnetic media.
- the New Technology File System (NTFS), a successor file system to FAT developed by Microsoft and present in most Windows versions post-Windows NT (1993).
- the Hierarchical File System (HFS), used in early Apple machines and media, its successors HFS+ and HFSX, in use in all modern Apple systems, and the Apple File System (APFS), a very recent successor (2017) to run across Apple’s iOS and macOS machines.
- the first, second, third and fourth extended file systems (ext1, ext2, ext3, ext4) commonly run in Linux distributions.
You will probably run across one or more of these file systems in your own work, and may learn more about them as you require. For now, just demystifying all those initialisms should be enough.
Now that you have a sense of the important contextual data that the file system has knocking around on a piece of media, let’s cover how that data can be tampered with if write blocking is not used; i.e. what can a computer write to an attached piece of media? We can’t give a definitive list because these processes run in the background, unprompted by the user, and are subject to change based on the host computer’s operating system and program updates. Nevertheless, here are a few examples of the tasks that software on a host computer may do upon detecting attached media.
- An operating system may read data from the files on the media, adjusting the access times to these files. It may do this to scan for viruses or to create a searchable index of the attached media’s content for your convenience.
- An operating system may create temporary or hidden system folders and files on the attached device in order to help it manage the media.
- If the file system on both the host computer and the attached media is a journaling system – a file system that writes down what it is going to do before it does it – the host machine may scan for incomplete journal entries and complete or remove those entries for the attached media, again adjusting the received state of the device.
This is a short list, but we ultimately don’t know, nor can we remain always knowledgeable, about what our work machines may do with the media we attach to them. Operating systems, the software on them and file systems themselves are changed and updated over time. For example, an antivirus suite may make adjustments to an attached media device. This is all the more reason to use write blocking whenever we reasonably can.
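One way to see these background adjustments for yourself is to compare a folder’s metadata before and after some activity. The following is a rough sketch under stated assumptions, not a forensic tool: the function names snapshot and differences are hypothetical, and it is meant for a scratch copy or test media, never for unprotected collection material. It records each file’s size, modification time and access time, then reports what was added, removed or changed.

```python
import os


def snapshot(root):
    """Map each relative file path under root to (size, mtime_ns, atime_ns)."""
    state = {}
    for folder, _dirs, files in os.walk(root):
        for name in files:
            full = os.path.join(folder, name)
            info = os.stat(full)  # metadata only; file content is not read
            rel = os.path.relpath(full, root)
            state[rel] = (info.st_size, info.st_mtime_ns, info.st_atime_ns)
    return state


def differences(before, after):
    """List paths added, removed or changed between two snapshots."""
    shared = before.keys() & after.keys()
    return {
        "added": sorted(set(after) - set(before)),
        "removed": sorted(set(before) - set(after)),
        "changed": sorted(p for p in shared if before[p] != after[p]),
    }
```

Taking one snapshot, letting an indexer or virus scanner run over the test folder, and taking a second snapshot will often surface new hidden files under "added" and adjusted access times under "changed", depending on how the host machine is configured.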
So, what is write blocking? Write blocking is the process of blocking all write commands to any partition or device. Older floppy disk media actually have this functionality built into the floppy disks themselves, in the form of either an adjustable read and write tab in 3.5” disks or a physical notch in 5.25” and 8” floppy disks. Floppy disk drives will check for these physical attributes and block writes appropriately. If you encounter CD-Rs or CD-ROMs in your collection, these newer media will be write blocked as well.
However, much media remains open to writes, including hard drives, USB and other solid-state drives and tape media. The primary way to avoid writes to this type of media is the use of a write blocker, also termed a forensic bridge. These devices sit in between the target media (the media you are attempting to acquire data from) and the host machine (the computer you are using to acquire this data). They detect any write commands issuing from the host machine and either directly block these write commands, or bank the write commands in their own store while preventing the commands from applying to the target media. While it is a subtle adjustment in strategy, in either case writes from the host machine are blocked from the device. Write blockers can provide other useful features, such as detection of hidden partitions, block counts and a tally of bad sectors. Perhaps foremost among the extra functionality is the simple ability to connect a hard drive to the host machine, as write blockers come with an array of ports designed to attach to both the host machine and the target media. Write blocking and facilitating a physical connection to your host machine make these devices a critical piece of your acquisition workflow if you are encountering any number of writable media.
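The two strategies described above, blocking writes outright or banking them away from the target media, can be modelled in a few lines of Python. This is a toy model for illustration only: the class names BlockingBridge and BankingBridge are our own, the "media" is just a byte array in memory, and no real device or vendor interface is being described. In particular, how a real device treats its banked writes afterwards is not covered here; the sketch simply stores them.

```python
class BlockingBridge:
    """Toy model: reads pass through, every write is refused."""

    def __init__(self, media):
        self._media = media  # stands in for the target media

    def read(self, offset, length):
        return bytes(self._media[offset:offset + length])

    def write(self, offset, data):
        # The write command never reaches the target media.
        raise PermissionError("write blocked")


class BankingBridge:
    """Toy model: writes are accepted but kept in the bridge's own store."""

    def __init__(self, media):
        self._media = media
        self.banked = []  # list of (offset, data) the host tried to write

    def read(self, offset, length):
        # Reads always come from the untouched target media.
        return bytes(self._media[offset:offset + length])

    def write(self, offset, data):
        # The host believes the write succeeded; the media is unchanged.
        self.banked.append((offset, bytes(data)))
```

In both models the byte array passed in as media is identical before and after any attempted write, which is the property that matters for preserving the received state of the device.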
Write blocking devices can be replaced with a software write blocker as well. This software is run on the host machine and attempts to intercept and block write commands issuing from the operating system or other software. In some cases, this may be your only option, due to either budget constraints or unique physical circumstances. Nevertheless, software write blockers should be considered a secondary and less desirable solution to the problem. This is because such software is less reliable than write blocking devices, as it must anticipate and detect all processes on the host machine, as well as any other software, that may attempt a write on the target media. By contrast, a write blocking device need only react to any write commands issued, regardless of origin, and block those commands. The software write blocker’s task is more difficult, as operating systems are not always fully documented or open source, nor are they static systems. Along with that, a user could install any number of programs, such as virus detection software, of which the software write blocker may not be properly aware.
Whether you are encountering media that is already read-only or media that remains writable, it is important to take the best steps you can to ensure write protection: check the tabs and notches on your floppy disks, verify that your CDs are in fact CD-Rs or CD-ROMs (although burning data to optical media is usually a multistep process and is less likely to occur without your knowledge) and use write blocking devices for other media – or if you must, a software write blocker instead. Taking this precaution will give you the best chance of preserving all the information on that media for future use!
The right drive for the right media
The second aspect of correctly attaching your target media to the host machine is to have the right equipment in place. It goes without saying that if you have a 5.25” floppy disk in hand, but no 5.25” drive, you will have an extraordinarily difficult time accessing data from that disk! In some cases, such as CDs and DVDs, acquiring a disc reader will not be too much of an imposition on either your budget or your time, as such readers are still commonly available from vendors and relatively inexpensive. As we move back further into computing’s early years, locating and purchasing older drives becomes a little more cumbersome – but by no means out of reach. In the case of floppy disks, you will need to search for used drives. At the time of writing, eBay is still a good resource for locating these drives, and they are still within most budgets. 5.25” drives may cost as little as US$100 (c.£81.00), and 8” drives are often not too much more. If you can find a seller who indicates that the drive has been tested and calibrated, so much the better. There are also vintage drive vendors to be found online.
In the case of 3.5” drives, the market still has many USB-connected 3.5” drives. This may be appropriate if the 3.5” floppies are of the more modern, IBM-PC formatted variety, but you should be aware that floppy disks featuring older formats, such as Apple DOS, cannot be read by the USB-connected drives. These drives are only able to read IBM-PC formatted floppy disks. For this reason, you will probably want to acquire an actual vintage 3.5” drive, which can be used with a floppy disk controller to read a range of floppy disks. We will cover floppy disk controllers in the next section.
There are many media formats, from high-capacity (for their time, of course) Zip disks to mini optical discs and numerous data tape cartridges, with countless others produced by the US and European industries and elsewhere across the world. It is beyond the scope of this book to cover every possible media format and the drives available to read from them, but be aware that there are many resources online to get started in this area. Both vintage computer enthusiasts and practitioners in the information science field will be of great help here.
Floppy disk controllers
Computers that were released with floppy disk drives, or that did not ship with drives themselves but were simply released during the era of floppy disks, all contained floppy disk controllers – a chip board with the circuitry to read from and write to a floppy. Modern machines, of course, do not ship with floppy disk controllers. You may anticipate, then, a significant problem after acquiring your vintage floppy disk drive, as you will have no way to connect it to the host machine – and even if you did, the host machine may not be able to operate the attached floppy disk drive.
For vintage computing enthusiasts, for practitioners in the archives and library field or for individuals who simply want to rescue their personal data off old floppies, there has emerged a small market for modern floppy disk controllers – chip boards designed to connect to a modern machine via USB and facilitate a connection to an old floppy disk drive. Two prominent entries in this field are the FC5025 by Device Side Data and the KryoFlux, put together by the Software Preservation Society. The former is strictly for connection with 5.25” drives, while the latter can operate 5.25”, 3.5” and even 8” drives (8” drives do require additional equipment, however). Again, it is beyond the scope of our overview here to delve into the use cases and specific procedures for these floppy disk controllers, but suffice to say that you will want to acquire these devices after you have secured your floppy disk drives. The particular board depends on the drives and disks you need to process, though the KryoFlux is the most flexible single device in this space. An added benefit is that these floppy disk controllers will also serve as write blocking devices – it never hurts to be extra cautious!
We have covered a lot and we have not yet reached the point where we are actually copying data from the target media! So, before we touch on that last step, here is a review of what we have established so far.
1. In born-digital acquisition, respect des fonds is still in play, so we want to capture original order, provenance and other contextual information.
2. Merely attaching target media to a host machine can alter the target media and remove the desired contextual data.
3. In addition, simply copying the files off the target media will not capture these important pieces of information.
4. Because of 2, we want to practise write blocking with our target media.
5. Because of 3, we want to take full disk images of the target media instead of the more common operation of file copying.
Hopefully, summaries 1–4 make sense now. To close out our section on born-digital materials on physical carriers, let’s talk about disk images.
The No-nonsense Guide to Born-digital Content is out now by Facet Publishing:
- US/Canada: https://www.alastore.ala.org/content/no-nonsense-guide-born-digital-content
- Outside of US/Canada: http://www.facetpublishing.co.uk/title.php?id=301959