
Skeleton Test Suite Proposed Methodology

Andy Jackson edited this page Mar 14, 2017 · 2 revisions

Introducing the skeleton test suite and future signature development methodology

The skeleton test suite provides a mechanism for creating file format 'shells', or skeleton files. These files test the matching algorithm of the DROID format identification tool, and they test the integrity and discreteness of DROID-compatible format signatures; that is, they help ensure a one-to-one (1:1) relationship between a signature and the 'file-format' it matches.

Copy and paste the sequence CA FE BA BE into a hex editor and save it, preferably with a meaningful name and a .class extension. The file created will identify in DROID as Java Compiled Object Code, x-fmt/415 (DROID signature file V63).


Figure 1: Hex representation of the new class-shell.class file


Figure 2: DROID Identification for class-shell.class
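The hex-editor step above can also be scripted. The following is a minimal sketch, assuming Python 3.8 or later; the helper name `write_skeleton` is hypothetical and is not part of DROID or any existing skeleton tooling. It simply writes the four bytes quoted in the text to a file named as in Figure 1.

```python
from pathlib import Path

# Magic number quoted in the text for Java compiled object code (x-fmt/415).
CLASS_MAGIC = bytes.fromhex("CA FE BA BE")


def write_skeleton(path, payload):
    """Write the raw signature bytes to a skeleton file and return its path.

    Hypothetical helper for illustration only.
    """
    target = Path(path)
    target.write_bytes(payload)
    return target


if __name__ == "__main__":
    skeleton = write_skeleton("class-shell.class", CLASS_MAGIC)
    # Print the bytes back as spaced hex so they can be compared with a hex editor view.
    print(skeleton, skeleton.read_bytes().hex(" ").upper())
```

The resulting file still needs to be opened in DROID to confirm the identification; the script only replaces the manual copy-and-paste step.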

Because this byte sequence is so short and the signature so specific, it is unlikely that it would clash with another in the PRONOM database. However, a signature for a format that, for example, looks only for the hex sequence CAFE at the beginning of a file, or one that uses DROID regular expression syntax, e.g. CA[FA:FF]{1}BE, will also match this file. By creating the skeleton file and checking its identification in DROID we can demonstrate that no signatures with a similar footprint exist in PRONOM. Further testing of new signatures against this file, and against files representing the rest of the signatures in the PRONOM database, can in future show that no signatures are put into the database that would jeopardise the uniqueness of this identification. This is the premise on which this work is based.
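To illustrate how overlapping signatures can match the same skeleton file, here is a small sketch that checks the three patterns mentioned above against the four skeleton bytes. The Python regular expressions are hand-written approximations, assuming that [FA:FF] means any byte from FA to FF and that {1} means a gap of exactly one byte; they are not produced by DROID and do not use its matching engine.

```python
import re

# Contents of the skeleton file described in the text.
SKELETON = bytes.fromhex("CAFEBABE")

# Hand-translated stand-ins for the patterns discussed in the text.
# re.match anchors each pattern at the beginning of the data.
PATTERNS = {
    "CAFEBABE": re.compile(rb"\xCA\xFE\xBA\xBE"),
    "CAFE": re.compile(rb"\xCA\xFE"),
    # Assumed reading: [FA:FF] is a byte range, {1} is a one-byte gap.
    "CA[FA:FF]{1}BE": re.compile(rb"\xCA[\xFA-\xFF].\xBE", re.DOTALL),
}


def matching_patterns(data):
    """Return the names of all patterns that match at the start of data."""
    return [name for name, rx in PATTERNS.items() if rx.match(data)]


if __name__ == "__main__":
    for name in matching_patterns(SKELETON):
        print("matches:", name)
```

All three patterns match the skeleton bytes in this sketch, which is the kind of overlap that testing against skeleton files is meant to expose.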

Proposed Methodology ~ Create ~ Test ~ Submit ~ Host ~

  • Research your file format
  • Create your DROID compatible signature and signature file
  • Test your signature in DROID against primary-source examples of your format
  • Copy and paste the pertinent bytes into a hex editor and save them as one or more skeleton files
  • Test that DROID matches your skeleton file
  • Test your signature against a complete skeleton corpus, looking for collisions
  • Submit your signature, skeleton file and primary sources (optional) to the signature authority
  • Submit your skeleton file to the skeleton corpus holder for hosting

This methodology is iterative. That is, when collisions are discovered using your newly developed signature, it is important to review your work and develop a more robust, non-colliding signature. Further iteration is needed if it looks like an existing pattern in the PRONOM database should be changed.
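The collision-testing step of this methodology could be automated along the following lines. This is a sketch under stated assumptions: the corpus and signatures are held in memory, the matching uses Python regular expressions rather than DROID, and the identifier `dev/1`, the file `other-shell.bin` and the function `find_collisions` are hypothetical names introduced purely for illustration.

```python
import re

# Hypothetical corpus: skeleton name -> (file bytes, expected signature ID).
CORPUS = {
    "class-shell.class": (bytes.fromhex("CAFEBABE"), "x-fmt/415"),
    "other-shell.bin": (bytes.fromhex("CAFF00BE0102"), "dev/1"),
}

# Stand-in signatures; "dev/1" is an invented draft signature under development.
SIGNATURES = {
    "x-fmt/415": re.compile(rb"\xCA\xFE\xBA\xBE"),
    "dev/1": re.compile(rb"\xCA[\xFA-\xFF].\xBE", re.DOTALL),
}


def find_collisions(corpus, signatures):
    """Map each skeleton name to the signature IDs that matched it unexpectedly."""
    collisions = {}
    for name, (data, expected) in corpus.items():
        hits = [sig_id for sig_id, rx in signatures.items() if rx.match(data)]
        unexpected = [sig_id for sig_id in hits if sig_id != expected]
        if unexpected:
            collisions[name] = unexpected
    return collisions


if __name__ == "__main__":
    # First pass: the draft signature is too loose and also matches the class skeleton.
    print(find_collisions(CORPUS, SIGNATURES))

    # Second pass: tighten the draft signature and test again.
    revised = dict(SIGNATURES)
    revised["dev/1"] = re.compile(rb"\xCA\xFF.\xBE", re.DOTALL)
    print(find_collisions(CORPUS, revised))
```

The two passes mirror the iteration described above: a collision is reported, the draft signature is reviewed, and the corpus is tested again. A fuller check would also report skeletons that their own expected signature fails to match.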

How to research and develop signatures for file format identification

Two guides exist to aid in this endeavour:

Signature Development Utility

The signature development utility will output a signature file for a single signature, or exemplar of a format. The XML can be edited by hand so that it contains multiple signatures, similar to the standard DROID signature file.

Benefits

While the benefits of this approach are still being explored, some are immediately apparent:

  • Users can understand the stability of DROID Signature File releases over time through testing
  • The files can be easily embedded in unit tests in the DROID project
  • A properly hosted suite of these files will be free from IPR issues that might exist in user submitted files
  • Signature collisions can be detected automatically
  • Skeleton files can be used in any tool adopting DROID signature syntax and so this methodology can also benefit the work done on the Open Planets Foundation (OPF) FIDO project
  • It is easier to complete signature development and submit it to The National Archives

Shortfalls

Equally, there are shortfalls that this approach cannot address, or at least cannot address easily:

  • The suite only ensures DROID can find what it expects to find
  • It is difficult to represent every combination of file contents that DROID signature syntax implies
  • Skeleton files for OLE2 and container based formats are difficult to create

Benefits? - To Be Decided

  • We can perhaps have more confidence in creating signatures based on format specifications alone. For example, this specification for the 'LocoScript 1' file format gives us everything needed to create a signature and skeleton files. Following best practice to create the signature, and testing it with skeleton files and against the skeleton suite, means we can avoid collisions and mis-identifications and begin to think about incorporating signatures generated like this in a format registry.

The need for a fully-fledged test-suite

Identification is only the first line of digital preservation. This tool, and the potential test suite, ensure only the consistency and accuracy of identification by one of the most important tools currently used for this purpose. This work came out of an approach taken during the development of DROID 6.1 at The National Archives. It also complements a paper written by Fetherston and Gollins at The National Archives and published in the International Journal of Digital Curation:

Fetherston, A., Gollins, T. (2012). Towards the Development of a Test Corpus of Digital Objects for the Evaluation of File Format Identification Tools and Signatures. The International Journal of Digital Curation, 7, 16-26. Retrieved from doi.org/10.2218/ijdc.v7i1.211

After identification, it is my view that validation and feature extraction form the second line of digital preservation. It is at this stage that I feel a fully-fledged test suite will be most beneficial. As well as pushing identification tools further, such a suite would serve the tools in the second stage of the digital preservation workflow. What is of utmost importance is their ability to read established 'ground-truthed' files, identify them, understand their properties, and present those properties to users and systems so that digital preservation decisions can be made. This cannot realistically be achieved with a skeleton corpus.