YetAnotherImageFormat
There are many formats designed for storing and representing the contents of physical media as a file (or several), whether for archival, imaging, or other purposes.
During the development of The Disc Image Chef I had to study (and sometimes even reverse engineer from scratch) almost all the available formats for storing both magnetic and optical media.
Practically all of these formats (with the exception of MAME's CHD format) are proprietary, undocumented formats that only a closed source application completely understands, with all other implementations being the result of poor documentation or previous reverse engineering efforts.
Other formats, like QEMU's Copy On Write format (aka QCOW), while open source and well documented, were designed for the lifecycle of a particular emulator, missing characteristics of the represented media that other emulators could need, or being constrained to a series of unrealistic parameters (e.g. expecting all hard disks to have 512 bytes per sector).
After implementing about 50 formats in The Disc Image Chef I started adding support for reading real physical media and storing it in several of those formats, so I had to not only read the structure of existing formats, but also compare how the software where each format originated would use it to store the same physical media.
Doing this, I arrived at a series of conclusions:
- Most formats fall into one of two types:
  - Optical disc formats, for CDs and their descendants
  - Block formats, for hard disks and floppies (also used for flash media)
- Some types of media are stored in formats created specifically, and only, for the emulator that needs them. A typical example is the set of formats for digital data stored on audio tapes.
- All other types of media must be stored in one of those two format types, even if they are technically not one of those media.
- Practically all physical media have one or more pieces of data that no single format supports. Some media have several such pieces, none of which is supported by most formats.
- Some media tend to compress very well, and in the case of media organized in blocks, some media contain duplicated blocks. While some formats offer compression, none of them offered deduplication.
- Also, the formats that support compression only support the faster algorithms with the lowest compression ratios, or do not correctly compress some tricky content.
- Most formats are not extensible at all, and being closed source and undocumented, any change to them will create havoc amongst users who want to use your extended or modified image with the original software.
- Practically no format stored information on how the image was created. A notable exception is the BlindWrite image, which stores complete information about the drive.
- Besides this, no format was designed to be able to resume a read, or retry blocks with errors, on a different drive. On some formats supporting this is transparent; on others it is completely impossible.
- Rich metadata was not an option in any of the analyzed formats.
So this brought me to the idea of creating yet another image format, designed around these goals.
The next sections will explain, in as plain terms as possible, the results of those goals. For more technical descriptions please refer to the specification or the source code.
When going through hundreds and hundreds of dumps I observed an interesting trend: the way localized, regional, and translated assets are stored in the formats employed by software of any kind tends to be very repetitive.
To give an example, Windows software contains the assets, and the code, in several files, split by language. That is, a single file cannot contain several languages. However, the code and most graphical and audiovisual assets, which make up the majority of the size of such files, are exactly the same in all languages, with only the text changing. So in a dump that contains the same Windows software in several languages, the deduplication possibilities are quite big.
So to handle this better, dicformat is designed so that every block of the media is represented by a pointer inside a table (called the Data Deduplication Table, or DDT), and that pointer tells the implementation where in the dicformat file the block is stored. Simply having two pointers in the table reference the same block in the dicformat file gives us the deduplication feature: identical blocks from the media are stored only once.
At the same time, if the user or implementation does not want to use deduplication (it is a very memory-consuming feature when dumping, so using it in memory-constrained environments can cause problems, and deduplication can always be applied later in a dicformat->dicformat conversion), two identical blocks will simply point to two different locations in the dicformat file.
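As a rough sketch of this mechanism (the names and in-memory layout here are hypothetical, not dicformat's actual on-disk structures), a deduplicating writer built around such a pointer table could look like:

```python
import hashlib

class DedupWriter:
    """Toy model of a DDT: one table entry per media block,
    pointing at the offset where that block's data is stored."""

    def __init__(self, deduplicate=True):
        self.storage = bytearray()   # stands in for the image file body
        self.ddt = []                # block number -> offset into storage
        self.seen = {}               # block hash -> offset (dedup index)
        self.deduplicate = deduplicate

    def write_block(self, block: bytes):
        key = hashlib.sha256(block).digest()
        if self.deduplicate and key in self.seen:
            # identical block already stored: just add a second pointer
            self.ddt.append(self.seen[key])
            return
        offset = len(self.storage)
        self.storage.extend(block)
        self.seen[key] = offset
        self.ddt.append(offset)

    def read_block(self, number: int, block_size: int) -> bytes:
        offset = self.ddt[number]
        return bytes(self.storage[offset:offset + block_size])
```

With deduplication on, writing the same 512-byte block twice stores it once and the table simply points at it twice; with it off, the second copy gets its own location.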
After deduplication, the data from the media is divided into chunks (the size of which is user selectable) for compression.
The chosen algorithms were LZMA for chunks containing arbitrary data and FLAC for chunks containing digital audio (mostly from Compact Discs).
These two were chosen for their compression ratios, giving the best results at the moment for the kind of data each is designed to compress.
Also, because dicformat stores a value indicating exactly what type of media the dump belongs to, the algorithm can be intelligently chosen, changed, or disabled altogether. An example would be Jaguar CD and VideoNow discs, which store game data and video data respectively as audio; for them, FLAC would give a much lower compression ratio than LZMA.
Compression is another feature that can simply be disabled at will, in which case the chunks are stored uncompressed.
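The per-chunk choice can be sketched as follows. This is an illustration only: the function names are hypothetical, and since FLAC is not available in the Python standard library, audio chunks are passed through untouched here where a real writer would call a FLAC encoder.

```python
import lzma

def compress_chunk(chunk: bytes, is_audio: bool, enabled: bool = True) -> tuple[str, bytes]:
    """Pick a compressor per chunk: LZMA for arbitrary data,
    FLAC for audio, or none at all when compression is disabled."""
    if not enabled:
        return ("none", chunk)
    if is_audio:
        return ("flac", chunk)  # placeholder: a real writer would invoke a FLAC encoder
    return ("lzma", lzma.compress(chunk))

def decompress_chunk(algorithm: str, payload: bytes) -> bytes:
    if algorithm == "lzma":
        return lzma.decompress(payload)
    return payload  # covers "none" and the FLAC placeholder above
```

Each chunk records which algorithm produced it, so a reader can decompress chunk by chunk without any global setting.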
Now we have seen the two biggest parts of dicformat explained: the DDT and the data chunks.
But all of these pieces need to be easily located!
That is where the header and the index come in. The header is the first part of dicformat, merely a piece of data indicating where to locate the index (more on this below) and the type of media stored by the file.
And the index is where everything is pointed to: the DDT, all the data chunks, etc.
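As an illustration of how such a header can locate the index, here is a minimal sketch with a made-up magic value and field layout; dicformat's real header fields differ.

```python
import struct

# Hypothetical fixed-size header: magic, media type id, index offset.
# The layout and magic are illustrative, not dicformat's on-disk format.
HEADER = struct.Struct("<8sIQ")

def build_header(media_type: int, index_offset: int) -> bytes:
    return HEADER.pack(b"EXAMPLE\x00", media_type, index_offset)

def parse_header(raw: bytes) -> tuple[int, int]:
    magic, media_type, index_offset = HEADER.unpack(raw[:HEADER.size])
    if magic != b"EXAMPLE\x00":
        raise ValueError("not a recognized image file")
    # the reader then seeks to index_offset to find the DDT and chunks
    return media_type, index_offset
```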
The archival grade comes from the fact that implementations shall not, and by default will not, modify the existing DDT or data chunks, but always append new ones at the end of the file. So when the file finishes writing, a new index is created, and the header is updated to point to the new index.
This effectively becomes copy on write, although an implementation can also choose to modify data in place.
This also opens the (not yet implemented) possibility of snapshots, as a snapshot would just be a list of previous indexes.
All chunks and sections of dicformat have checksum verification, so any involuntary modification (e.g. bitrot) can be detected.
A future version is planned to add an error recovery mechanism that would allow restoring chunk contents to their correct values in case of unintentional damage, but this is not yet implemented.
We tend to see the contents of media as the data we (or our systems) write on them, that is, the user data.
However, media usually store a lot more information than we see, for several purposes. A very specific case is optical discs (like Compact Discs and DVDs), which store more information besides the user data than any other format.
To give an example, when we think of a Compact Disc, we see 2048 bytes of data per block (or, in the case of Compact Disc Digital Audio, 2352 bytes). This is only the user data.
Besides this user data, a Compact Disc Digital Data sector also contains a header (indicating the sector number and type), a subheader with information about the sector contents (most of the time unused), two fields for error recovery (so damaged sectors can be repaired) and a field for error detection. All of this extra data accounts for the difference between the 2352 bytes of a Compact Disc Digital Audio sector and the 2048 bytes of a Compact Disc Digital Data sector.
On top of all that, each sector also contains 96 bytes of extra data (called the subchannel). So now we are at 2448 bytes per sector on a CD (be it audio or data).
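The arithmetic above can be checked directly against the standard CD-ROM Mode 1 field sizes:

```python
# Standard CD-ROM Mode 1 sector layout (sizes in bytes).
sync = 12        # sync pattern
header = 4       # sector address and mode
user_data = 2048
edc = 4          # error detection code
reserved = 8     # intermediate zero field (carries the subheader in Mode 2)
ecc = 276        # error correction: 172 bytes P-parity + 104 bytes Q-parity

data_sector = sync + header + user_data + edc + reserved + ecc
audio_sector = 2352              # audio sectors are raw samples, no framing
subchannel = 96

assert data_sector == audio_sector == 2352
assert data_sector + subchannel == 2448
```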
Also, before the user data starts, a CD contains at least 4500 sectors of Lead-In (which holds the table of contents, allowing the drive to know where each track starts and ends, amongst other things) and the 150-sector pregap of the first track; after all the user data comes the Lead-Out.
Enter recordable and rewritable CDs and we get even more sections external to the user data, like the PMA.
While most of this data can be automatically generated, some of it is modified by the media creator, usually for copy protection measures.
So I called them tags: sector tags for extra data that is linked to each sector separately (like the header, subheader, ECC and subchannel described above), and media tags for extra data that is linked to the media as a whole (like the Lead-In, first track pregap, and Lead-Out).
But the CD is not the only media with tags. To give some examples: DVDs have sector tags telling whether a sector is encrypted, and media tags that can be used to differentiate between DVD flavors (DVD-RAM vs DVD-RW); LTO tapes have a near-field memory containing tape manufacturing information and usage statistics; Apple and Amiga floppies contain sector tags that some software requires; and most floppies contain an extra track with information about their duplication factory.
While some formats are able to store many of these tags, none is able to store all of them. Even worse, some applications modify the tags! (e.g. Alcohol 120% modifies the DVD type in the corresponding media tag, so if you dump a DVD-R or a DVD-RAM, the media tag stored in the image file will say it was a DVD-ROM in both cases).
For this reason dicformat is able to store any kind of sector tag or media tag, using a separate DDT for sector tags, and a data chunk for each media tag stored. Tags that are not stored occupy no space in the image. Also, any newly discovered tag (sector or media) can be supported just by adding a new entry to a list of tags.
And while the importance of some tags for preservation is up for discussion, I chose the simplest route: it is not up to me to decide which one is important. If a drive can read it, dicformat can store it.
Compact Discs are a special kind of beast because of their specific characteristics.
Most existing dumps do not include the Lead-In information, but a cuesheet, that is, a list of tracks and their durations, sometimes in a per-format way, sometimes with the same structure the drive returns. Because of this, dicformat is able to store both the data from the drive and the cuesheet. It is up to the implementation to choose which one to use if the image contains both.
Also, the sector tags in Compact Disc Digital Data media, unless damaged or modified for copy protection purposes, are regenerable from the user data. Because of this, and because this particular data neither compresses well nor contains duplicates, when storing a CD these sector tags are compared against their generated equivalents and, if identical, are not stored in the sector tag DDT.
The Compact Disc subchannel is also regenerable, but if the application decides to store it, it creates a big, incompressible chunk of non-duplicated data. Because of this, the subchannel is processed through the Claunia Subchannel Transform, a very simple, lossless algorithm that reorganizes the subchannel data, making it more compressible (and faster to compress/decompress, even including the transform overhead).
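In raw subchannel data, each of the 96 bytes of a sector carries one bit of each of the eight subchannels (P through W). The following is a simplified bit-deinterleaving in the same spirit as the transform described above (not its exact layout): grouping each channel's bits together produces long, repetitive runs (P is mostly constant, R-W often zero), which compress much better, and the operation is exactly reversible.

```python
def deinterleave(sub: bytes) -> bytes:
    """Group each of the 8 subchannels (P..W) into its own contiguous
    12-byte run, instead of one bit of each channel per input byte.
    Lossless: interleave() below reverses it exactly."""
    assert len(sub) == 96
    out = bytearray(96)
    for channel in range(8):              # P = bit 7 ... W = bit 0
        for i in range(96):
            bit = (sub[i] >> (7 - channel)) & 1
            out[channel * 12 + i // 8] |= bit << (7 - i % 8)
    return bytes(out)

def interleave(sub: bytes) -> bytes:
    """Inverse of deinterleave(): restore the raw interleaved layout."""
    assert len(sub) == 96
    out = bytearray(96)
    for channel in range(8):
        for i in range(96):
            bit = (sub[channel * 12 + i // 8] >> (7 - i % 8)) & 1
            out[i] |= bit << (7 - channel)
    return bytes(out)
```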
As mentioned before, data stored from media in dicformat is split into chunks. This works easily for all media. However, some media require storing information that is not a tag, but a description of the media contents needed to use them.
This is, for example, the case of magnetic tapes that store digital data. These tapes have three characteristics not present in any other kind of media: partitions, files and variable block sizes.
A tape can be partitioned into several independent pieces. Each partition can contain a number of files (numbered from zero), with a special marker telling the drive a file has ended. Each file is in turn divided into blocks.
Variable block sizes were easy to handle: a data chunk is closed and a new one created every time the block size changes.
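That chunk-closing rule can be sketched as follows (a hypothetical helper, not dicformat code):

```python
def split_into_chunks(blocks):
    """Close the current data chunk and start a new one whenever
    the block size changes, so each chunk holds equal-sized blocks."""
    chunks, current, size = [], [], None
    for block in blocks:
        if size is not None and len(block) != size:
            chunks.append((size, b"".join(current)))
            current = []
        current.append(block)
        size = len(block)
    if current:
        chunks.append((size, b"".join(current)))
    return chunks
```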
But for these tapes, new types of chunks were added to dicformat to store the partition and file information.
So any new media that needs a different paradigm can still use the data chunks to store the data itself, and then add new chunk types to describe the specific needs of that media.
Existing implementations can freely ignore these new chunks on read (but must never do so on convert or write, unless specifically forced by the user), just giving access to the data as is, or can choose not to support unknown chunks.
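Skipping unknown chunks works because every chunk carries its own length. A minimal sketch, assuming a hypothetical type-plus-length framing rather than dicformat's real chunk layout:

```python
import struct

# Hypothetical chunk framing: 4-byte type + 4-byte little-endian length + payload.
def read_known_chunks(stream: bytes, known_types: set) -> dict:
    chunks, pos = {}, 0
    while pos < len(stream):
        ctype, length = struct.unpack_from("<4sI", stream, pos)
        pos += 8
        if ctype in known_types:
            chunks[ctype] = stream[pos:pos + length]
        # unknown chunk types are simply skipped, not treated as an error
        pos += length
    return chunks
```

A reader built this way keeps working when a newer writer adds chunk types it has never seen.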
Any archival format must be open source, as otherwise it would be contrary to the very spirit of preservation. That is why dicformat is LGPL. This also forces any change made to it to stay open source, preventing it from being taken over by a closed source entity, while still allowing it to be used by any closed source software.
That you are reading this is a demonstration of the intention to document the whole format.
And reusability comes from modularity. It is a C# module at the moment, with an in-progress implementation being created in C90 to ensure the format can be used by anyone who wants to, including emulators, media dumping solutions, virtual drives, etc.
One of the many things that makes the difference between amateur and professional work is the auditing of every step. In some environments, it is a strict requirement.
So when dumping a disc, what drive was used, what software created the dump, and other auditing information about the dump process must be stored separately in most formats.
dicformat, however, supports storing all of this auditing information in the same file as the data.
Also, some media cannot be reliably read by any single drive, so it is important to know which drive read which part of the media (the resume information, as it allows resuming an incomplete dump). This information is also stored by dicformat.
Besides including metadata about the dump process, dicformat is able to store rich metadata about the dumped media, embedding the CICM XML metadata sidecar inside the dicformat file.
Storing scans and other images related to the media is in the backlog, yet to be implemented.
The whole design of dicformat revolves around the most efficient way to store any information a drive can return from a medium. For this reason the format does not store any unused chunk, and chunks are designed to have as little overhead as possible.
The DDT, however, can get quite big above a certain media size (e.g. >200 GiB hard disks); this problem will be fixed in version 2.0 of the format design. For media smaller than 32 GiB, the DDT does not generate much overhead.
All chunks, tags, DDTs, and the index are checksummed with CRC64, allowing unintentional damage to the dump file to be detected. Also, dicformat allows storage of content checksums (such as CRC64, SHA1, SHA256, SpamSum) to allow easy file comparison without the need to read and decompress the whole contents of two dicformat files.
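As an illustration of the checksumming, here is a bitwise CRC-64 using the CRC-64/XZ parameters (the reflected ECMA-182 polynomial); whether dicformat uses these exact parameters is an assumption here, but the principle is the same.

```python
def crc64(data: bytes) -> int:
    """Bitwise CRC-64 with CRC-64/XZ parameters: reflected polynomial
    0xC96C5795D7870F42, init and xorout all-ones."""
    crc = 0xFFFFFFFFFFFFFFFF
    for byte in data:
        crc ^= byte
        for _ in range(8):
            crc = ((crc >> 1) ^ 0xC96C5795D7870F42) if crc & 1 else (crc >> 1)
    return crc ^ 0xFFFFFFFFFFFFFFFF
```

A table-driven version is much faster, but the bitwise form makes the algorithm obvious: any single flipped bit in a chunk changes the checksum, so bitrot is detected on read.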
The expandability, plus the features that can be disabled when unused, are not the only proof of future-proofing. When the format was first designed it could only store optical and magnetic disks; support for storing digital tapes was added easily.
And several features are already planned for the next version: data position measurement, less overhead for big disks, support for copy protections based on twin sectors, rich information about other copy protections, and the ability to store flux data from floppies alongside decoded data are some of the features already on the roadmap. Most of them will generate files that are still readable by previous implementations.
This demonstrates that dicformat is future proof, able to adapt and expand to the needs of any media, and any possible dump of it, past, present, or future.