This repository has been archived by the owner on Jan 17, 2019. It is now read-only.

Identification

Ivan Herman edited this page Feb 23, 2015 · 2 revisions

Identification of EPUB-WEB documents

This is a very first draft, based on discussions between Markus and Ivan.

1. Every EPUB-WEB document must have an HTTP(S) URI identifier, referred to as the "Canonical Identifier (CID)" of the document.

An HTTP(S) URI is preferred because it plays well with the Web. It is, however, recognized that other schemes are also used in various communities; DOI-s are a typical example. To take that example: because there is a canonical mapping of an abstract DOI to a URI (e.g., the DOI doi:10.1186/2041-1480-4-37 corresponds to the URI http://dx.doi.org/10.1186/2041-1480-4-37, which is resolvable), a DOI can also be used as an identifier, as a shorthand for its HTTP(S) equivalent. The (publishing) community will have to maintain the list of identifier schemes that are usable that way, together with their canonical resolvers.
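
The shorthand expansion described above could be sketched as follows. This is only an illustration: the registry name CANONICAL_RESOLVERS and the function to_canonical_uri are hypothetical, and the DOI entry is the only mapping taken from the text.

```python
# Hypothetical registry of non-HTTP schemes and their canonical resolvers.
# Only the DOI entry comes from the text; the community would maintain the real list.
CANONICAL_RESOLVERS = {
    "doi": "http://dx.doi.org/",
}


def to_canonical_uri(identifier: str) -> str:
    """Return the HTTP(S) form of an identifier, expanding known shorthands."""
    scheme, separator, rest = identifier.partition(":")
    if not separator:
        raise ValueError(f"identifier has no scheme: {identifier!r}")
    scheme = scheme.lower()
    if scheme in ("http", "https"):
        # Already an HTTP(S) URI, usable as a CID as-is.
        return identifier
    resolver = CANONICAL_RESOLVERS.get(scheme)
    if resolver is None:
        raise ValueError(f"no canonical resolver known for scheme {scheme!r}")
    return resolver + rest
```

With this sketch, to_canonical_uri("doi:10.1186/2041-1480-4-37") yields the dx.doi.org URI quoted above, and an identifier in an unregistered scheme is rejected rather than guessed at.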

2. The CID must be stored in the metadata of the EPUB-WEB document.

In EPUB3 that would mean, for example, that the CID must be stored as part of the opf file. Where such metadata will reside for EPUB-WEB is to be decided separately.
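
For the EPUB3 case, one possible reading is that the CID is the dc:identifier element designated by the package's unique-identifier attribute. That placement is an assumption made here for illustration, not something this draft decides; the function name read_cid is hypothetical. A minimal sketch using the Python standard library:

```python
import xml.etree.ElementTree as ET

# Namespaces used by EPUB3 package (opf) files.
NAMESPACES = {
    "opf": "http://www.idpf.org/2007/opf",
    "dc": "http://purl.org/dc/elements/1.1/",
}


def read_cid(opf_xml: str):
    """Return the identifier designated by unique-identifier, or None.

    Assumption: the CID is stored in that dc:identifier element.
    """
    package = ET.fromstring(opf_xml)
    wanted_id = package.get("unique-identifier")
    for element in package.findall("opf:metadata/dc:identifier", NAMESPACES):
        if element.get("id") == wanted_id:
            return (element.text or "").strip()
    return None
```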

3. If an EPUB-WEB document contains copies of Web resources, the mapping between the CID-relative URI-s for those resources and the resources' original URI-s must be stored as a separate mapping table (referred to as "IDMAP").

4. Reading systems and/or EPUB-WEB compatible browsers should use the IDMAP as a possible URI redirection.

I.e., if an EPUB-WEB document refers to a CID-relative URI, and that URI cannot be dereferenced, the original URI in the IDMAP (if present) should be used. Also, if a general URI must be dereferenced, and that URI is present in the IDMAP, its CID-relative URI should be used instead.
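
The two redirection rules can be sketched as a single lookup. The representation is an assumption: the IDMAP is modeled as a dictionary from CID-relative URI to original URI, and can_dereference is a hypothetical callback supplied by the reading system to report whether a URI is currently reachable.

```python
from urllib.parse import urljoin


def resolve(reference, cid, idmap, can_dereference):
    """Pick the URI a reading system should fetch for a reference.

    idmap maps CID-relative URI -> original URI (assumed representation).
    can_dereference(uri) -> bool is provided by the caller.
    """
    # A general URI listed in the IDMAP is replaced by its CID-relative URI.
    relative_for = {original: relative for relative, original in idmap.items()}
    relative = relative_for.get(reference, reference)

    # Interpret the reference with the CID as base URI.
    target = urljoin(cid, relative)

    # A CID-relative URI that cannot be dereferenced falls back to the original.
    if relative in idmap and not can_dereference(target):
        return idmap[relative]
    return target
```

In this sketch a reference that does not appear in the IDMAP at all is simply resolved against the CID, so absolute URI-s pass through unchanged.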

Note that the Web Packaging document offers a mechanism that is somewhat reminiscent of this scheme, insofar as each part in a package is identified by a relative URI based on the overall URI in the header of the whole package. The one difference is that the header there is (probably?) an HTTP URI; it is not clear how a canonical, non-HTTP URI fits into this picture. See the wiki page on packaging.

Scenarios

Cross references locally

An EPUB-WEB document is created as a "complete" EPUB-WEB document, containing all the necessary resources (CSS files, images, etc.). The local environment (e.g., a reading system) maintains its own administration, which maps the CID-s of documents to the locations of their local copies. If a cross reference between two EPUB-WEB documents occurs, it can be resolved locally and off-line if both are available as local documents; otherwise, the reference is used to access the remote resource on the Web. This mechanism relies on the fact that all documents carry their CID.

That also means that any EPUB-WEB document can have references to any other EPUB-WEB document using CID-s, and these references will work regardless of their off-line and/or on-line status (provided, of course, that the EPUB-WEB document is also available on-line).
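
The local administration described in this scenario could be as simple as a table keyed by CID. The class and method names below are hypothetical, and the sketch ignores persistence and fragment identifiers:

```python
class LocalLibrary:
    """Hypothetical reading-system registry: CID -> location of the local copy."""

    def __init__(self):
        self._local_copies = {}

    def register(self, cid, local_path):
        """Record that the document identified by cid is stored at local_path."""
        self._local_copies[cid] = local_path

    def locate(self, cid):
        """Return ("local", path) if a copy is held, else ("remote", cid)."""
        if cid in self._local_copies:
            return ("local", self._local_copies[cid])
        # No local copy: the CID itself is the Web address to fetch.
        return ("remote", cid)
```

A cross reference then carries only the target's CID; whether it is served from disk or from the Web is decided at lookup time.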

Documents with large datasets changing from off-line to on-line

An EPUB-WEB document may include large datasets or video files. The publisher may have included these files in the off-line distribution, but they are, in fact, local copies of files on the Web. The publisher has therefore added the HTTP(S) addresses of those resources to the IDMAP. When the document is transformed into an on-line document by a browser, those resources are not loaded into the browser directly; instead, a reference to the on-line version, retrieved from the IDMAP, may be used via the Web address.

A Web page is archived locally as an EPUB-WEB document

While creating the EPUB-WEB document, each constituent (e.g., an HTML file) is inspected for references to resources that are necessary for the completeness of the resulting document. This includes, typically, CSS or JavaScript files, images, videos, etc. These resources may be remote, i.e., not identified through a relative URI (with the starting document as a base). In such a case, the resource is copied and stored in the resulting EPUB-WEB document under a CID-relative URI, and the original URI is also stored in the IDMAP. This ensures possible roundtripping, stability of annotations (e.g., when someone annotates such an image), and the possibility to store the original HTML source file without modification, instead of changing the references on the fly.
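
The IDMAP construction during archiving might look like the sketch below. It assumes the references have already been extracted from the constituent files; the function name build_idmap, the "resources/" prefix, and the collision-naming scheme are all hypothetical. Copying the resources themselves is left out.

```python
import posixpath
from urllib.parse import urlparse


def build_idmap(references, prefix="resources/"):
    """Map a new CID-relative URI to each remote reference.

    Relative references are skipped: they are already part of the document.
    """
    idmap = {}
    already_mapped = set()
    for reference in references:
        parsed = urlparse(reference)
        if not (parsed.scheme or parsed.netloc):
            continue  # relative URI, nothing to record
        if reference in already_mapped:
            continue  # the same remote resource referenced twice
        name = posixpath.basename(parsed.path) or "resource"
        candidate = prefix + name
        counter = 1
        while candidate in idmap:
            # Two different resources share a file name: disambiguate.
            candidate = f"{prefix}{counter}-{name}"
            counter += 1
        idmap[candidate] = reference
        already_mapped.add(reference)
    return idmap
```

Because the mapping is kept separately, the archived HTML source can keep its original references untouched, as noted above.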

Metadata or annotation anchoring

Usage of the CID, and the CID-relative URI-s, means that annotations and metadata may use those as anchors. This means that those annotations and metadata would seamlessly flow between off-line and on-line states.
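
As a small illustration of why such anchors survive the change of state: if annotations are keyed by the absolute form of a CID-relative URI, the lookup does not depend on where the document was loaded from. The names anchor_uri and annotations_for, and the dictionary-based annotation store, are assumptions of this sketch.

```python
from urllib.parse import urljoin


def anchor_uri(cid, relative_uri):
    """Absolute anchor for a resource (or fragment) inside a document."""
    return urljoin(cid, relative_uri)


def annotations_for(store, cid, relative_uri):
    """Look up annotations by anchor; store maps anchor URI -> list of notes."""
    return store.get(anchor_uri(cid, relative_uri), [])
```

The physical location of the document (a local file or a Web server) never enters the key, only the CID does.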