Interpreting a PDF Structure Tree as XML #789
davidcarlisle
started this conversation in
General
Introduction
This document describes an attempt to model a PDF structure tree as XML and the design of Relax NG schemas to express constraints on that tree. The schemas were developed in conjunction with a Lua script that extracts XML from a tagged PDF file, although conceptually the schemas are not tied to a specific tool.
Tools
The Lua script to extract XML and the Relax NG schemas described below are available in the pdf_structure repository on GitHub.
An online tool allowing a tagged PDF to be uploaded and extracted XML validated is also available at texlive.net/showtags. Links to this online version with preloaded examples are shown below, but you can also upload your own examples (they are not kept on the server).
Why XML?
The structure tree is defined (in section 17.7.1 of PDF 2.0) as an annotated tree that is similar to, but explicitly not, XML.
This leads the PDF specification into an anomalous situation. It introduces MathML Structure elements, and normatively references the MathML specification, but nothing in the MathML specification applies to structure elements as they are not XML. Similarly, structure elements have a standard `Schema` attribute that may reference an XML Schema, but there is no mechanism to apply an XML schema to structure elements.
For languages with a non-XML syntax, it is possible that they define a parsing model that produces an XML object model (notably this is true of HTML). So it is possible to apply XML schemas and other tools to such languages.
Alternatively, it is possible to extract (or convert) the non-XML format to an XML representation. This is similar to, but hopefully simpler than, the steps specified in the Derivation to HTML document.
As far as possible here, we aim for a "direct" translation of the structure tree so that validating the resulting XML may be seen as validating the original document, rather than validating the conversion.
XML Elements generated by `show-pdf-tags`
As described below, most element names relate directly to the names of PDF Structure Elements in the Structure Tree of the PDF file.
`show-pdf-tags` does make use of three additional elements (in no-namespace).
`PDF`
The top level document element is always `<PDF>`, currently with no attributes, although potentially this could be extended to have attributes (for example for the PDF version).
`StructTreeRoot`
This element represents the root of the structure tree. It is currently the only allowed child of `<PDF>`, although future extensions could potentially expose other children of `<PDF>` such as the catalog or XMP metadata.
`AssociatedFile`
The content of any element that represents a Structure element with an Associated File will start with a sequence of `AssociatedFile` elements, each with a `name` attribute holding the filename, and content representing the file contents.
Modeling PDF Structure elements as XML
Element Names
Mostly the structure element name (`/S`) can be modeled by an XML element of the same name after first decoding any `#` encoded bytes. If the name is not a legal XML element name it must be encoded (by an encoding to be specified).
Structure Element properties
Certain properties (entries in the dictionary) are modeled by XML attributes in no namespace
Note that these names may potentially clash with attributes specified in other ways as detailed below, but are chosen as the PDF specification and deriving to HTML document imply that (for example) MathML Structure Elements should be valid MathML and directly usable when deriving to HTML. Thus existing attributes such as `id` or `lang` need to be used here. An alternative extraction would place these attributes in a unique namespace such as `pdf:T="this"` rather than `title="this"`. See the description of the Standard Structure Element Attribute Owners below. But then the deriving to HTML document would need to define how to produce valid MathML XML elements from a MathML Structure element.
The `/AF` key is mapped to an `af` attribute containing a space separated list of file names; the contents of the associated file are extracted and shown in the XML as the content of `AssociatedFile` elements. Associated files containing MathML are directly shown as `AssociatedFile` content, allowing the MathML to be validated. All other associated file content is encoded as XML text (replacing `<` and `&` by entity references).
PDF Structure Element Attributes
Unlike XML attributes, attributes on a structure element (`/A`) may be structured. Here we model the Attribute Owner as an XML Namespace (this seems to be the expected usage) and the attribute value as a string. If the attribute references objects that may not be expressed in a string, the string value in the XML attribute will be a representation of the structure using `{` and `,` in a Lua notation similar to JSON. It would be more natural to represent these structured values as XML elements; however, the implied relationship between structure elements and XML would then be completely unusable, as no MathML Structure Elements would result in valid XML, and similarly no elements in a PDF Namespace that has a `Schema` specified would be valid to the schema.
So the ACM example below includes attributes that contain arrays of arrays of numbers represented as
`null` entries are not explicitly shown; the `[]` convention is used if there are non-null entries after a null.
There are some pre-determined Attribute Owners such as `Layout`. To model these as XML namespaces, fixed Namespace URLs are formed by prefixing the owner name with the PDF Structure element Namespace URL, as follows:
Other attributes on the structure element should have a `/NSO` (Namespace Owner) entry pointing at a PDF Namespace dictionary, which should have a Namespace URI entry.
The most problematic class of attributes to represent are also the most common: an XML attribute in no-namespace. (Unprefixed attributes do not inherit the namespace in scope for unprefixed element names.)
For example
After discussions with the PDF/UA Technical Working Group, we use the convention that a PDF Attribute with Owner `/NSO` and `/NS` referencing the same namespace object as the element is expressed in XML as a no-namespace attribute.
So the above would represent a structure element `/mo` in namespace `http://www.w3.org/1998/Math/MathML` with attribute `rspace` with Owner `/NSO` and `/NS` referencing the same namespace object as the `mo` element.
This interpretation makes it slightly inconvenient to model an XML element with an attribute in the same namespace as its parent element, so
This is not a serious problem for MathML which has no attributes in the MathML namespace, but could be a problem for other vocabularies. If this is needed, the element and attribute would need to reference different Namespace objects (that used the same URI).
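The convention described above can be sketched in Python. This is only an illustration, not the extractor's Lua code: the `Namespace` class and the `xml_attribute_name` helper are hypothetical names, and PDF namespace objects are modelled as plain Python objects compared by identity. The first few lines rely only on the standard `xml.etree.ElementTree` behaviour that an unprefixed attribute is in no namespace.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass

MATHML = "http://www.w3.org/1998/Math/MathML"

# A default namespace applies to element names, but an unprefixed
# attribute stays in no namespace.
mo = ET.fromstring(f'<mo xmlns="{MATHML}" rspace="0"/>')
print(mo.tag)           # {http://www.w3.org/1998/Math/MathML}mo
print(list(mo.attrib))  # ['rspace']


@dataclass(eq=False)  # eq=False: instances compare by identity, like PDF indirect objects
class Namespace:
    """Hypothetical stand-in for a PDF Namespace dictionary."""
    uri: str


def xml_attribute_name(element_ns, attribute_ns, local_name):
    """Return the XML attribute name in Clark notation ({uri}local).

    Assumed rule, following the convention in the text: an attribute whose
    namespace is the *same object* as the element's namespace becomes a
    no-namespace attribute; any other namespace object keeps its URI, even
    if that URI equals the element's.
    """
    if attribute_ns is element_ns:
        return local_name
    return f"{{{attribute_ns.uri}}}{local_name}"


mathml = Namespace(MATHML)
mathml_copy = Namespace(MATHML)  # different object, same URI

print(xml_attribute_name(mathml, mathml, "rspace"))       # rspace
print(xml_attribute_name(mathml, mathml_copy, "rspace"))  # {http://www.w3.org/1998/Math/MathML}rspace
```

The second call shows the workaround mentioned above: a distinct namespace object with the same URI yields an attribute that really is in the element's namespace.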
Alternatives that were considered would have been to model no-namespace XML attributes as PDF attributes having `/NSO` a namespace with URI the empty string (it isn't clear if that is valid PDF) or to have a PDF attribute without a `/NSO` but an Owner `/UserProperties` as below.
User Properties
If the structure element attribute has owner `/UserProperties` then each entry in the `/P` array is represented by an XML attribute in no-namespace with name using the `/N` entry, and value the `/V` entry. The optional formatted `/F` entry is not represented.
Additional Attributes
Currently, a `rolemapped-from` attribute is allowed on any element to denote that the element has been generated by following role maps (see below); similarly, a `rolemaps-to` attribute is allowed showing that an element name is used in a role map.
Attribute duplication
Currently, the extractor does not check if these conventions cause duplicate attributes on an XML element, either because a PDF Attribute occurs twice, or if it has one of the reserved names such as `lang`. In such a case the resulting XML will not be well formed, and it would not be possible to validate it.
Attribute text
As noted above, PDF Attributes may have nested structure that is first serialised. The resulting string is then mapped to Unicode. Any Unicode characters that may not appear in XML are replaced by conventional markers `[NULL]`, `[CTRL]`.
Note that these last transformations are lossy, and `[NULL]` as text is not distinguished from an actual NULL. An alternative encoding could be specified, or the characters could be left as is and the resulting XML accepted as not well-formed (so the document necessarily invalid). This would, however, affect several documents in the PDF/UA-1 reference suite that have NULL bytes in alt texts and titles, e.g., `PDFUA-Ref-2-01_Magazine-danish` which starts
Element Content
Element content (`/K`) is modeled by nesting the XML elements.
Content items are represented as text for marked content regions that may be converted to text by following the ToUnicode mapping.
Unfortunately, this mapping may not be explicit in the PDF and may be embedded in the font data (the current `show-pdf-tags` extraction does not parse font data so cannot map this text, although that does not affect validation). Any characters that may not be mapped to Unicode should be replaced by the replacement character U+FFFD.
Any Unicode characters that may not appear in XML are replaced by conventional markers `[NULL]`, `[CTRL]`.
Other non-text Content Items are marked by Processing Instructions denoting the object such as
Processing instructions are used here as they are treated like comments by most XML schema languages, so they are allowed at all points without invalidating the document. Using XML elements to represent Content items would lead to a more complicated schema as these would need to be explicitly allowed in all leaf elements.
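The two conventions of this section (marker text for characters that XML cannot hold, and processing instructions for non-text content items) can be sketched with Python's standard `xml.etree.ElementTree`. This is a minimal illustration and not the Lua extractor: the `xml_safe` helper and the `content-item` processing instruction target are made-up names, and only the C0 control characters are handled.

```python
import re
import xml.etree.ElementTree as ET

# C0 control characters that XML 1.0 forbids (tab, LF and CR are allowed).
_FORBIDDEN = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f]")


def xml_safe(text):
    """Replace characters that may not appear in XML by [NULL] or [CTRL]."""
    return _FORBIDDEN.sub(
        lambda m: "[NULL]" if m.group() == "\x00" else "[CTRL]", text
    )


# A structure element holding some text and one non-text content item.
p = ET.Element("P")
p.text = xml_safe("Title\x00 with\x07 bell")
p.append(ET.ProcessingInstruction("content-item", "type=image"))  # hypothetical target

print(ET.tostring(p, encoding="unicode"))
# <P>Title[NULL] with[CTRL] bell<?content-item type=image?></P>

# The processing instruction is not an element, so the set of child
# elements that a content model has to account for is unchanged.
element_children = [child for child in p if isinstance(child.tag, str)]
print(element_children)  # []
```

As noted above, the replacement is lossy: after `xml_safe` a literal `[NULL]` in the source text cannot be told apart from a replaced NULL character.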
Design of the Schema
The schemas used are closely modeled on ISO 32005 (2025 draft) and so while they cover both PDF 1.7 (PDF/UA-1) and PDF 2.0 (PDF/UA-2) they take a PDF-2 centric view and for PDF 1.7 enforce "Best Practice" validating containment constraints that were not explicit in PDF/UA-1.
The PDF 1.7 schema is identical to the PDF 2.0 schema except that all elements are in no-namespace, and new elements in PDF 2 are declared, but have a mandatory `rolemap-to` attribute to role map to a specific PDF/UA-1 element.
Modelling Constraints from ISO 32005
The constraints in 32005 are very weak, normally just numeric constraints with no specified element orderings (which are sometimes expressed in text).
The most common specification is `0..n`. A set of children all with that constraint may be modeled by a repeated choice `(a|b|c)*`
Constraints marked `1..n` may be similarly mapped to a schema using `+` rather than `*`, which has exactly the same meaning of one or more.
A constraint of `0..1` may be modeled with an `a?` and combined with `,` if the element must be first or (more commonly) `&` if it may appear in any position.
A constraint of `∅*` means that the element may not appear if the parent is being used as a block element, but may appear if used as grouping. Grouping elements amongst other constraints may not contain text so this is modeled in the schema by providing the parent with a choice of grouping or block content model.
Constraints marked `‡`, `[a]` or `[b]` all mean "refer to the text of the specification". These have been initially mapped as `0..n`, the most relaxed constraint. If the text provides a schema-enforceable constraint such as "must come first in its parent" then this has been reflected in the schema. The current mapping of these constraints is probably incomplete.
For example
`/Form` is declared as

    Form = element pdf2:Form {
      pdf2-attributes,
      attribute Layout:Width {text}?,
      attribute Layout:Height {text}?,
      printfield-attributes,
      (
        (Caption? & (Part|Div|NonStruct|Private|Note|Code|Lbl|Reference|FENote|L|BibEntry|Table|Figure|Formula|Artifact))* # Grouping
        |
        (Caption? & (text|Div|NonStruct|Private|Note|Lbl|FENote|BibEntry|Artifact)*) # Block
      )
    }

So it may have at most one `Caption` child, and if used as a Block level element may have `text` (character content) but may not have `Part`, but if used as a Grouping element may have `Part` but not text.
It also takes `Width` and `Height` attributes in the `Layout` namespace, in addition to other PDF2 attributes.
Validating Standard Attributes
Attributes in the `Layout` (and other standard) owner namespace are validated for the attribute name, and for attributes taking one of an enumerated list of values, the value is also validated, so for example
would mark `Layout:WritingMode="left-right"` as invalid as that is not an allowed value, but it does not (currently) try to validate color syntax so `Layout:BackgroundColor="a lighter shade of pale"` would be valid (but wrong). However, attribute names are validated so `Layout:BackgroundColour="red"` would be invalid for being British English spelling.
Many attributes take numeric values; it would be possible to validate the number syntax in a Relax NG schema, although this is not currently done.
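The kind of check described here (names always validated, values validated only for enumerated types) can be mimicked in a few lines of Python. This is a sketch under stated assumptions, not the Relax NG schema itself: the table below is a small hand-picked subset of the `Layout` attributes, `check_layout_attribute` is a hypothetical helper, and the `WritingMode` list omits values added in PDF 2.0.

```python
# Illustrative subset only; the real schemas declare every standard attribute.
LAYOUT_ATTRIBUTES = {
    # None means "any text is accepted" (the value syntax is not checked).
    "BackgroundColor": None,
    "Width": None,
    "Height": None,
    # Enumerated types: the value must be one of the listed names.
    "WritingMode": {"LrTb", "RlTb", "TbRl"},
    "TextAlign": {"Start", "Center", "End", "Justify"},
}


def check_layout_attribute(name, value):
    """Return None if the attribute is acceptable, else an error message."""
    if name not in LAYOUT_ATTRIBUTES:
        return f"unknown Layout attribute {name!r}"
    allowed = LAYOUT_ATTRIBUTES[name]
    if allowed is not None and value not in allowed:
        return f"{value!r} is not an allowed value for Layout:{name}"
    return None


print(check_layout_attribute("WritingMode", "left-right"))
# 'left-right' is not an allowed value for Layout:WritingMode
print(check_layout_attribute("BackgroundColor", "a lighter shade of pale"))
# None
print(check_layout_attribute("BackgroundColour", "red"))
# unknown Layout attribute 'BackgroundColour'
```

Validating numeric or color syntax would mean replacing the `None` entries with a pattern or parser, which corresponds to using datatype patterns in the schema.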
Barring bugs, all the enumeration types and all the names of all attributes in the `Layout`, `List`, `Table`, `PrintField` and `Artifact` Attribute owners are declared in the Schema.
The schema allows any attribute not in one of the declared namespaces, so any attributes in other standard attribute owners such as CSS3 are automatically valid:
PDF 2 elements
Elements new in PDF 2 are allowed in the PDF 1 schema, but with a required rolemap, for example:
Here `pdf1rolemap-P` has no effect in the PDF 2 schema, but in the PDF 1 schema it is defined to declare a required `rolemap-to="P"` attribute. (Some elements allow any role map; the current version may be over-strict in some cases, where it could instead use `pdf1rolemap-Any`, which just enforces the rolemap attribute without enforcing any value.)
PDF 1 elements
PDF 1 specific Structure elements are allowed in both the PDF 1 and PDF 2 schema, following the constraints in ISO 32005. So for example, `TOC` is declared as
where the `pdf1` namespace is no-namespace in PDF/UA-1 and `http://iso.org/pdf/ssn` in PDF/UA-2.
Extending The Schema for custom namespaces
Many documents use non standard PDF Structure elements with role maps mapping them to standard PDF Structure element names.
When extracting the XML these role mappings may be followed, resulting in an XML which should only use standard names and should be valid to the standard schema. The original name is recorded in a `rolemapped-from` attribute (which is always allowed). For example, the LaTeX-Project Bible example uses custom structure elements for books of the bible, so Genesis starts
Mapping to XML in this manner has the advantage that the XML may be validated with a standard schema, but means that it is not convenient to apply additional constraints, such as requiring that `Book` elements are contained in `Testament` elements and may only contain `Chapter` elements.
An alternative mapping to XML instead uses the defined names as XML element names and records the standard PDF structure name obtained by role mapping in a `rolemaps-to` attribute. The same structure would be represented by
The Schemas are designed so that they may be easily imported into extension schemas, and suitable schemas are provided for LaTeX-Project generated examples. In the first form, Testaments, Books, and Chapters are all `Sect` and may be validated as such, but this allows nesting in any combination. The second form requires a custom schema, but the provided `latex-bible` schema will enforce that a document with a top level `Testament` only contains `Book` elements, which only contain `Chapter` elements, which only contain `Verse` elements.
Provided Schema
The schema are all available in GitHub at https://github.com/latex3/pdf_structure. Links to each schema are in the headings below.
document-pdf-ua2
This is the main schema. As far as possible it encodes all the parent/child constraints from ISO 32005, and all the Standard Attribute names, and enumerated values from PDF 2.0 (ISO 32000-2).
document-pdf-ua1
Apart from some slightly different initial declarations, the main body of this schema is a copy of the UA-2 version.
The main differences are:
Standard Structure elements are in no-namespace.
No MathML Structure Elements
PDF 2 specific elements have mandatory role map attributes, and in some cases specified values.
More top level elements are allowed, not just a single `Document`.
wtpdf
This schema may in the end be merged with `document-pdf-ua2`. The base schema is essentially 32005 with additional constraints from 32000-2. This schema adds additional constraints from WTPDF.
Currently, it disallows `H` and `Note` and adds a new enumerated `FENote:NoteType` attribute for `FENote`.
latex-document
A schema that imports `document-pdf-ua2` then adds several elements modeling the structures in a LaTeX document.
latex-document17
A copy of the `latex-document` schema but based on the UA-1 schema.
latex-bible
A schema that extends the LaTeX UA-2 schema with specific `Testament`, `Book`, `Chapter` and `Verse` elements, then allowing inline elements from the LaTeX schema.
latex-bible17
PDF 1.7 version of the bible schema.
latex-play
A custom schema for the Shakespeare examples which imports the `latex-document` schema but adds `SceneDescription` and `Speaker` role mapped to `Span` and `Strong` respectively.
latex-play17
PDF 1.7 version of the Shakespeare schema.
latex-document-switch
This is the default schema used at the showtags web form. It validates a combined grammar that accepts UA-1 and UA-2 grammars, with all LaTeX extensions. The entire schema is just three lines:
Example
The Project example mathml-AF-ex1-se produces this XML (if role maps are not followed)
Discussion of specific examples
Testing the schema against existing PDF/UA-1 and PDF/UA-2 files has shown some open issues detailed below. The issues may relate to problems with the extraction tool, problems with the schema, problems with the document, or no problem at all and simply that the schema is trying to enforce "best practice" and validation failures are sometimes expected.
In each case the heading will open the file at texlive.net allowing the XML to be extracted and validated against various schema options.
PDF/UA-1 Reference Suite.
A set of 9 test files distributed by the PDF Association (1-10, omitting 7 which was withdrawn).
PDFUA-Ref-2-01_Magazine-danish
Test 01 validates; however, this only succeeds as the XML extraction encoded the Unicode NULL character as `[NULL]`. The file has many attributes ending with a zero byte (as if for a C language string, but PDF strings are not null terminated so this is character data):
Other XML representations could be considered, such as `<NULL/>`, which would produce invalid (or in an attribute, not well formed) XML encodings.
PDFUA-Ref-2-02_Invoice
The extracted XML is valid to the schema.
PDFUA-Ref-2-03_AcademicAbstract
The extracted XML is valid to the schema; however, all `id`s contain spaces, e.g.,
The schema currently allows any text as `id`, although most XML vocabularies declare id attributes to require an NMTOKEN (i.e., a single name with no white space, like element and attribute names).
PDFUA-Ref-2-04_Presentation
No issues.
PDFUA-Ref-2-05_BookChapter-german
Similar to the case with example 01, the extracted XML is valid, but relies on Control characters, which are not allowed in XML, being represented by `[CTRL]`.
For example, this Table of contents entry
PDFUA-Ref-2-06_Brochure
The extracted XML is invalid to the schema as there is a `<Caption>` element as a direct child of `<Document>`, which is not allowed by ISO 32005 and so taken as best practice by the schema here.
The file is valid to the following complete Relax NG schema
Although, like several other files in this collection, it has many `Alt` and `ActualText` values ending in a null byte, for example:
PDFUA-Ref-2-08_BookChapter
There are several marked content regions for which the text could not be extracted, which may be an issue with the tool being used.
The file is marked as invalid as it uses `Height` and `Width` `Layout` attributes on `P`:
Both PDF 1.7 (Table 10.29) and PDF 2.0 (Table 379) describe these attributes as only applying to Figure, Formula, Table, TH (Table header), or TD (Table data).
The document is valid to this schema which allows these attributes anywhere:
PDFUA-Ref-2-09_Scanned
The extractor fails to find any structure here so produces an invalid empty tree. `pdfinfo -struct-text` does show a large tree. This is related to (mis)handling of cross reference streams by the Lua PDF library being used. To be investigated...
If the PDF is pre-processed with `qpdf` to resolve the cross reference streams, an XML is correctly extracted by `show-pdf-tags`.
The XML does not validate as it uses `Height` on `H4` elements
The XML does validate if `Height` and `Width` are allowed on these elements using the schema shown above for test 08.
PDFUA-Ref-2-10_Form
No issues.
Tagged Best Practice Guide
This is a freely available guide published by the PDF Association
Tagged-PDF-Best-Practice-Guide
The text of this guide does call out that `Link` in a `TOCI` element is not allowed but often used.
The Schema is enforcing the form allowed by PDF 1.7 and also ISO 32005, to have `TOCI/Reference/Link`.
So the extracted XML is invalid
It also has text as a child of `TOCI` (which fails validation).
Finally it has a `Caption` as direct child of `Document`.
The XML is valid to this schema:
Other publicly accessible papers
Kumar & Wang
The recent paper on PDF Accessibility proves a good test file for this representation, making use of structured attributes as noted above. Currently for example `BorderColor` appears as
so only showing two of the four values; as documented above, this represents an array in which only the first two entries are not null.
The file does however use an attribute `Layout:ADBE_MinWidth` which the schema declares invalid as there is no attribute of that name listed for the `Layout` Attribute Owner in PDF 2.0.