Interpreting a PDF Structure Tree as XML #789

davidcarlisle (Maintainer) · 2025-01-29T18:39:41Z

Interpreting a PDF Structure Tree as XML

Introduction

This document describes an attempt to model a PDF structure tree as XML and the design of Relax NG schemas to express constraints on that tree. The schemas were developed in conjunction with a Lua script that extracts XML from a tagged PDF file, although conceptually the schemas are not tied to a specific tool.

Tools

The Lua script to extract XML and the Relax NG schemas described below are available in the pdf_structure repository on GitHub.

An online tool allowing a tagged PDF to be uploaded and extracted XML validated is also available at texlive.net/showtags. Links to this online version with preloaded examples are shown below, but you can also upload your own examples (they are not kept on the server).

Why XML?

The structure tree is defined (in section 17.7.1 of PDF 2.0) as an annotated tree that is similar to, but explicitly not, XML.

This leads the PDF specification into an anomalous situation. It introduces MathML Structure elements, and normatively references the MathML specification, but nothing in the MathML specification applies to structure elements as they are not XML. Similarly, structure elements have a standard Schema attribute that may reference an XML Schema but there is no mechanism to apply an XML schema to structure elements.

Languages with a non-XML syntax may define a parsing model that produces an XML object model (notably this is true of HTML), which makes it possible to apply XML schemas and other tools to such languages.

Alternatively, it is possible to extract (or convert) the non-XML format to an XML representation. This is similar to, but hopefully simpler than, the steps specified in the Derivation to HTML document.

As far as possible here, we aim for a "direct" translation of the structure tree so that validating the resulting XML may be seen as validating the original document, rather than validating the conversion.

XML Elements generated by `show-pdf-tags`

As described below, most element names relate directly to the names of PDF Structure Elements in the Structure Tree of the PDF file. show-pdf-tags does make use of three additional elements (in no-namespace).

PDF: The top-level document element is always <PDF>, currently with no attributes, although potentially this could be extended to have attributes (for example, for the PDF version).
StructTreeRoot: This element represents the root of the structure tree. It is currently the only allowed child of <PDF>, although future extensions could potentially expose other children of <PDF> such as the catalog or XMP metadata.
AssociatedFile: The content of any element that represents a structure element with an Associated File starts with a sequence of AssociatedFile elements, each with a name attribute holding the filename and content representing the file contents.

Modeling PDF Structure elements as XML

Element Names

Mostly the structure element name (/S) can be modeled by an XML element of the same name after first decoding any # encoded bytes. If the name is not a legal XML element name it must be encoded (by an encoding to be specified).
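As a minimal sketch of the decoding step, the following Python function expands the `#xx` escapes of a PDF name and then checks whether the result could be used directly as an XML element name. The function names are hypothetical, the name check is a conservative ASCII-only approximation of the XML name rules, and interpreting the decoded bytes as UTF-8 is an assumption; it does not reflect how show-pdf-tags is implemented.

```python
import re

# Conservative ASCII-only approximation of an XML NCName (assumption:
# the real XML Name production also admits many non-ASCII characters).
_ASCII_NCNAME = re.compile(r"[A-Za-z_][A-Za-z0-9._-]*\Z")


def decode_pdf_name(raw: bytes) -> str:
    """Decode #xx escapes in a PDF name given without its leading slash."""
    data = re.sub(
        rb"#([0-9A-Fa-f]{2})",
        lambda m: bytes([int(m.group(1), 16)]),
        raw,
    )
    # Assumption: treat the bytes as UTF-8, falling back to Latin-1.
    try:
        return data.decode("utf-8")
    except UnicodeDecodeError:
        return data.decode("latin-1")


def is_usable_element_name(name: str) -> bool:
    """True if the decoded name can be used as an XML element name as is."""
    return _ASCII_NCNAME.match(name) is not None
```

A name such as `Lime#20Green` decodes to a string containing a space, so it would need the further (as yet unspecified) encoding before it could serve as an element name.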

Structure Element properties

Certain properties (entries in the dictionary) are modeled by XML attributes in no namespace:

Structure Element Key	XML Attribute
ID	id
C	class
R	revision
T	title
Lang	lang
Alt	alt
E	expansion
ActualText	actualtext
Phoneme	phoneme
PhoneticAlphabet	phonetic-alphabet
AF	af

Note that these names may clash with attributes specified in other ways, as detailed below. They were chosen because the PDF specification and the Derivation to HTML document imply that (for example) MathML structure elements should be valid MathML and directly usable when deriving HTML, so existing attributes such as id or lang need to be used here. An alternative extraction would place these attributes in a unique namespace, such as pdf:T="this" rather than title="this" (see the description of the Standard Structure Element Attribute Owners below), but then the Derivation to HTML document would need to define how to produce valid MathML XML elements from a MathML structure element.

The /AF key is mapped to an af attribute containing a space-separated list of file names. The contents of each associated file are extracted and shown in the XML as the content of an AssociatedFile element. Associated files containing MathML are shown directly as AssociatedFile content, allowing the MathML to be validated. All other associated file content is encoded as XML text (replacing < and & by entity references).

PDF Structure Element Attributes

Unlike XML attributes, attributes on a structure element (/A) may be structured. Here we model the Attribute Owner as an XML namespace (this seems to be the expected usage) and the attribute value as a string. If the attribute references objects that cannot be expressed as a simple string, the string value in the XML attribute is a representation of the structure in Lua table notation (using { and ,), which is similar to JSON. It would be more natural to represent these structured values as XML elements, but then the implied relationship between structure elements and XML would be completely unusable: no MathML structure elements would result in valid XML, and similarly no elements in a PDF namespace that has a Schema specified would be valid to that schema.

So the ACM example below includes attributes that contain arrays, and arrays of arrays, of numbers represented as

Layout:BorderColor="{ 0, 0, 0 }"
...
Layout:BorderColor="{ { 0, 0, 0 }, { 0, 0, 0 } }"
...
Layout:BorderColor="{ [2] = { 0, 0, 0 } }"

Null entries are not shown explicitly; the [n] = index convention is used if there are non-null entries after a null.
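The three forms above can be reproduced by a small serialiser. The following Python sketch is one reading of the convention (nulls dropped, explicit indices once a null has been skipped); the function name is hypothetical and the handling of booleans and of arrays with no non-null entries is an assumption, not something taken from the extractor.

```python
def to_lua_notation(value):
    """Serialise nested lists in the Lua-like notation used for attributes."""
    if isinstance(value, (list, tuple)):
        parts = []
        seen_null = False
        # Lua tables are 1-indexed, so explicit indices start at 1.
        for index, item in enumerate(value, start=1):
            if item is None:
                seen_null = True  # nulls are skipped, not printed
                continue
            text = to_lua_notation(item)
            # After a skipped null, positions must be given explicitly.
            parts.append(f"[{index}] = {text}" if seen_null else text)
        # Assumption: an array with no non-null entries prints as "{ }".
        return "{ " + ", ".join(parts) + " }" if parts else "{ }"
    if isinstance(value, bool):
        return "true" if value else "false"  # assumption: Lua spelling
    return str(value)
```

With this reading, `[[0, 0, 0], [0, 0, 0], None, None]` and `[[0, 0, 0], [0, 0, 0]]` serialise to the same string, which is why trailing nulls are not distinguishable in the XML.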

There are some predetermined Attribute Owners such as Layout. To model these as XML namespaces, fixed namespace URIs are used, formed by appending the owner name to the PDF structure element namespace URI http://iso.org/pdf/ssn, as follows:

Owner	namespace URI
Layout	http://iso.org/pdf/ssn/Layout
PrintField	http://iso.org/pdf/ssn/PrintField
Table	http://iso.org/pdf/ssn/Table
List	http://iso.org/pdf/ssn/List
Artifact	http://iso.org/pdf/ssn/Artifact

Other attributes on the structure element should have a /NSO (Namespace Owner) entry pointing at a PDF Namespace dictionary, which should have a Namespace URI entry.

The most problematic class of attributes to represent are also the most common: an XML attribute in no-namespace. (Unprefixed attributes do not inherit the namespace in scope for unprefixed element names.)

For example

<mo xmlns="http://www.w3.org/1998/Math/MathML" rspace="3pt">+</mo>

After discussions with the PDF/UA Technical Working Group, we use the convention that a PDF Attribute with Owner /NSO and /NS referencing the same namespace object as the element is expressed in XML as a no-namespace attribute.

So the above would represent a structure element /mo in namespace http://www.w3.org/1998/Math/MathML with attribute rspace with Owner /NSO and /NS referencing the same namespace object as the mo element.

This interpretation makes it slightly inconvenient to model an XML element with an attribute in the same namespace as the element itself, such as

<m:mo xmlns:m="http://www.w3.org/1998/Math/MathML" m:rspace="3pt">+</m:mo>

This is not a serious problem for MathML which has no attributes in the MathML namespace, but could be a problem for other vocabularies. If this is needed, the element and attribute would need to reference different Namespace objects (that used the same URI).
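The namespace rules in this section can be summarised in a short Python sketch. The `Namespace` class stands in for a PDF namespace dictionary, with Python object identity playing the role of "the same namespace object"; all names here are hypothetical and owners other than the five standard ones and /NSO are deliberately left unhandled.

```python
from dataclasses import dataclass
from typing import Optional

STANDARD_OWNER_BASE = "http://iso.org/pdf/ssn/"
STANDARD_OWNERS = {"Layout", "PrintField", "Table", "List", "Artifact"}


@dataclass(eq=False)  # compared by identity, like indirect PDF objects
class Namespace:
    uri: str


def attribute_namespace(owner: str,
                        attr_ns: Optional[Namespace],
                        element_ns: Optional[Namespace]) -> Optional[str]:
    """Return the XML namespace URI of an attribute, or None for no-namespace."""
    if owner in STANDARD_OWNERS:
        return STANDARD_OWNER_BASE + owner
    if owner == "NSO" and attr_ns is not None:
        # Same namespace object as the element: unprefixed attribute.
        if attr_ns is element_ns:
            return None
        return attr_ns.uri
    raise ValueError(f"owner {owner!r} is not handled in this sketch")


mathml = Namespace("http://www.w3.org/1998/Math/MathML")
# A second object with the same URI models the workaround for
# attributes that really are in the element's namespace.
mathml_again = Namespace("http://www.w3.org/1998/Math/MathML")
```

Here `attribute_namespace("NSO", mathml, mathml)` gives a no-namespace attribute, while passing `mathml_again` as the attribute namespace gives an attribute in the MathML namespace.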

Alternatives that were considered were to model no-namespace XML attributes as PDF attributes having /NSO with a namespace whose URI is the empty string (it isn't clear if that is valid PDF), or to have a PDF attribute without /NSO but with Owner /UserProperties as below.

User Properties

If the structure element attribute has owner /UserProperties then each entry in the /P array is represented by an XML attribute in no-namespace, with its name taken from the /N entry and its value from the /V entry. The optional formatted /F entry is not represented.
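As an illustration only, the /P array can be pictured as a list of dictionaries. The Python sketch below (function name hypothetical, and converting non-string values with `str` is an assumption) keeps /N and /V and ignores /F.

```python
def user_properties_to_attributes(properties):
    """Map a /UserProperties /P array to no-namespace XML attributes."""
    attributes = {}
    for entry in properties:
        # /N is the property name and /V its value; /F is not represented.
        attributes[entry["N"]] = str(entry["V"])
    return attributes
```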

Additional Attributes

Currently, a rolemapped-from attribute is allowed on any element to denote that the element has been generated by following role maps (see below); similarly, a rolemaps-to attribute is allowed showing that an element name is used in a role map.

Attribute duplication

Currently, the extractor does not check if these conventions cause duplicate attributes on an XML element, either because a PDF Attribute occurs twice, or if it has one of the reserved names such as lang. In such a case the resulting XML will not be well formed, and it would not be possible to validate it.
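A check of this kind would be simple to add. The following Python sketch is not part of the extractor; it uses the key-to-attribute table given earlier and a hypothetical list of no-namespace attributes collected from /A, and reports each name that would occur more than once.

```python
KEY_TO_ATTRIBUTE = {
    "ID": "id", "C": "class", "R": "revision", "T": "title",
    "Lang": "lang", "Alt": "alt", "E": "expansion",
    "ActualText": "actualtext", "Phoneme": "phoneme",
    "PhoneticAlphabet": "phonetic-alphabet", "AF": "af",
}


def find_duplicate_attributes(properties, no_namespace_attributes):
    """Return attribute names that would appear twice on the XML element.

    properties: dict of structure element keys (e.g. {"Lang": "en"}).
    no_namespace_attributes: list of (name, value) pairs derived from /A.
    """
    seen = {KEY_TO_ATTRIBUTE[key] for key in properties
            if key in KEY_TO_ATTRIBUTE}
    duplicates = []
    for name, _value in no_namespace_attributes:
        if name in seen and name not in duplicates:
            duplicates.append(name)
        seen.add(name)
    return duplicates
```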

Attribute text

As noted above, PDF attributes may have nested structure, which is first serialised. The resulting string is then mapped to Unicode. Any Unicode characters that may not appear in XML are replaced by the conventional markers [NULL] and [CTRL].

Note that these last transformations are lossy, and [NULL] as text is not distinguished from an actual NULL. An alternative encoding could be specified, or the characters could be left as is and the resulting XML accepted as not well-formed (so the document necessarily invalid). This would, however, affect several documents in the PDF/UA-1 reference suite that have NULL bytes in alt texts and titles, e.g., PDFUA-Ref-2-01_Magazine-danish which starts

<Document>
 <Sect title="Forside[NULL]">

Element Content

Element content (/K) is modeled by nesting the XML elements.

Content items are represented as text for marked content regions that may be converted to text by following the ToUnicode mapping.

Unfortunately, this mapping may not be explicit in the PDF and may be embedded in the font data (the current show-pdf-tags extraction does not parse font data so cannot map this text, although that does not affect validation). Any characters that cannot be mapped to Unicode should be replaced by the replacement character U+FFFD.

Any Unicode characters that may not appear in XML are replaced by conventional markers [NULL], [CTRL].
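The marker replacement, which applies to both attribute values and text content, might look like the following Python sketch. It covers only U+0000 and the other C0 control characters that XML 1.0 disallows; other disallowed code points (such as U+FFFE) are not handled here, and the function name is hypothetical.

```python
import re

# C0 controls that XML 1.0 does not allow (tab, LF and CR are allowed).
_DISALLOWED = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f]")


def mark_disallowed_characters(text):
    """Replace NUL by [NULL] and other disallowed controls by [CTRL]."""
    return _DISALLOWED.sub(
        lambda m: "[NULL]" if m.group() == "\x00" else "[CTRL]",
        text,
    )
```

As noted for attributes, the replacement is lossy: a literal `[NULL]` in the source text gives the same output as an actual NUL character.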

Other non-text content items are marked by processing instructions denoting the object, such as

<?ReferencedObject type="Annot" page="57" ?>

Processing instructions are used here because most XML schema languages treat them like comments, so they are allowed at all points without invalidating the document. Using XML elements to represent content items would lead to more complicated schemas, as these elements would need to be explicitly allowed in all leaf elements.
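The same property makes the extracted XML convenient for generic tools. As a small illustration (using Python's standard `xml.etree.ElementTree`, whose default parser does not add processing instructions to the tree), the markers do not show up as child elements or in the text; the sample is adapted from the extractor output shown in this document and the function name is hypothetical.

```python
import xml.etree.ElementTree as ET

SAMPLE = """<Link>
 <?ReferencedObject type="Annot" page="1" ?>
 <Lbl><?MarkedContent page="1" ?>3.1</Lbl>
 <?MarkedContent page="1" ?>Einleitung
</Link>"""


def summarize(xml_text):
    """Return the child element names and the whitespace-normalised text."""
    root = ET.fromstring(xml_text)
    child_names = [child.tag for child in root]
    text = " ".join("".join(root.itertext()).split())
    return child_names, text
```

For the sample, only `Lbl` is reported as a child element and the text is the label followed by the entry title.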

Design of the Schema

The schemas used are closely modeled on ISO 32005 (2025 draft) and so while they cover both PDF 1.7 (PDF/UA-1) and PDF 2.0 (PDF/UA-2) they take a PDF-2 centric view and for PDF 1.7 enforce "Best Practice" validating containment constraints that were not explicit in PDF/UA-1.

The PDF 1.7 schema is identical to the PDF 2.0 schema except that all elements are in no-namespace, and elements new in PDF 2 are declared but have a mandatory rolemaps-to attribute to role map to a specific PDF/UA-1 element.

Modelling Constraints from ISO 32005

The constraints in 32005 are very weak, normally just numeric constraints with no specified element orderings (which are sometimes expressed in text).

The most common specification is 0..n. A set of children all with that constraint may be modeled by a repeated choice (a|b|c)*

Constraints marked 1..n may be similarly mapped to a schema using + rather than *, which has exactly the same meaning: one or more.

A constraint of 0..1 may be modeled with an a? and combined with , if the element must be first or (more commonly) & if it may appear in any position.

A constraint of ∅* means that the element may not appear if the parent is being used as a block element, but may appear if used as grouping. Grouping elements amongst other constraints may not contain text so this is modeled in the schema by providing the parent with a choice of grouping or block content model.

Constraints marked ‡, [a] or [b] all mean "refer to the text of the specification". These have been initially mapped as 0..n, the most relaxed constraint. If the text provides a schema-enforceable constraint such as "must come first in its parent" then this has been reflected in the schema. The current mapping of these constraints is probably incomplete.

For example /Form is declared as

Form = element pdf2:Form {
  pdf2-attributes,
  attribute Layout:Width {text}?,
  attribute Layout:Height {text}?,
  printfield-attributes,
(
(Caption? & (Part|Div|NonStruct|Private|Note|Code|Lbl|Reference|FENote|L|BibEntry|Table|Figure|Formula|Artifact)*) # Grouping
|
(Caption? & (text|Div|NonStruct|Private|Note|Lbl|FENote|BibEntry|Artifact)*) # Block
)
}

So it may have at most one Caption child. If used as a block-level element it may have text (character content) but not Part; if used as a grouping element it may have Part but not text.

It also takes Width and Height attributes in the Layout namespace, in addition to other PDF2 attributes.
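The translation from occurrence constraints to a content model of this shape is mechanical. The following Python sketch (hypothetical function, handling only the 0..1 and 0..n cases and ignoring ordering rules stated in text) builds the Relax NG compact syntax fragment from a table of child names and constraints.

```python
def content_model(constraints):
    """Build an RNC content model from {child name: "0..1" or "0..n"}."""
    unsupported = set(constraints.values()) - {"0..1", "0..n"}
    if unsupported:
        raise ValueError(f"not handled in this sketch: {sorted(unsupported)}")
    optional_once = [n for n, occ in constraints.items() if occ == "0..1"]
    repeatable = [n for n, occ in constraints.items() if occ == "0..n"]
    # 0..1 children may appear anywhere, so they are interleaved with &.
    parts = [f"{name}?" for name in optional_once]
    if repeatable:
        # 0..n children become a repeated choice.
        parts.append("(" + "|".join(repeatable) + ")*")
    return "(" + " & ".join(parts) + ")" if parts else "empty"
```

For example, `content_model({"Caption": "0..1", "text": "0..n", "Div": "0..n"})` gives `(Caption? & (text|Div)*)`, the same shape as the block content model of Form.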

Validating Standard Attributes

Attributes in the Layout (and other standard) owner namespace are validated for the attribute name, and for attributes taking one of an enumerated list of values, the value is also validated, so for example

  attribute Layout:WritingMode {"LrTb" | "RlTb" | "TbRl" | "TbLr" | "LrBt" | "RlBt" | "BtRl" | "BtLr"}?,
  attribute Layout:BackgroundColor {text}?,
  attribute Layout:BorderColor {text}?,

would mark Layout:WritingMode="left-right" as invalid as that is not an allowed value, but it does not (currently) try to validate color syntax, so Layout:BackgroundColor="a lighter shade of pale" would be valid (but wrong). However, attribute names are validated, so Layout:BackgroundColour="red" would be invalid for being British English spelling.
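The behaviour of these three declarations can be mimicked in a few lines of Python. This is only an illustration of what the schema accepts and rejects, restricted to the three attributes shown; it is not how validation is actually performed, and the function name and message texts are hypothetical.

```python
WRITING_MODES = {"LrTb", "RlTb", "TbRl", "TbLr", "LrBt", "RlBt", "BtRl", "BtLr"}

# None means "any text"; a set lists the enumerated values.
LAYOUT_ATTRIBUTES = {
    "WritingMode": WRITING_MODES,
    "BackgroundColor": None,
    "BorderColor": None,
}


def check_layout_attribute(name, value):
    """Return an error message, or None if the attribute is accepted."""
    if name not in LAYOUT_ATTRIBUTES:
        return f"unknown attribute Layout:{name}"
    allowed = LAYOUT_ATTRIBUTES[name]
    if allowed is not None and value not in allowed:
        return f"value {value!r} not allowed for Layout:{name}"
    return None
```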

Many attributes take numeric values; it would be possible to validate the number syntax in a Relax NG schema, although this is not currently done.

Barring bugs, all the enumeration types and the names of all attributes in the Layout, List, Table, PrintField and Artifact Attribute owners are declared in the schema.

The schema allows any attribute not in one of the declared namespaces, so any attributes in other standard attribute owners such as CSS3 are automatically valid:

# Standard Attribute Owner Namespaces (such as CSS3, HTML-5.00) apart from the ones listed here
# and also vendor-specific namespaced attributes are allowed with any name/value.
otherns-attributes =
  attribute (* - (nons:*|Layout:*|PrintField:*|Table:*|List:*|Artifact:*)) {text}*

PDF 2 elements

Elements new in PDF 2 are allowed in the PDF 1 schema, but with a required rolemap, for example:


Title = element pdf2:Title {
 pdf1rolemap-P,
 pdf2-attributes,
 (text|Part|Div|Aside|NonStruct|Private|P|Note|Code|Lbl|Em|Strong|Span|Quote|Link|Reference|Annot|Form|Ruby|Warichu|FENote|L|BibEntry|Table|Caption|Figure|Formula|Artifact)*
 }

Here pdf1rolemap-P has no effect in the PDF 2 schema, but in the PDF 1 schema it is defined to declare a required rolemaps-to="P" attribute. (Some elements allow any role map, and the current version may be over-strict in some cases; these could use pdf1rolemap-Any, which just requires the rolemaps-to attribute without enforcing any value.)

PDF 1 elements

PDF 1 specific Structure elements are allowed in both the PDF 1 and PDF 2 schema, following the constraints in ISO 32005. So for example, TOC is declared as

TOC = element pdf1:TOC {
pdf2-attributes,
   (Part|TOC|TOCI|NonStruct|Private|Caption|Artifact)*
   }

where the pdf1 namespace is no-namespace in PDF/UA-1 and http://iso.org/pdf/ssn in PDF/UA-2.

Extending The Schema for custom namespaces

Many documents use non-standard PDF structure elements with role maps mapping them to standard PDF structure element names.

When extracting the XML these role mappings may be followed, resulting in XML that should only use standard names and should be valid to the standard schema. The original name is recorded in a rolemapped-from attribute (which is always allowed). For example, the LaTeX-Project Bible example uses custom structure elements for books of the bible, so Genesis starts

  <Sect xmlns="http://iso.org/pdf2/ssn"
     id="ID.000242"
     xmlns:orig-ns="https://www.latex-project.org/ns/local/bible"
     rolemapped-from="orig-ns:Book"
    >
   <H2 xmlns="http://iso.org/pdf2/ssn"
      id="ID.000243"
      title="Genesis"
      xmlns:orig-ns="https://www.latex-project.org/ns/local/bible"
      rolemapped-from="orig-ns:Book-Title"
      referenced-as="3"
     >

Mapping to XML in this manner has the advantage that the XML may be validated with a standard schema but means that it is not convenient to apply additional constraints such as Books are contained in Testaments and may only contain Chapters.

An alternative mapping to XML instead uses the defined names as XML element names and records the standard PDF structure name obtained by role mapping in a rolemaps-to attribute. The same structure would be represented by

  <Book xmlns="https://www.latex-project.org/ns/local/bible"
     id="ID.000242"
     rolemaps-to="Sect"
    >
   <Book-Title xmlns="https://www.latex-project.org/ns/local/bible"
      id="ID.000243"
      title="Genesis"
      rolemaps-to="H2"
      referenced-as="3"
     >

The schemas are designed so that they may be easily imported into extension schemas, and suitable schemas are provided for LaTeX-Project generated examples. In the first form, Testaments, Books, and Chapters are all Sect and may be validated as such, but this allows nesting in any combination. The second form requires a custom schema, but the provided latex-bible schema will enforce that a document with a top-level Testament only contains Books, which only contain Chapter elements, which only contain Verse elements.
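The two extraction modes differ only in which name becomes the element name. The Python sketch below illustrates this under simplifying assumptions: namespaces are ignored (the real output records the original namespace with a prefix), the set of standard names is a small illustrative subset, and all function names are hypothetical.

```python
STANDARD_NAMES = {"Sect", "H2", "P", "Span", "Strong"}  # illustrative subset


def follow_role_map(name, role_map, standard_names=STANDARD_NAMES):
    """Follow role map entries until a standard structure name is reached."""
    seen = set()
    current = name
    while current not in standard_names:
        if current in seen or current not in role_map:
            raise ValueError(f"{name!r} does not map to a standard name")
        seen.add(current)
        current = role_map[current]
    return current


def element_for(name, role_map, follow=True):
    """Return (element name, extra attributes) for one structure element."""
    standard = follow_role_map(name, role_map)
    if standard == name:
        return name, {}
    if follow:
        # First form: standard name, original recorded in rolemapped-from.
        return standard, {"rolemapped-from": name}
    # Second form: original name, target recorded in rolemaps-to.
    return name, {"rolemaps-to": standard}


BIBLE_ROLE_MAP = {"Book": "Sect", "Book-Title": "H2"}
```

With `BIBLE_ROLE_MAP`, a Book becomes `Sect` with `rolemapped-from="Book"` when role maps are followed, and stays `Book` with `rolemaps-to="Sect"` otherwise.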

Provided Schema

The schemas are all available on GitHub at https://github.com/latex3/pdf_structure. Links to each schema are in the headings below.

document-pdf-ua2

This is the main schema. As far as possible it encodes all the parent/child constraints from ISO 32005, and all the Standard Attribute names, and enumerated values from PDF 2.0 (ISO 32000-2).

document-pdf-ua1

Apart from some slightly different initial declarations, the main body of this schema is a copy of the UA-2 version.

The main differences are:

Standard Structure elements are in no-namespace.
```
namespace pdf1 = ""
namespace pdf2 = ""
```

No MathML Structure Elements

# No MathML Structure Elements
math = notAllowed

PDF 2 specific elements have mandatory role map attributes, and in some cases specified values.

# Attributes When Rolemaps are needed for PDF2  elements used in PDF1
pdf1rolemap-Any = attribute rolemaps-to {text}
pdf1rolemap-P = attribute rolemaps-to {"P"}
pdf1rolemap-Span = attribute rolemaps-to {"Span"}
pdf1rolemap-Note = attribute rolemaps-to {"Note"}

More top level elements are allowed, not just a single Document

# UA-1 does not force a single document element.
start = Document|DocumentFragment|Part|Art|Div|Sect|TOC|Aside|BlockQuote|
        NonStruct|Private|P|Note|Code|Hn|H|Title|Link|Annot|Form|FENote|
        Index|L|Table|Figure|Formula|Artifact

wtpdf

This schema may in the end be merged with document-pdf-ua2. The base schema is essentially 32005 with additional constraints from 32000-2. This schema adds additional constraints from WTPDF.

Currently, it disallows H and Note and adds a new enumerated FENote:NoteType attribute for FENote.

latex-document

A schema that imports document-pdf-ua2 and then adds several elements modeling the structures in a LaTeX document.

latex-document17

A copy of the latex-document schema but based on the UA-1 schema.

latex-bible

A schema that extends the LaTeX UA-2 schema with specific Testament, Book, Chapter and Verse elements, then allowing inline elements from the LaTeX schema.

latex-bible17

PDF 1.7 version of the bible schema.

latex-play

A custom schema for the Shakespeare examples, which imports the latex-document schema and adds SceneDescription and Speaker, role mapped to Span and Strong respectively.

latex-play17

PDF 1.7 version of the Shakespeare schema.

latex-document-switch

This is the default schema used at the showtags web form. It validates a combined grammar that accepts UA-1 and UA-2 grammars, with all LaTeX extensions. The entire schema is just three lines:

ua2 = external "latex-bible.rnc"
ua1 = external "latex-bible17.rnc"
start = ua2 | ua1

Example

The Project example mathml-AF-ex1-se produces this XML (if role maps are not followed):

<Document xmlns="http://iso.org/pdf2/ssn"
   id="ID.002"
  >
 <text-unit xmlns="https://www.latex-project.org/ns/dflt"
    id="ID.005"
    rolemaps-to="Part"
   >
  <Title xmlns="http://iso.org/pdf2/ssn"
     id="ID.006"
    >
   <text xmlns="https://www.latex-project.org/ns/dflt"
      id="ID.007"
      xmlns:Layout="http://iso.org/pdf/ssn/Layout"
      Layout:TextAlign="Center"
      rolemaps-to="P"
     >
    <?MarkedContent page="1" ?>Math Test One
   </text>
  </Title>
  <text xmlns="https://www.latex-project.org/ns/dflt"
     id="ID.008"
     xmlns:Layout="http://iso.org/pdf/ssn/Layout"
     Layout:TextAlign="Center"
     rolemaps-to="P"
    >
   <?MarkedContent page="1" ?> David
  </text>
  <text xmlns="https://www.latex-project.org/ns/dflt"
     id="ID.009"
     xmlns:Layout="http://iso.org/pdf/ssn/Layout"
     Layout:TextAlign="Center"
     rolemaps-to="P"
    >
   <?MarkedContent page="1" ?>January 3, 2025
  </text>
 </text-unit>
 <Sect xmlns="http://iso.org/pdf2/ssn"
    id="ID.010"
   >
  <section xmlns="https://www.latex-project.org/ns/dflt"
     id="ID.011"
     rolemaps-to="H1"
    >
   <Lbl xmlns="http://iso.org/pdf2/ssn"
      id="ID.012"
     >
    <?MarkedContent page="1" ?>1 
   </Lbl>
   <?MarkedContent page="1" ?>Math Tests
  </section>
  <text-unit xmlns="https://www.latex-project.org/ns/dflt"
     id="ID.013"
     rolemaps-to="Part"
    >
   <text xmlns="https://www.latex-project.org/ns/dflt"
      id="ID.014"
      xmlns:Layout="http://iso.org/pdf/ssn/Layout"
      Layout:TextAlign="Justify"
      rolemaps-to="P"
     >
    <?MarkedContent page="1" ?>Some inline math, let 
    <Formula xmlns="http://iso.org/pdf2/ssn"
       id="ID.015"
       title="math"
       xmlns:Layout="http://iso.org/pdf/ssn/Layout"
       Layout:Placement="Inline"
      >
     <?MarkedContent page="1" ?>
     <math xmlns="http://www.w3.org/1998/Math/MathML"
        id="ID.016"
       >
      <mi xmlns="http://www.w3.org/1998/Math/MathML"
         id="ID.017"
        >
       <?MarkedContent page="1" ?>𝑥
      </mi>
     </math>
    </Formula>
    <?MarkedContent page="1" ?> and 
    <Formula xmlns="http://iso.org/pdf2/ssn"
       id="ID.018"
       title="math"
       xmlns:Layout="http://iso.org/pdf/ssn/Layout"
       Layout:Placement="Inline"
      >
     <?MarkedContent page="1" ?>
     <math xmlns="http://www.w3.org/1998/Math/MathML"
        id="ID.019"
       >
      <mi xmlns="http://www.w3.org/1998/Math/MathML"
         id="ID.020"
        >
       <?MarkedContent page="1" ?>𝑦
      </mi>
     </math>
    </Formula>
    <?MarkedContent page="1" ?> satisfy 
    <Formula xmlns="http://iso.org/pdf2/ssn"
       id="ID.021"
       title="math"
       xmlns:Layout="http://iso.org/pdf/ssn/Layout"
       Layout:Placement="Inline"
      >
     <?MarkedContent page="1" ?>
     <math xmlns="http://www.w3.org/1998/Math/MathML"
        id="ID.022"
       >
      <mi xmlns="http://www.w3.org/1998/Math/MathML"
         id="ID.023"
        >
       <?MarkedContent page="1" ?>𝑥
      </mi>
      <mo xmlns="http://www.w3.org/1998/Math/MathML"
         id="ID.024"
         lspace="0.278em"
         rspace="0.278em"
        >
       <?MarkedContent page="1" ?>>
      </mo>
      <mi xmlns="http://www.w3.org/1998/Math/MathML"
         id="ID.025"
        >
       <?MarkedContent page="1" ?>𝑦
      </mi>
     </math>
    </Formula>
    <?MarkedContent page="1" ?>.
   </text>
  </text-unit>
  <text-unit xmlns="https://www.latex-project.org/ns/dflt"
     id="ID.026"
     rolemaps-to="Part"
    >
   <text xmlns="https://www.latex-project.org/ns/dflt"
      id="ID.027"
      xmlns:Layout="http://iso.org/pdf/ssn/Layout"
      Layout:TextAlign="Justify"
      rolemaps-to="P"
     >
    <?MarkedContent page="1" ?>Some text, and an equation.
   </text>
   <Formula xmlns="http://iso.org/pdf2/ssn"
      id="ID.028"
      title="equation*"
      xmlns:Layout="http://iso.org/pdf/ssn/Layout"
      Layout:Placement="Block"
     >
    <math xmlns="http://www.w3.org/1998/Math/MathML"
       id="ID.029"
       display="block"
      >
     <msqrt xmlns="http://www.w3.org/1998/Math/MathML"
        id="ID.030"
       >
      <msup xmlns="http://www.w3.org/1998/Math/MathML"
         id="ID.031"
        >
       <mi xmlns="http://www.w3.org/1998/Math/MathML"
          id="ID.032"
         >
        <?MarkedContent page="1" ?>𝑥
       </mi>
       <mn xmlns="http://www.w3.org/1998/Math/MathML"
          id="ID.033"
         >
        <?MarkedContent page="1" ?>2
       </mn>
      </msup>
     </msqrt>
     <mo xmlns="http://www.w3.org/1998/Math/MathML"
        id="ID.034"
        lspace="0.278em"
        rspace="0"
       >
      <?MarkedContent page="1" ?>=
     </mo>
     <mo xmlns="http://www.w3.org/1998/Math/MathML"
        id="ID.035"
        lspace="0.278em"
        rspace="0"
        stretchy="false"
       >
      <?MarkedContent page="1" ?>|
     </mo>
     <mi xmlns="http://www.w3.org/1998/Math/MathML"
        id="ID.036"
       >
      <?MarkedContent page="1" ?>𝑥
     </mi>
     <mo xmlns="http://www.w3.org/1998/Math/MathML"
        id="ID.037"
        lspace="0"
        rspace="0"
        stretchy="false"
       >
      <?MarkedContent page="1" ?>|
     </mo>
    </math>
   </Formula>
  </text-unit>
 </Sect>
</Document>

Discussion of specific examples

Testing the schema against existing PDF/UA-1 and PDF/UA-2 files has shown some open issues detailed below. The issues may relate to problems with the extraction tool, problems with the schema, problems with the document, or no problem at all and simply that the schema is trying to enforce "best practice" and validation failures are sometimes expected.

In each case the heading will open the file at texlive.net allowing the XML to be extracted and validated against various schema options.

PDF/UA-1 Reference Suite.

A set of 9 test files distributed by the PDF Association (1-10, omitting 7 which was withdrawn).

PDFUA-Ref-2-01_Magazine-danish

Test 01 validates, but only because the XML extraction encodes the Unicode NULL character as [NULL]. The file has many attributes ending with a zero byte (as if for a C language string, but PDF strings are not null-terminated so this is character data):

<Document>
 <Sect title="Forside[NULL]">

Other XML representations could be considered, such as <NULL/> which would produce invalid (or in an attribute, not well formed) XML encodings.

PDFUA-Ref-2-02_Invoice

The extracted XML is valid to the schema.

PDFUA-Ref-2-03_AcademicAbstract

The extracted XML is valid to the schema; however, all ids contain spaces, e.g.,

 <P
    id="AD000000-0000-0000-ADBE-        1981"
    title=""
    xmlns:Layout="http://iso.org/pdf/ssn/Layout"
    Layout:LineHeight="11"
    Layout:SpaceAfter="15.75"
    Layout:TextIndent="10"
   >

The schema currently allows any text as id, although most XML vocabularies declare id attributes to require an NCName (i.e., a single name with no white space, like element and attribute names).

PDFUA-Ref-2-04_Presentation

No issues.

PDFUA-Ref-2-05_BookChapter-german

Similar to the case with example 01, the extracted XML is valid, but relies on Control characters, which are not allowed in XML, being represented by [CTRL].

For example, this Table of contents entry

      <Link
        >
       <?ReferencedObject type="Annot" page="1" ?>
       <Lbl
          title=""
         >
        <?MarkedContent page="1" ?>3.1
        <?MarkedContent page="1" ?> 
       </Lbl>
       <?MarkedContent page="1" ?>Einleitung 
       <Span
          xmlns:Layout="http://iso.org/pdf/ssn/Layout"
          Layout:TextDecorationType="Underline"
          Layout:LineHeight="11"
         >
        <?MarkedContent page="1" ?>[CTRL]
       </Span>
       <?MarkedContent page="1" ?> 2
      </Link>

PDFUA-Ref-2-06_Brochure

The extracted XML is invalid to the schema as there is a <Caption> element as a direct child of <Document> which is not allowed by ISO 32005 and so taken as best practice by the schema here.

 <Caption
    title=""
   >
  <P
     title=""
    >
   <?MarkedContent page="1" ?>Photo © 2009 by Jake Davies · 
   <Link
     >
    <?ReferencedObject type="Annot" page="1" ?>
    <?MarkedContent page="1" ?>ojisanjake.blogspot.de
   </Link>
  </P>
 </Caption>

The file is valid to the following complete Relax NG schema

include "document-pdf-ua1.rnc"
document.content |= Caption

Like several other files in this collection, however, it has many Alt and ActualText values ending in a null byte, for example:

   <Figure
      alt="Guiding stripes on sidewalks for the visually impaired[NULL]"

PDFUA-Ref-2-08_BookChapter

There are several marked content regions for which the text could not be extracted, which may be an issue with the tool being used.

The file is marked as invalid as it uses Height and Width Layout attributes on P:

   <P
      title=""
      xmlns:Layout="http://iso.org/pdf/ssn/Layout"
      Layout:BBox="{ 186.771, 490.903, 609.249, 750.249 }"
      Layout:Height="259.346"
      Layout:Width="422.478"
     >

Both PDF 1.7 (Table 10.29) and PDF 2.0 (Table 379) describe these attributes as only applying to Figure, Formula, Table, TH (Table header), or TD (Table data).

The document is valid to this schema which allows these attributes anywhere:


namespace Layout = "http://iso.org/pdf/ssn/Layout"
include "document-pdf-ua1.rnc" 

pdf2-attributes &= attribute Layout:Width {text}? &
                   attribute Layout:Height {text}?

PDFUA-Ref-2-09_Scanned

The extractor fails to find any structure here so produces an invalid empty tree. pdfinfo -struct-text does show a large tree. This is related to (mis)handling of cross reference streams by the Lua PDF library being used. To be investigated...

If the PDF is pre-processed with qpdf to resolve the cross reference streams, the XML is correctly extracted by show-pdf-tags.

The XML does not validate as it uses Height and Width on H4 elements:

  <H4
     title=""
     xmlns:Layout="http://iso.org/pdf/ssn/Layout"
     Layout:BBox="{ 10.7614, 525.977, 46.5068, 545.381 }"
     Layout:Width="35.7454"
     Layout:Height="19.4042"
    >

The XML does validate if Height and Width are allowed on these elements using the schema shown above for test 08.

PDFUA-Ref-2-10_Form

No issues.

Tagged Best Practice Guide

This is a freely available guide published by the PDF Association

Tagged-PDF-Best-Practice-Guide

The text of this guide does call out that Link in a TOCI element is not allowed but often used.

> 4.1.4.2 Links
>
> PDF 1.7 does not permit a <Link> element as a direct child of a <TOCI>, however, <Link> elements commonly do exist (even though they are not required or suggested by ISO 32000-1 or PDF/UA-1) within <TOCI> elements.

The Schema is enforcing the form allowed by PDF 1.7 and also ISO 32005, to have TOCI / Reference / Link.

So the extracted XML is invalid

 <TOC
   >
  <TOCI
    >
   <Link
     >
    <?ReferencedObject type="Annot" page="2" ?>
    <?MarkedContent page="2" ?>1 Background ......................................................... 4
   </Link>
   <?MarkedContent page="2" ?> 
  </TOCI>

It also has text as a direct child of TOCI, which is also flagged in validation.

Finally it has a Caption as direct child of Document.

The XML is valid to this schema:

include "document-pdf-ua1.rnc" {
  TOCI = element TOCI { (text|Link)*  }
}
document.content |= Caption

Other publicly accessible papers

Kumar & Wang

The recent paper on PDF Accessibility provides a good test file for this representation, making use of structured attributes as noted above. Currently, for example, BorderColor appears as

Layout:BorderColor="{ { 0, 0, 0 }, { 0, 0, 0 } }"

This shows only two of the four values: as documented above, it represents an array in which only the first two entries are not null.

The file does however use an attribute Layout:ADBE_MinWidth which the schema declares invalid as there is no attribute of that name listed for the Layout Attribute Owner in PDF 2.0.
