Commit 3287664

serialize GroupItem meta prior to content, DocItem meta after content
Signed-off-by: Panos Vagenas <[email protected]>
1 parent a1cacfd commit 3287664

9 files changed, 35 insertions(+), 27 deletions(-)

docling_core/transforms/serializer/common.py

Lines changed: 11 additions & 3 deletions
@@ -46,6 +46,7 @@
     FloatingItem,
     Formatting,
     FormItem,
+    GroupItem,
     InlineGroup,
     KeyValueItem,
     ListGroup,
@@ -454,17 +455,20 @@ def get_parts(
             else:
                 my_visited.add(node.self_ref)
 
+            meta_part = create_ser_result()
+            node_is_group = isinstance(node, GroupItem)
             if (
                 not params.use_legacy_annotations
                 and node.self_ref not in self.get_excluded_refs(**kwargs)
             ):
-                part = self.serialize_meta(
+                meta_part = self.serialize_meta(
                     item=node,
                     level=lvl,
                     **kwargs,
                 )
-                if part.text:
-                    parts.append(part)
+                if meta_part.text and node_is_group:
+                    # for GroupItems add meta prior to content
+                    parts.append(meta_part)
 
             if params.include_non_meta:
                 part = self.serialize(
@@ -477,6 +481,10 @@ def get_parts(
                 if part.text:
                     parts.append(part)
 
+            if meta_part.text and not node_is_group:
+                # for DocItems add meta after content
+                parts.append(meta_part)
+
         return parts
 
     @override
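
The ordering rule introduced in `get_parts` can be illustrated in isolation. The following is a minimal sketch, not the docling-core implementation: `Node` and `ordered_parts` are hypothetical names, and each node is assumed to carry plain strings for its meta and content. It only mirrors the rule from the commit message (group meta before content, item meta after content), using strings from the `group_with_metadata_default.md` fixture.

```python
from dataclasses import dataclass


@dataclass
class Node:
    """Hypothetical stand-in for a document node (not a docling-core class)."""

    content: str = ""
    meta: str = ""
    is_group: bool = False


def ordered_parts(nodes: list[Node]) -> list[str]:
    """Return serialized parts with meta placed according to the node kind."""
    parts: list[str] = []
    for node in nodes:
        if node.meta and node.is_group:
            # group-like nodes: meta goes before the content
            parts.append(node.meta)
        if node.content:
            parts.append(node.content)
        if node.meta and not node.is_group:
            # item-like nodes: meta goes after the content
            parts.append(node.meta)
    return parts


nodes = [
    Node(
        content="Regarding foo...",
        meta="This paragraph provides more details about foo.",
    ),
    Node(
        content="1. lorem",
        meta="Here some foo specifics are listed.",
        is_group=True,
    ),
]
result = ordered_parts(nodes)
print("\n\n".join(result))
```

For the paragraph the meta text follows the content, while for the list group it precedes the list, which matches the order seen in the updated fixture files.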

test/data/doc/2408.09869v3_enriched.gt.md

Lines changed: 2 additions & 2 deletions
@@ -2,12 +2,12 @@
 
 <!-- page break -->
 
-In this image, we can see some text and images.
-
 Figure 1: Sketch of Docling's default processing pipeline. The inner part of the model pipeline is easily customizable and extensible.
 
 <!-- image -->
 
+In this image, we can see some text and images.
+
 licensing (e.g. pymupdf [7]), poor speed or unrecoverable quality issues, such as merged text cells across far-apart text tokens or table columns (pypdfium, PyPDF) [15, 14].
 
 We therefore decided to provide multiple backend choices, and additionally open-source a custombuilt PDF parser, which is based on the low-level qpdf [4] library. It is made available in a separate package named docling-parse and powers the default PDF backend in Docling. As an alternative, we provide a PDF backend relying on pypdfium , which may be a safe backup choice in certain cases, e.g. if issues are seen with particular font encodings.

test/data/doc/2408.09869v3_enriched_p1_include_annotations_false.gt.md

Lines changed: 4 additions & 4 deletions
@@ -1,9 +1,9 @@
 # Docling Technical Report
 
-In this image we can see a cartoon image of a duck holding a paper.
-
 <!-- image -->
 
+In this image we can see a cartoon image of a duck holding a paper.
+
 Version 1.0
 
 Christoph Auer Maksym Lysak Ahmed Nassar Michele Dolfi Nikolaos Livathinos Panos Vagenas Cesar Berrospi Ramis Matteo Omenetti Fabian Lindlbauer Kasper Dinkla Lokesh Mishra Yusik Kim Shubham Gupta Rafael Teixeira de Lima Valery Weber Lucas Morin Ingmar Meijer Viktor Kuropiatnyk Peter W. J. Staar
@@ -22,8 +22,6 @@ With Docling , we open-source a very capable and efficient document conversion t
 
 torch runtimes backing the Docling pipeline. We will deliver updates on this topic at in a future version of this report.
 
-{'summary': 'Typical Docling setup runtime characterization.', 'type': 'performance data'}
-
 Table 1: Runtime characteristics of Docling with the standard model pipeline and settings, on our test dataset of 225 pages, on two different systems. OCR is disabled. We show the time-to-solution (TTS), computed throughput in pages per second, and the peak memory used (resident set size) for both the Docling-native PDF backend and for the pypdfium backend, using 4 and 16 threads.
 
 | CPU | Thread budget | native backend | native backend | native backend | pypdfium backend | pypdfium backend | pypdfium backend |
@@ -32,6 +30,8 @@ Table 1: Runtime characteristics of Docling with the standard model pipeline and
 | Apple M3 Max | 4 | 177 s 167 s | 1.27 1.34 | 6.20 GB | 103 s 92 s | 2.18 2.45 | 2.56 GB |
 | (16 cores) Intel(R) Xeon E5-2690 | 16 4 16 | 375 s 244 s | 0.60 0.92 | 6.16 GB | 239 s 143 s | 0.94 1.57 | 2.42 GB |
 
+{'summary': 'Typical Docling setup runtime characterization.', 'type': 'performance data'}
+
 ## 5 Applications
 
 Thanks to the high-quality, richly structured document conversion achieved by Docling, its output qualifies for numerous downstream applications. For example, Docling can provide a base for detailed enterprise document search, passage retrieval or classification use-cases, or support knowledge extraction pipelines, allowing specific treatment of different structures in the document, such as tables, figures, section structure or references. For popular generative AI application patterns, such as retrieval-augmented generation (RAG), we provide quackling , an open-source package which capitalizes on Docling's feature-rich document output to enable document-native optimized vector embedding and chunking. It plugs in seamlessly with LLM frameworks such as LlamaIndex [8]. Since Docling is fast, stable and cheap to run, it also makes for an excellent choice to build document-derived datasets. With its powerful table structure recognition, it provides significant benefit to automated knowledge-base construction [11, 10]. Docling is also integrated within the open IBM data prep kit [6], which implements scalable data transforms to build large-scale multi-modal training datasets.

test/data/doc/2408.09869v3_enriched_p1_mark_annotations_false.gt.md

Lines changed: 4 additions & 4 deletions
@@ -1,9 +1,9 @@
 # Docling Technical Report
 
-In this image we can see a cartoon image of a duck holding a paper.
-
 <!-- image -->
 
+In this image we can see a cartoon image of a duck holding a paper.
+
 Version 1.0
 
 Christoph Auer Maksym Lysak Ahmed Nassar Michele Dolfi Nikolaos Livathinos Panos Vagenas Cesar Berrospi Ramis Matteo Omenetti Fabian Lindlbauer Kasper Dinkla Lokesh Mishra Yusik Kim Shubham Gupta Rafael Teixeira de Lima Valery Weber Lucas Morin Ingmar Meijer Viktor Kuropiatnyk Peter W. J. Staar
@@ -22,8 +22,6 @@ With Docling , we open-source a very capable and efficient document conversion t
 
 torch runtimes backing the Docling pipeline. We will deliver updates on this topic at in a future version of this report.
 
-{'summary': 'Typical Docling setup runtime characterization.', 'type': 'performance data'}
-
 summary: Typical Docling setup runtime characterization.
 type: performance data
 
@@ -35,6 +33,8 @@ Table 1: Runtime characteristics of Docling with the standard model pipeline and
 | Apple M3 Max | 4 | 177 s 167 s | 1.27 1.34 | 6.20 GB | 103 s 92 s | 2.18 2.45 | 2.56 GB |
 | (16 cores) Intel(R) Xeon E5-2690 | 16 4 16 | 375 s 244 s | 0.60 0.92 | 6.16 GB | 239 s 143 s | 0.94 1.57 | 2.42 GB |
 
+{'summary': 'Typical Docling setup runtime characterization.', 'type': 'performance data'}
+
 ## 5 Applications
 
 Thanks to the high-quality, richly structured document conversion achieved by Docling, its output qualifies for numerous downstream applications. For example, Docling can provide a base for detailed enterprise document search, passage retrieval or classification use-cases, or support knowledge extraction pipelines, allowing specific treatment of different structures in the document, such as tables, figures, section structure or references. For popular generative AI application patterns, such as retrieval-augmented generation (RAG), we provide quackling , an open-source package which capitalizes on Docling's feature-rich document output to enable document-native optimized vector embedding and chunking. It plugs in seamlessly with LLM frameworks such as LlamaIndex [8]. Since Docling is fast, stable and cheap to run, it also makes for an excellent choice to build document-derived datasets. With its powerful table structure recognition, it provides significant benefit to automated knowledge-base construction [11, 10]. Docling is also integrated within the open IBM data prep kit [6], which implements scalable data transforms to build large-scale multi-modal training datasets.

test/data/doc/2408.09869v3_enriched_p1_mark_meta_true.gt.md

Lines changed: 4 additions & 4 deletions
@@ -1,9 +1,9 @@
 # Docling Technical Report
 
-[Description] In this image we can see a cartoon image of a duck holding a paper.
-
 <!-- image -->
 
+[Description] In this image we can see a cartoon image of a duck holding a paper.
+
 Version 1.0
 
 Christoph Auer Maksym Lysak Ahmed Nassar Michele Dolfi Nikolaos Livathinos Panos Vagenas Cesar Berrospi Ramis Matteo Omenetti Fabian Lindlbauer Kasper Dinkla Lokesh Mishra Yusik Kim Shubham Gupta Rafael Teixeira de Lima Valery Weber Lucas Morin Ingmar Meijer Viktor Kuropiatnyk Peter W. J. Staar
@@ -22,8 +22,6 @@ With Docling , we open-source a very capable and efficient document conversion t
 
 torch runtimes backing the Docling pipeline. We will deliver updates on this topic at in a future version of this report.
 
-[Docling Legacy Misc] {'summary': 'Typical Docling setup runtime characterization.', 'type': 'performance data'}
-
 summary: Typical Docling setup runtime characterization.
 type: performance data
 
@@ -35,6 +33,8 @@ Table 1: Runtime characteristics of Docling with the standard model pipeline and
 | Apple M3 Max | 4 | 177 s 167 s | 1.27 1.34 | 6.20 GB | 103 s 92 s | 2.18 2.45 | 2.56 GB |
 | (16 cores) Intel(R) Xeon E5-2690 | 16 4 16 | 375 s 244 s | 0.60 0.92 | 6.16 GB | 239 s 143 s | 0.94 1.57 | 2.42 GB |
 
+[Docling Legacy Misc] {'summary': 'Typical Docling setup runtime characterization.', 'type': 'performance data'}
+
 ## 5 Applications
 
 Thanks to the high-quality, richly structured document conversion achieved by Docling, its output qualifies for numerous downstream applications. For example, Docling can provide a base for detailed enterprise document search, passage retrieval or classification use-cases, or support knowledge extraction pipelines, allowing specific treatment of different structures in the document, such as tables, figures, section structure or references. For popular generative AI application patterns, such as retrieval-augmented generation (RAG), we provide quackling , an open-source package which capitalizes on Docling's feature-rich document output to enable document-native optimized vector embedding and chunking. It plugs in seamlessly with LLM frameworks such as LlamaIndex [8]. Since Docling is fast, stable and cheap to run, it also makes for an excellent choice to build document-derived datasets. With its powerful table structure recognition, it provides significant benefit to automated knowledge-base construction [11, 10]. Docling is also integrated within the open IBM data prep kit [6], which implements scalable data transforms to build large-scale multi-modal training datasets.

test/data/doc/barchart.gt.md

Lines changed: 2 additions & 2 deletions
@@ -1,5 +1,3 @@
-Bar chart
-
 <!-- image -->
 
 | Number of impellers | single-frequency | multi-frequency |
@@ -10,3 +8,5 @@ Bar chart
 | 4 | 0.14 | 0.26 |
 | 5 | 0.16 | 0.25 |
 | 6 | 0.24 | 0.24 |
+
+Bar chart

test/data/doc/dummy_doc.yaml.md

Lines changed: 4 additions & 4 deletions
@@ -1,5 +1,9 @@
 # DocLayNet: A Large Human-Annotated Dataset for Document-Layout Analysis
 
+Figure 1: Four examples of complex page layouts across different document categories
+
+<!-- image -->
+
 ...
 
 Bar chart
@@ -8,10 +12,6 @@ CC1=NNC(C2=CN3C=CN=C3C(CC3=CC(F)=CC(F)=C3)=N2)=N1
 
 {'myanalysis': {'prediction': 'abc'}, 'something_else': {'text': 'aaa'}}
 
-Figure 1: Four examples of complex page layouts across different document categories
-
-<!-- image -->
-
 A description annotation for this table.
 
 {'foo': 'bar'}

test/data/doc/group_with_metadata_default.md

Lines changed: 2 additions & 2 deletions
@@ -8,10 +8,10 @@ This is some introductory text.
 
 This section talks about foo.
 
-This paragraph provides more details about foo.
-
 Regarding foo...
 
+This paragraph provides more details about foo.
+
 Here some foo specifics are listed.
 
 1. lorem

test/data/doc/group_with_metadata_marked.md

Lines changed: 2 additions & 2 deletions
@@ -8,10 +8,10 @@ This is some introductory text.
 
 [Summary] This section talks about foo.
 
-[Summary] This paragraph provides more details about foo.
-
 Regarding foo...
 
+[Summary] This paragraph provides more details about foo.
+
 [Summary] Here some foo specifics are listed.
 
 1. lorem
