Commit 5fc98e3

restore ser order for all nodeitems
Signed-off-by: Panos Vagenas <[email protected]>
1 parent 3287664 commit 5fc98e3

10 files changed, with 26 additions and 39 deletions

docling_core/transforms/serializer/common.py

Lines changed: 3 additions & 11 deletions
@@ -46,7 +46,6 @@
     FloatingItem,
     Formatting,
     FormItem,
-    GroupItem,
     InlineGroup,
     KeyValueItem,
     ListGroup,
@@ -455,20 +454,17 @@ def get_parts(
             else:
                 my_visited.add(node.self_ref)
 
-            meta_part = create_ser_result()
-            node_is_group = isinstance(node, GroupItem)
             if (
                 not params.use_legacy_annotations
                 and node.self_ref not in self.get_excluded_refs(**kwargs)
             ):
-                meta_part = self.serialize_meta(
+                part = self.serialize_meta(
                     item=node,
                     level=lvl,
                     **kwargs,
                 )
-                if meta_part.text and node_is_group:
-                    # for GroupItems add meta prior to content
-                    parts.append(meta_part)
+                if part.text:
+                    parts.append(part)
 
             if params.include_non_meta:
                 part = self.serialize(
@@ -481,10 +477,6 @@ def get_parts(
                 if part.text:
                     parts.append(part)
 
-            if meta_part.text and not node_is_group:
-                # for DocItems add meta after content
-                parts.append(meta_part)
-
         return parts
 
     @override
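
The net effect of this hunk is that a node's meta part is appended before its content part for every node type, where previously only group items had meta first. The following is a minimal, self-contained sketch of that ordering, not the docling-core implementation: `Node`, its fields, the simplified `get_parts` signature, and the sample strings are all hypothetical stand-ins chosen for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical stand-in for a document node (not a docling-core class)."""

    ref: str
    text: str = ""
    meta: str = ""
    children: list["Node"] = field(default_factory=list)


def get_parts(
    root: Node,
    *,
    use_legacy_annotations: bool = False,
    include_non_meta: bool = True,
    excluded_refs: frozenset[str] = frozenset(),
) -> list[str]:
    """Collect parts in document order, meta before content for each node."""
    parts: list[str] = []
    visited: set[str] = set()
    stack = [root]
    while stack:
        node = stack.pop()
        if node.ref in visited:
            continue
        visited.add(node.ref)

        # Meta comes first, regardless of whether the node is a group or an item.
        # As in the diff, the exclusion check here only gates the meta part.
        if not use_legacy_annotations and node.ref not in excluded_refs:
            if node.meta:
                parts.append(node.meta)

        # Content comes second.
        if include_non_meta and node.text:
            parts.append(node.text)

        # Push children reversed so they are popped in document order.
        stack.extend(reversed(node.children))
    return parts


# Illustrative tree: a group with a summary, holding a paragraph with its own summary.
sample = Node(
    ref="#/body",
    children=[
        Node(
            ref="#/groups/0",
            meta="section summary",
            children=[
                Node(
                    ref="#/texts/0",
                    text="paragraph text",
                    meta="paragraph summary",
                )
            ],
        )
    ],
)
print(get_parts(sample))
# ['section summary', 'paragraph summary', 'paragraph text']
```

In this sketch, passing `use_legacy_annotations=True` skips the meta step entirely, which loosely mirrors why the renamed legacy tests further down set that flag explicitly.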

test/data/doc/2408.09869v3_enriched.gt.md

Lines changed: 2 additions & 2 deletions
@@ -2,12 +2,12 @@
 
 <!-- page break -->
 
+In this image, we can see some text and images.
+
 Figure 1: Sketch of Docling's default processing pipeline. The inner part of the model pipeline is easily customizable and extensible.
 
 <!-- image -->
 
-In this image, we can see some text and images.
-
 licensing (e.g. pymupdf [7]), poor speed or unrecoverable quality issues, such as merged text cells across far-apart text tokens or table columns (pypdfium, PyPDF) [15, 14].
 
 We therefore decided to provide multiple backend choices, and additionally open-source a custombuilt PDF parser, which is based on the low-level qpdf [4] library. It is made available in a separate package named docling-parse and powers the default PDF backend in Docling. As an alternative, we provide a PDF backend relying on pypdfium , which may be a safe backup choice in certain cases, e.g. if issues are seen with particular font encodings.

test/data/doc/2408.09869v3_enriched_p1_include_annotations_false.gt.md

Lines changed: 0 additions & 4 deletions
@@ -2,8 +2,6 @@
 
 <!-- image -->
 
-In this image we can see a cartoon image of a duck holding a paper.
-
 Version 1.0
 
 Christoph Auer Maksym Lysak Ahmed Nassar Michele Dolfi Nikolaos Livathinos Panos Vagenas Cesar Berrospi Ramis Matteo Omenetti Fabian Lindlbauer Kasper Dinkla Lokesh Mishra Yusik Kim Shubham Gupta Rafael Teixeira de Lima Valery Weber Lucas Morin Ingmar Meijer Viktor Kuropiatnyk Peter W. J. Staar
@@ -30,8 +28,6 @@ Table 1: Runtime characteristics of Docling with the standard model pipeline and
 | Apple M3 Max | 4 | 177 s 167 s | 1.27 1.34 | 6.20 GB | 103 s 92 s | 2.18 2.45 | 2.56 GB |
 | (16 cores) Intel(R) Xeon E5-2690 | 16 4 16 | 375 s 244 s | 0.60 0.92 | 6.16 GB | 239 s 143 s | 0.94 1.57 | 2.42 GB |
 
-{'summary': 'Typical Docling setup runtime characterization.', 'type': 'performance data'}
-
 ## 5 Applications
 
 Thanks to the high-quality, richly structured document conversion achieved by Docling, its output qualifies for numerous downstream applications. For example, Docling can provide a base for detailed enterprise document search, passage retrieval or classification use-cases, or support knowledge extraction pipelines, allowing specific treatment of different structures in the document, such as tables, figures, section structure or references. For popular generative AI application patterns, such as retrieval-augmented generation (RAG), we provide quackling , an open-source package which capitalizes on Docling's feature-rich document output to enable document-native optimized vector embedding and chunking. It plugs in seamlessly with LLM frameworks such as LlamaIndex [8]. Since Docling is fast, stable and cheap to run, it also makes for an excellent choice to build document-derived datasets. With its powerful table structure recognition, it provides significant benefit to automated knowledge-base construction [11, 10]. Docling is also integrated within the open IBM data prep kit [6], which implements scalable data transforms to build large-scale multi-modal training datasets.

test/data/doc/2408.09869v3_enriched_p1_mark_annotations_false.gt.md

Lines changed: 2 additions & 4 deletions
@@ -1,9 +1,9 @@
 # Docling Technical Report
 
-<!-- image -->
-
 In this image we can see a cartoon image of a duck holding a paper.
 
+<!-- image -->
+
 Version 1.0
 
 Christoph Auer Maksym Lysak Ahmed Nassar Michele Dolfi Nikolaos Livathinos Panos Vagenas Cesar Berrospi Ramis Matteo Omenetti Fabian Lindlbauer Kasper Dinkla Lokesh Mishra Yusik Kim Shubham Gupta Rafael Teixeira de Lima Valery Weber Lucas Morin Ingmar Meijer Viktor Kuropiatnyk Peter W. J. Staar
@@ -33,8 +33,6 @@ Table 1: Runtime characteristics of Docling with the standard model pipeline and
 | Apple M3 Max | 4 | 177 s 167 s | 1.27 1.34 | 6.20 GB | 103 s 92 s | 2.18 2.45 | 2.56 GB |
 | (16 cores) Intel(R) Xeon E5-2690 | 16 4 16 | 375 s 244 s | 0.60 0.92 | 6.16 GB | 239 s 143 s | 0.94 1.57 | 2.42 GB |
 
-{'summary': 'Typical Docling setup runtime characterization.', 'type': 'performance data'}
-
 ## 5 Applications
 
 Thanks to the high-quality, richly structured document conversion achieved by Docling, its output qualifies for numerous downstream applications. For example, Docling can provide a base for detailed enterprise document search, passage retrieval or classification use-cases, or support knowledge extraction pipelines, allowing specific treatment of different structures in the document, such as tables, figures, section structure or references. For popular generative AI application patterns, such as retrieval-augmented generation (RAG), we provide quackling , an open-source package which capitalizes on Docling's feature-rich document output to enable document-native optimized vector embedding and chunking. It plugs in seamlessly with LLM frameworks such as LlamaIndex [8]. Since Docling is fast, stable and cheap to run, it also makes for an excellent choice to build document-derived datasets. With its powerful table structure recognition, it provides significant benefit to automated knowledge-base construction [11, 10]. Docling is also integrated within the open IBM data prep kit [6], which implements scalable data transforms to build large-scale multi-modal training datasets.

test/data/doc/2408.09869v3_enriched_p1_mark_meta_true.gt.md

Lines changed: 4 additions & 4 deletions
@@ -1,9 +1,9 @@
 # Docling Technical Report
 
-<!-- image -->
-
 [Description] In this image we can see a cartoon image of a duck holding a paper.
 
+<!-- image -->
+
 Version 1.0
 
 Christoph Auer Maksym Lysak Ahmed Nassar Michele Dolfi Nikolaos Livathinos Panos Vagenas Cesar Berrospi Ramis Matteo Omenetti Fabian Lindlbauer Kasper Dinkla Lokesh Mishra Yusik Kim Shubham Gupta Rafael Teixeira de Lima Valery Weber Lucas Morin Ingmar Meijer Viktor Kuropiatnyk Peter W. J. Staar
@@ -22,6 +22,8 @@ With Docling , we open-source a very capable and efficient document conversion t
 
 torch runtimes backing the Docling pipeline. We will deliver updates on this topic at in a future version of this report.
 
+[Docling Legacy Misc] {'summary': 'Typical Docling setup runtime characterization.', 'type': 'performance data'}
+
 summary: Typical Docling setup runtime characterization.
 type: performance data
 
@@ -33,8 +35,6 @@ Table 1: Runtime characteristics of Docling with the standard model pipeline and
 | Apple M3 Max | 4 | 177 s 167 s | 1.27 1.34 | 6.20 GB | 103 s 92 s | 2.18 2.45 | 2.56 GB |
 | (16 cores) Intel(R) Xeon E5-2690 | 16 4 16 | 375 s 244 s | 0.60 0.92 | 6.16 GB | 239 s 143 s | 0.94 1.57 | 2.42 GB |
 
-[Docling Legacy Misc] {'summary': 'Typical Docling setup runtime characterization.', 'type': 'performance data'}
-
 ## 5 Applications
 
 Thanks to the high-quality, richly structured document conversion achieved by Docling, its output qualifies for numerous downstream applications. For example, Docling can provide a base for detailed enterprise document search, passage retrieval or classification use-cases, or support knowledge extraction pipelines, allowing specific treatment of different structures in the document, such as tables, figures, section structure or references. For popular generative AI application patterns, such as retrieval-augmented generation (RAG), we provide quackling , an open-source package which capitalizes on Docling's feature-rich document output to enable document-native optimized vector embedding and chunking. It plugs in seamlessly with LLM frameworks such as LlamaIndex [8]. Since Docling is fast, stable and cheap to run, it also makes for an excellent choice to build document-derived datasets. With its powerful table structure recognition, it provides significant benefit to automated knowledge-base construction [11, 10]. Docling is also integrated within the open IBM data prep kit [6], which implements scalable data transforms to build large-scale multi-modal training datasets.

test/data/doc/barchart.gt.md

Lines changed: 2 additions & 2 deletions
@@ -1,3 +1,5 @@
+Bar chart
+
 <!-- image -->
 
 | Number of impellers | single-frequency | multi-frequency |
@@ -8,5 +10,3 @@
 | 4 | 0.14 | 0.26 |
 | 5 | 0.16 | 0.25 |
 | 6 | 0.24 | 0.24 |
-
-Bar chart

test/data/doc/dummy_doc.yaml.md

Lines changed: 4 additions & 4 deletions
@@ -1,9 +1,5 @@
 # DocLayNet: A Large Human-Annotated Dataset for Document-Layout Analysis
 
-Figure 1: Four examples of complex page layouts across different document categories
-
-<!-- image -->
-
 ...
 
 Bar chart
@@ -12,6 +8,10 @@ CC1=NNC(C2=CN3C=CN=C3C(CC3=CC(F)=CC(F)=C3)=N2)=N1
 
 {'myanalysis': {'prediction': 'abc'}, 'something_else': {'text': 'aaa'}}
 
+Figure 1: Four examples of complex page layouts across different document categories
+
+<!-- image -->
+
 A description annotation for this table.
 
 {'foo': 'bar'}

test/data/doc/group_with_metadata_default.md

Lines changed: 2 additions & 2 deletions
@@ -8,10 +8,10 @@ This is some introductory text.
 
 This section talks about foo.
 
-Regarding foo...
-
 This paragraph provides more details about foo.
 
+Regarding foo...
+
 Here some foo specifics are listed.
 
 1. lorem

test/data/doc/group_with_metadata_marked.md

Lines changed: 2 additions & 2 deletions
@@ -8,10 +8,10 @@ This is some introductory text.
 
 [Summary] This section talks about foo.
 
-Regarding foo...
-
 [Summary] This paragraph provides more details about foo.
 
+Regarding foo...
+
 [Summary] Here some foo specifics are listed.
 
 1. lorem

test/test_serialization.py

Lines changed: 5 additions & 4 deletions
@@ -263,14 +263,15 @@ def test_md_list_item_markers():
     )
 
 
-def test_md_include_annotations_false():
+def test_md_legacy_include_annotations_false():
     src = Path("./test/data/doc/2408.09869v3_enriched.json")
     doc = DoclingDocument.load_from_json(src)
 
     ser = MarkdownDocSerializer(
         doc=doc,
         table_serializer=CustomAnnotationTableSerializer(),
         params=MarkdownParams(
+            use_legacy_annotations=True,
             include_annotations=False,
             pages={1, 5},
         ),
@@ -282,14 +283,15 @@ def test_md_include_annotations_false():
     )
 
 
-def test_md_mark_annotations_false():
+def test_md_legacy_mark_annotations_false():
     src = Path("./test/data/doc/2408.09869v3_enriched.json")
     doc = DoclingDocument.load_from_json(src)
 
     ser = MarkdownDocSerializer(
         doc=doc,
         table_serializer=CustomAnnotationTableSerializer(),
         params=MarkdownParams(
+            use_legacy_annotations=True,
             include_annotations=True,
             mark_annotations=False,
             pages={1, 5},
@@ -310,7 +312,6 @@ def test_md_mark_meta_true():
         doc=doc,
         table_serializer=CustomAnnotationTableSerializer(),
         params=MarkdownParams(
-            include_annotations=True,
             mark_meta=True,
             pages={1, 5},
         ),
@@ -322,7 +323,7 @@ def test_md_mark_meta_true():
     )
 
 
-def test_md_use_legacy_annotations_true_mark_annotations_true():
+def test_md_legacy_mark_annotations_true():
     src = Path("./test/data/doc/2408.09869v3_enriched.json")
     doc = DoclingDocument.load_from_json(src)
 