Reading order for pseudo-OCR pre-training task

I would like to train the base model for a few more epochs on the pre-training pseudo-OCR task using a custom dataset. In what reading order should the individual words of the document image be passed to the model? The Donut paper states:

> The model is trained to read all texts in the image in reading order (from top-left to bottom-right, basically). [...] This task can be interpreted as a pseudo-OCR task.

What does "top-left to bottom-right" mean for multi-column text? For instance, consider the attached dummy document with one heading and two text columns:
![000a_readingorder](https://github.com/user-attachments/assets/076e2ff6-6b5f-4278-ae6b-a2e233296ccf)
Should the document be transcribed as:

- _Word1 Col1w1 Col1w2 Col2w1 Col2w2_, or
- _Word1 Col1w1 Col2w1 Col1w2 Col2w2_ ?

I imagine that any dataset used for the pseudo-OCR pre-training task should adopt the same reading-order policy as the pre-trained Donut base model. Unfortunately, I cannot find any information on the exact implementation of "top-left to bottom-right" in the paper, the paper supplement, or the source code. After all, "top-left to bottom-right" can be interpreted in different ways:
- top-to-bottom, left-to-right
- left-to-right, top-to-bottom
- clustering of words into text blocks to mimic semantically meaningful text paragraphs
- etc.
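
To make these interpretations concrete, here is a minimal Python sketch of how each policy would order the words of the dummy document. Everything in it is an assumption for illustration: the `Word` class, the bounding-box coordinates, the `block` ids (standing in for the output of some layout analysis step), and the function names are hypothetical and not taken from the Donut code base.

```python
from dataclasses import dataclass


@dataclass
class Word:
    text: str
    x: float    # left edge of the word's bounding box (hypothetical value)
    y: float    # top edge of the word's bounding box (hypothetical value)
    block: int  # hypothetical text-block id from a layout analysis step


# Made-up coordinates for the dummy document: one heading, two columns.
words = [
    Word("Col2w2", 300, 140, block=2),
    Word("Col1w1", 50, 100, block=1),
    Word("Word1", 50, 20, block=0),
    Word("Col2w1", 300, 100, block=2),
    Word("Col1w2", 50, 140, block=1),
]


def row_major(words, line_tol=10):
    """Top-to-bottom, left-to-right: bucket words into lines by y, then sort by x."""
    ordered = sorted(words, key=lambda w: (round(w.y / line_tol), w.x))
    return [w.text for w in ordered]


def column_major(words):
    """Left-to-right, top-to-bottom: sort by x first, then by y."""
    ordered = sorted(words, key=lambda w: (w.x, w.y))
    return [w.text for w in ordered]


def block_wise(words, line_tol=10):
    """Order text blocks by their top-left corner, then read each block row-major."""
    blocks = {}
    for w in words:
        blocks.setdefault(w.block, []).append(w)
    ordered_blocks = sorted(
        blocks.values(),
        key=lambda b: (min(w.y for w in b), min(w.x for w in b)),
    )
    return [text for b in ordered_blocks for text in row_major(b, line_tol)]


print(" ".join(row_major(words)))     # Word1 Col1w1 Col2w1 Col1w2 Col2w2
print(" ".join(column_major(words)))  # Word1 Col1w1 Col1w2 Col2w1 Col2w2
print(" ".join(block_wise(words)))    # Word1 Col1w1 Col1w2 Col2w1 Col2w2
```

With these assumed coordinates, the row-major policy yields the second transcription above, while the column-major and block-wise policies both yield the first. The latter two only coincide here because the heading is assumed to be left-aligned with the first column; a centered heading would be read at a different position under the column-major policy.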

@gwkrsrch can you provide any guidance in this regard, please?
