Suggestion/Question: Include page information in output of extraction and chunking #211

jlegewie · 2025-12-03T16:43:56Z

Kreuzberg looks great. Particularly version 4!

I understand that Kreuzberg does not support bounding boxes or spans. However, I do think it should support pages as a coarse locator for both the general extraction pipeline and for chunking.

For the extraction pipeline, it would be great to have the option to either include page separators in the output or return a list with content per page.
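
To make the two options concrete, here is a minimal sketch in plain Python. It assumes per-page text is already available as a list of strings; the marker format and the names `PAGE_MARKER`, `join_pages`, and `split_pages` are hypothetical illustrations, not part of Kreuzberg's API.

```python
import re

# Hypothetical marker format, chosen only because it is easy to parse.
PAGE_MARKER = "<!-- page {n} -->"
_MARKER_RE = re.compile(r"<!-- page (\d+) -->\n")


def join_pages(pages: list[str]) -> str:
    """Concatenate per-page text, prefixing each page with a 1-based marker."""
    return "".join(
        f"{PAGE_MARKER.format(n=number)}\n{text}\n"
        for number, text in enumerate(pages, start=1)
    )


def split_pages(document: str) -> dict[int, str]:
    """Recover {page_number: text} from a document produced by join_pages."""
    # With one capture group, re.split yields [prefix, "1", text1, "2", text2, ...].
    parts = _MARKER_RE.split(document)
    return {
        int(parts[i]): parts[i + 1].removesuffix("\n")
        for i in range(1, len(parts), 2)
    }
```

A consumer that wants a single string uses the joined form; one that wants content per page can either keep the original list or call `split_pages` on the marked-up text.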

For chunking, the chunk output object should include information on pages (either as first_page + last_page, a list or just first_page).
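
As a rough illustration of the `first_page` + `last_page` variant, the sketch below uses a naive fixed-size character chunker and maps each chunk back to the pages it overlaps. The `PagedChunk` type and `chunk_with_pages` function are hypothetical names for this example and say nothing about how Kreuzberg chunks internally.

```python
from bisect import bisect_right
from dataclasses import dataclass
from itertools import accumulate


@dataclass
class PagedChunk:
    text: str
    first_page: int  # 1-based, inclusive
    last_page: int   # 1-based, inclusive


def chunk_with_pages(pages: list[str], max_chars: int) -> list[PagedChunk]:
    """Split the concatenated pages into fixed-size chunks with page ranges."""
    if max_chars <= 0:
        raise ValueError("max_chars must be positive")
    full_text = "".join(pages)
    # starts[i] is the character offset at which page i + 1 begins.
    starts = [0, *accumulate(len(page) for page in pages)][:-1]
    chunks = []
    for begin in range(0, len(full_text), max_chars):
        end = min(begin + max_chars, len(full_text))
        # The number of page starts at or before an offset is its page number.
        first_page = bisect_right(starts, begin)
        last_page = bisect_right(starts, end - 1)
        chunks.append(PagedChunk(full_text[begin:end], first_page, last_page))
    return chunks
```

A chunk that sits entirely on one page gets `first_page == last_page`, so a UI can still offer a single "open at page" target by using `first_page`.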

At least for our application, this information is essential so that users can go back to the original source.

Goldziher (Maintainer) · 2025-12-03T16:52:48Z

You mean as part of the output, i.e. some sort of break identifier in the Markdown or plain-text output, such as a Markdown comment? I don't know how to handle this for plain text.

For chunking, it makes sense.

Let me ask a question though - we do identify and extract page numbers from document metadata. E.g. you might have a PDF with page metadata that factors in front matter (TOC, references, preface, etc.) with different page numbering. We need to preserve this metadata, since for some types of documents it's important (e.g. academic books, where you want to have precise references).

We need to have clear semantics that distinguish between these, or clear heuristics for how to handle this information - but I think having two different values makes more sense.
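
One way to picture "two different values" is a page reference that carries both the physical position in the file and the label from document metadata. The sketch below is purely illustrative: `PageRef`, `to_roman`, and `label_pages` are hypothetical names, and the numbering scheme (roman numerals for front matter, arabic numerals restarting at 1 afterwards) is an assumed example, not something read from a real PDF.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class PageRef:
    index: int            # 1-based physical position in the file
    label: Optional[str]  # page label from document metadata, if any


def to_roman(number: int) -> str:
    """Lower-case roman numeral for a positive integer."""
    numerals = [
        (1000, "m"), (900, "cm"), (500, "d"), (400, "cd"), (100, "c"),
        (90, "xc"), (50, "l"), (40, "xl"), (10, "x"), (9, "ix"),
        (5, "v"), (4, "iv"), (1, "i"),
    ]
    out = []
    for value, symbol in numerals:
        count, number = divmod(number, value)
        out.append(symbol * count)
    return "".join(out)


def label_pages(total_pages: int, front_matter_pages: int) -> list[PageRef]:
    """Assumed scheme: roman labels for front matter, then arabic from 1."""
    refs = []
    for index in range(1, total_pages + 1):
        if index <= front_matter_pages:
            label = to_roman(index)
        else:
            label = str(index - front_matter_pages)
        refs.append(PageRef(index=index, label=label))
    return refs
```

With both fields present, `index` can drive an "open at page" action while `label` is what a citation would show.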


jlegewie (Author) · 2025-12-03T18:17:21Z

Yes, exactly, as part of the output. Some document processing packages solve this by including an (optional) custom page marker that is easy to parse. Others return content as a list of strings.

I think this is separate from page numbers from document metadata. Here are some specific applications that benefit from page locators:

- When the consumer is an LLM, we pass the document with some markup such as <page number="4">...</page><page number="5">...</page>. With that markup, the model can reference specific pages and we can use that to support an "Open at page" feature.
- For chunking, I want the same. We get specific chunks via retrieval, pass them to the model, and the model can cite specific chunks. I want to show the user both the chunk and "Open at page".
- For agentic applications, passing large documents back to the model can be very token-inefficient. One way to address that is to allow the model to decide which pages to read, which again requires content by page.
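
The first and third use cases can be sketched with a small helper that wraps per-page text in the `<page number="...">` markup and optionally renders only the pages a model asked for. This assumes the per-page strings are already available; `to_page_markup` and its `only` parameter are hypothetical names for illustration.

```python
from typing import Iterable, Optional
from xml.sax.saxutils import escape


def to_page_markup(pages: list[str], only: Optional[Iterable[int]] = None) -> str:
    """Wrap each page in <page number="N"> tags, numbering pages from 1.

    If `only` is given, render just those page numbers (in document order)
    and silently skip numbers that are out of range.
    """
    wanted = None if only is None else set(only)
    return "".join(
        # escape() keeps "<", ">" and "&" in the text from breaking the markup.
        f'<page number="{number}">{escape(text)}</page>'
        for number, text in enumerate(pages, start=1)
        if wanted is None or number in wanted
    )
```

An agent loop could first show the model a table of contents or page count, then call `to_page_markup(pages, only=[4, 5])` to send back only the requested pages.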

1 reply

Goldziher Dec 4, 2025
Maintainer

Ok, yes, I'll add this to the roadmap.

You can open an issue with what you'd like to see.
