Replies: 2 comments 1 reply
You mean as part of the output, i.e. some sort of line break identifier in markdown or plaintext, such as a markdown comment? I don't know how we would handle this for plain text. For chunking, it makes sense. Let me ask a question though: we already identify and extract page numbers from document metadata. E.g. you might have a PDF whose page metadata accounts for front matter (TOC, references, preface, etc.) with different page numbering. We need to preserve this metadata, since for some types of documents it's important (e.g. academic books, where you want precise references). We need clear semantics that separate the two, or clear heuristics for how to handle this information, but I think having two different values makes more sense.
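To make the "two different values" idea concrete, here is a minimal sketch. `PageRef`, `index`, `label` and `display` are hypothetical names made up for illustration, not existing Kreuzberg types; the assumption is that the physical position in the file and the page number declared in the document metadata are stored side by side.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class PageRef:
    """Hypothetical page locator; not an existing Kreuzberg type."""

    index: int                   # 1-based physical position of the page in the file
    label: Optional[str] = None  # page number as declared in document metadata, e.g. "iv"

    def display(self) -> str:
        # Prefer the document's own label for citations, fall back to the index.
        return self.label if self.label is not None else str(self.index)


# A book whose front matter uses roman numerals: physical page 5 is labelled "1".
pages = [
    PageRef(1, "i"),
    PageRef(2, "ii"),
    PageRef(3, "iii"),
    PageRef(4, "iv"),
    PageRef(5, "1"),
]
```

Keeping both fields means a viewer can jump to the physical page while a citation can still use the printed page number.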
Yes, exactly, as part of the output. Some document processing packages solve this by including an (optional) custom page marker that is easy to parse. Others return content as a list of strings. I think this is separate from page numbers from document metadata. Here are some specific applications that benefit from page locators:
Kreuzberg looks great. Particularly version 4!
I understand that Kreuzberg does not support bounding boxes or spans. However, I do think it should support pages as a coarse locator for both the general extraction pipeline and for chunking.
For the extraction pipeline, it would be great to have the option to either include page separators in the output or return a list with content per page.
For chunking, the chunk output object should include information on pages (either as first_page + last_page, a list or just first_page).
At least for our application, this information is essential so that users can go back to the original source.
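To illustrate what I have in mind, here is a rough sketch in plain Python. Nothing here is Kreuzberg API: the marker format, `join_pages`, `split_pages`, `Chunk` and `chunk_with_pages` are all made-up names, and the fixed-size character chunking is only a stand-in for whatever chunker is actually used.

```python
import re
from bisect import bisect_right
from dataclasses import dataclass

# Assumed marker format; an HTML comment stays invisible in rendered markdown.
PAGE_MARKER = "<!-- page: {n} -->"
_MARKER_RE = re.compile(r"<!-- page: (\d+) -->\n?")


def join_pages(pages: list[str]) -> str:
    """Concatenate per-page text, prefixing each page with a marker line."""
    return "\n".join(
        PAGE_MARKER.format(n=i) + "\n" + text for i, text in enumerate(pages, start=1)
    )


def split_pages(content: str) -> dict[int, str]:
    """Recover {page number: text} from marked output (trailing newlines are trimmed)."""
    parts = _MARKER_RE.split(content)  # [prefix, "1", text1, "2", text2, ...]
    return {int(parts[i]): parts[i + 1].rstrip("\n") for i in range(1, len(parts), 2)}


@dataclass
class Chunk:
    text: str
    first_page: int  # 1-based page containing the first character of the chunk
    last_page: int   # 1-based page containing the last character of the chunk


def chunk_with_pages(pages: list[str], max_chars: int) -> list[Chunk]:
    """Fixed-size character chunks over the joined text, each tagged with its page range."""
    # Record the character offset at which each page starts in the joined text.
    starts, offset = [], 0
    for text in pages:
        starts.append(offset)
        offset += len(text)
    full = "".join(pages)
    chunks = []
    for begin in range(0, len(full), max_chars):
        end = min(begin + max_chars, len(full))
        first = bisect_right(starts, begin)
        last = bisect_right(starts, end - 1)
        chunks.append(Chunk(full[begin:end], first, last))
    return chunks
```

With `first_page` and `last_page` on every chunk, a retrieval UI can link each hit back to the page range it came from.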