Hi @llllxxyy! I'm Dosu and I’m helping the ragflow team.
Hi Sitri,
Great question! The behavior change you're experiencing is due to a significant update between versions 0.22.0 and 0.24.0.
What changed: PR #11434 introduced a fundamental change where custom delimiters wrapped in backticks now completely bypass
The solution for your use case: Try wrapping your section identifier in backticks in the delimiter field:
With this format, RAGFlow should split your document strictly by section boundaries, keeping each
To answer your specific questions:
Important notes:
If the backtick approach doesn't work for your specific setup, you might also try the "qa" chunking method, which treats each delimited section as an individual chunk without any merging behavior.
Hello,
I am currently using system version 0.24.0 to build a dataset, and I am experiencing some confusion regarding how the system handles text splitting. I would greatly appreciate your guidance on this matter.
My current setup and expectations are as follows:
My documents use "# Section:" as a title prefix. The "#" symbol appears only within titles and does not appear in the main body of the text.
Delimiter: "#" (or "# Section:").
Expected behavior: splitting on section boundaries (everything under one "# Section:" should form one complete chunk), rather than being split based on token length.
Actual behavior: the title ("# Section: ...") is separated into its own standalone chunk, while the following body content is further split into multiple chunks. This occurs even when the "Suggested Chunk Size" is set to a very small value (e.g., 7).
Additionally, I would like to mention that in a previous version (0.22.0), I was able to achieve the desired behavior by using "#" as the only delimiter and setting a small chunk size (e.g., around 20). In that version, the system correctly split the document according to section boundaries without breaking the semantic integrity of each section. However, this behavior no longer seems to work in version 0.24.0.
Based on the situation described above, I would like to ask:
My use case requires preserving the semantic integrity of each section for downstream retrieval and QA tasks. Therefore, splitting titles and their corresponding content into separate chunks is not desirable.
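To make the expected result concrete, here is a minimal Python sketch of the split I am after. It is a standalone illustration, not RAGFlow code: the function name split_by_section and the sample text are hypothetical, and it assumes every title line starts with the "# Section:" prefix.

```python
def split_by_section(text: str, prefix: str = "# Section:") -> list[str]:
    """Group lines into chunks so that each title line stays with its body.

    Hypothetical helper for illustration only; it does not use or mimic
    RAGFlow's internal chunking code.
    """
    chunks: list[str] = []
    current: list[str] = []
    for line in text.splitlines():
        # A new title line closes the chunk collected so far.
        if line.startswith(prefix) and current:
            chunks.append("\n".join(current).strip())
            current = []
        current.append(line)
    if current:
        chunks.append("\n".join(current).strip())
    # Drop empty chunks (for example, blank lines before the first title).
    return [chunk for chunk in chunks if chunk]


sample = (
    "# Section: Installation\n"
    "Download the package.\n"
    "Run the installer.\n"
    "# Section: Usage\n"
    "Start the service.\n"
)
section_chunks = split_by_section(sample)
```

Each element of section_chunks holds one title together with all of its body lines, regardless of how many tokens the section contains. Any text before the first title would come back as its own leading chunk.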
If this behavior is expected by design, I would also appreciate any recommended alternative approaches or workarounds.
Thank you very much for your support.
Best regards,
Sitri LIU