Question about General Chunking Behavior for Markdown Documents #13738

Sitri-u · 2026-03-23T06:12:29Z

Hello,

I am currently using system version 0.24.0 to build a dataset, and I am experiencing some confusion regarding how the system handles text splitting. I would greatly appreciate your guidance on this matter.

My current setup and expectations are as follows:

Splitting Method: General
Document Format: Markdown
Text Structure: Each section uses # Section: as a title prefix. The # symbol appears only within titles and does not appear in the main body of the text.
Section Identifier: # (or # Section:)
Suggested Chunk Size: I have experimented with various values, ranging from 500 down to 7
Expected Behavior: I would like the document to be split strictly by section (i.e., each block starting with # Section: should form one complete chunk), rather than being split based on token length.
Observed Behavior: The title (e.g., # Section: ...) is separated into its own standalone chunk, while the following body content is further split into multiple chunks. This occurs even when the "Suggested Chunk Size" is set to a very small value (e.g., 7).

Additionally, I would like to mention that in a previous version (0.22.0), I was able to achieve the desired behavior by using # as the only delimiter and setting a small chunk size (e.g., around 20). In that version, the system correctly split the document according to section boundaries without breaking the semantic integrity of each section. However, this behavior no longer seems to work in version 0.24.0.

Based on the situation described above, I would like to ask:

1. In the "General" splitting method, what is the priority relationship between the "Section Identifier" and the "Suggested Chunk Size"? Is it possible to configure or control this priority?
2. Does the current system enforce a mandatory token-length-based splitting mechanism, even when a "Section Identifier" is specified?
3. Is there a supported mode for splitting strictly based on section identifiers (i.e., one section = one complete chunk, without secondary splitting)?
4. For structured Markdown documents (titles followed by body content), is there a recommended best-practice configuration for chunking?

My use case requires preserving the semantic integrity of each section for downstream retrieval and QA tasks. Therefore, splitting titles and their corresponding content into separate chunks is not desirable.

If this behavior is expected by design, I would also appreciate any recommended alternative approaches or workarounds.

Thank you very much for your support.

Best regards,
Sitri LIU

Sitri-u · 2026-03-23T06:19:33Z

dosubot[bot]
bot Mar 23, 2026

Hi @llllxxyy! I'm Dosu and I’m helping the ragflow team.

Hi Sitri,

Great question! The behavior change you're experiencing is actually due to a significant update between versions 0.22.0 and 0.24.0.

What changed: PR #11434 introduced a fundamental change where custom delimiters wrapped in backticks now completely bypass chunk_token_num. This means when you use a custom delimiter like `# Section:`, the system splits only on that delimiter without any token-based merging.

The solution for your use case: Try wrapping your section identifier in backticks in the delimiter field:

`# Section:`

With this format, RAGFlow should split your document strictly by section boundaries, keeping each # Section: block as a complete chunk without secondary splitting based on token length.
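
As a rough illustration of what "split strictly by section boundaries" means, the sketch below cuts a document at every line that starts with the delimiter and keeps each title together with its body. It is a standalone Python model, not RAGFlow's code: `split_on_delimiter` is a hypothetical helper, and keeping the delimiter attached to the section that follows it is an assumption made for the example.

```python
import re


def split_on_delimiter(text: str, delimiter: str = "# Section:") -> list[str]:
    """Split text so each chunk starts at a delimiter line and runs to the next one.

    Hypothetical helper for illustration only; not part of RAGFlow.
    """
    # A zero-width lookahead splits *before* the delimiter, so the title
    # stays attached to the body that follows it.
    pattern = r"(?=^" + re.escape(delimiter) + r")"
    pieces = re.split(pattern, text, flags=re.MULTILINE)
    # Drop the empty piece produced when the document begins with a delimiter.
    return [piece.strip() for piece in pieces if piece.strip()]
```

Applied to a document with two `# Section:` blocks, this returns two chunks, each beginning with its own title line, regardless of how long the body is.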

To answer your specific questions:

1. Priority relationship: When using custom delimiters (in backticks), chunk_token_num is completely ignored. Without backticks, the system first splits by delimiters, then merges small sections based on token count.
2. Mandatory token-based splitting: No. With backticked custom delimiters, token-based merging is bypassed entirely.
3. Strict section-based splitting: Yes, use custom delimiters wrapped in backticks (e.g., `# Section:`).
4. Best practice for structured Markdown: Use header markers as custom delimiters in backticks. If you have multiple heading levels, list delimiters in descending order by length (e.g., `###`, `##`, `#`).
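
To make the "split, then merge by token count" behavior concrete, here is a simplified model of a greedy merge step. It is a sketch under stated assumptions, not RAGFlow's implementation: `merge_by_budget` is a hypothetical function, tokens are approximated by whitespace-separated words, and pieces are joined with a newline.

```python
def merge_by_budget(pieces: list[str], chunk_token_num: int) -> list[str]:
    """Greedily merge consecutive pieces while the running token count fits the budget.

    Toy model for illustration only; not RAGFlow's merging code.
    """
    chunks: list[str] = []
    current = ""
    current_tokens = 0
    for piece in pieces:
        n = len(piece.split())  # crude whitespace token count (an assumption)
        # Flush the running chunk if adding this piece would exceed the budget.
        if current and current_tokens + n > chunk_token_num:
            chunks.append(current)
            current, current_tokens = "", 0
        current = piece if not current else current + "\n" + piece
        current_tokens += n
    if current:
        chunks.append(current)
    return chunks
```

In this toy model the budget, not the section structure, decides where chunks end. With the pieces `["# Section: A", "alpha beta gamma", "delta epsilon", "# Section: B", "zeta"]`, a budget of 1 leaves every piece, including each title, as its own chunk, while a budget of 7 merges the tail of section A with the title of section B. Neither setting reproduces one chunk per section, which is why bypassing the merge step matters for this use case.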

Important notes:

Multi-character delimiters MUST be wrapped in backticks.
Changes to delimiter settings only affect newly uploaded files, so you'll need to re-upload after changing the configuration.
Colons (:) are hardcoded delimiters that may cause unexpected splits in some cases.
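
The advice on backticks and descending-length ordering can be sketched as follows. This is a hedged illustration rather than RAGFlow's parser: `build_delimiter_pattern` is a hypothetical helper, and the assumption is that the delimiter field is a string in which each multi-character delimiter appears between a pair of backticks.

```python
import re


def build_delimiter_pattern(field: str) -> str:
    """Build a regex alternation from the backtick-wrapped delimiters in `field`.

    Hypothetical helper for illustration only; not part of RAGFlow.
    """
    # Collect everything that sits between a pair of backticks.
    custom = re.findall(r"`([^`]+)`", field)
    # Longest first: regex alternation takes the first alternative that
    # matches, so "#" listed before "###" would shadow the longer marker.
    custom.sort(key=len, reverse=True)
    return "|".join(re.escape(delimiter) for delimiter in custom)
```

With the field "`#`, `##`, `###`", the sorted pattern matches `###` as a single delimiter at the start of a level-3 heading. An unsorted alternation beginning with `#` would instead match three separate one-character delimiters.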

If the backtick approach doesn't work for your specific setup, you might also try the "qa" chunking method, which treats each delimited section as an individual chunk without any merging behavior.

1 reply

Sitri-u Mar 23, 2026
Author

Hi @dosu,

Thank you so much for providing such a clear and detailed explanation!

In my case, the approach of wrapping the separator (e.g., # Section:) in backticks worked perfectly. The content now splits precisely along the section boundaries, and the issues I was encountering previously have been completely resolved.

Additionally, I really appreciate your explanation of the behavioral changes between versions 0.22.0 and 0.24.0. It helped me better understand the underlying logic.

Thanks again for your support!
