
[Feature Request]: RAPTOR indexing needs checkpointing or per-document granularity - Single API timeout causes entire task failure #11483

@chg387387

Description

Self Checks

  • I have searched for existing issues, including closed ones.
  • I confirm that I am using English to submit this report (Language Policy).
  • Non-English title submissions will be closed directly (Language Policy).
  • Please do not modify this template :) and fill in all the required fields.

Is your feature request related to a problem?

The problem is that RAPTOR indexing currently runs as a single, long-running monolithic task for the whole batch. When using remote APIs (like SiliconFlow) for large Knowledge Bases (100+ files), the process can take 10+ hours. If a network fluctuation occurs at the 5th hour, the entire job fails, and all progress is lost ("all-or-nothing"). There is no checkpointing or per-document granularity, making it extremely difficult and expensive to use RAPTOR on large datasets.

Describe the feature you'd like

I propose making the RAPTOR task execution more granular. Instead of treating the entire RAPTOR generation as one giant task, please implement one of the following:

Per-Document Checkpointing: Save the RAPTOR index progress after each document or small batch is processed. If the task fails, allow resuming from the last successful document instead of restarting from zero.

Independent Tasks: Decouple RAPTOR generation so it runs independently for each file (similar to how file parsing works). If one file fails due to an API timeout, only that specific file should be marked as 'Error', while others continue to completion.

Better Error Handling: If an API call fails mid-process, the system should pause or retry that specific chunk, rather than crashing the entire batch job.
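To make the error-handling point concrete, here is a minimal sketch of retrying a single summarization call with exponential backoff so one timeout only affects that chunk, not the whole batch. The `call_llm` parameter is a hypothetical stand-in for whatever client function sends the prompt to the remote API (e.g. SiliconFlow); the retry counts and delays are arbitrary and this is not existing RAGFlow code.

```python
import logging
import time

logger = logging.getLogger(__name__)


def summarize_with_retry(call_llm, prompt, max_retries=3, base_delay=5.0):
    """Retry one summarization call so a transient timeout does not fail the batch.

    `call_llm` is a hypothetical placeholder for the client function that
    sends a prompt to the remote LLM API and returns the generated text.
    """
    for attempt in range(1, max_retries + 1):
        try:
            return call_llm(prompt)
        except Exception as exc:  # timeouts, rate limits, transient network errors
            if attempt == max_retries:
                raise  # give up only after exhausting retries for this chunk
            delay = base_delay * 2 ** (attempt - 1)
            logger.warning(
                "LLM call failed (attempt %d/%d): %s; retrying in %.0fs",
                attempt, max_retries, exc, delay,
            )
            time.sleep(delay)
```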

Describe implementation you've considered

I strongly suggest reverting to the previous logic where RAPTOR was processed per individual file during the parsing stage, which isolated failures to specific documents.

Alternatively, if batch processing is necessary, implementing intermediate checkpoints is crucial. If an error occurs (e.g., API timeout after 5 hours), restarting the task must resume from the last successful point rather than starting over from scratch. Without these mechanisms, the current implementation is too fragile for production use with unstable APIs.
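As a rough sketch of the checkpointing idea (not the actual RAGFlow task code), progress could be persisted per document so that a rerun skips everything already built. The checkpoint file name, the `documents` iterable of `(doc_id, chunks)` pairs, and `build_tree_for_doc` are all hypothetical placeholders for illustration:

```python
import json
from pathlib import Path

CHECKPOINT = Path("raptor_checkpoint.json")  # hypothetical checkpoint location


def load_done() -> set:
    """Return the ids of documents whose RAPTOR trees were already built."""
    if CHECKPOINT.exists():
        return set(json.loads(CHECKPOINT.read_text()))
    return set()


def build_raptor(documents, build_tree_for_doc):
    """Build RAPTOR trees one document at a time, checkpointing after each.

    `documents` yields (doc_id, chunks) pairs and `build_tree_for_doc`
    stands in for the per-document tree construction; both are assumed
    names, not existing RAGFlow APIs.
    """
    done = load_done()
    for doc_id, chunks in documents:
        if doc_id in done:
            continue  # resume: skip documents finished in an earlier run
        try:
            build_tree_for_doc(doc_id, chunks)
        except Exception as exc:
            # isolate the failure: report this file, keep processing the rest
            print(f"RAPTOR failed for {doc_id}: {exc}; continuing with remaining files")
            continue
        done.add(doc_id)
        CHECKPOINT.write_text(json.dumps(sorted(done)))  # persist progress immediately
```

With something along these lines, a failure after many hours would cost only the in-flight document rather than the entire batch.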

Documentation, adoption, use case

Environment: RAGFlow v0.22.0 / v0.22.1
LLM Service: SiliconFlow API (DeepSeek/Qwen)
Dataset: 100+ PDF documents

Impact: This issue is particularly critical for users relying on paid API tokens. When a task fails after 6 hours due to a single timeout, 6 hours' worth of API tokens are permanently wasted with no result. The lack of resilience makes RAPTOR currently viable only for very small KBs or local models, limiting its potential for serious academic or enterprise use cases.

Additional information

No response

Metadata

Labels

💞 feature: Feature request, pull request that fulfills a new feature.
