Self Checks
RAGFlow workspace code commit ID
5923365ef987fb328c977d07c788bc775732bc64bc96140e643a588ddd68ff92
RAGFlow image version
v0.24.0-456-g52442c8eb
Other environment information
- **Hardware**: Kubernetes (bare-metal, Talos Linux), 8 CPU / 16 GiB RAM allocated to Infinity StatefulSet
- **OS type**: Talos Linux (Kubernetes nodes), containers on Debian-based images
- **Document engine**: Infinity v0.7.0-dev5 (`infiniflow/infinity:v0.7.0-dev5`)
- **LLM**: Azure OpenAI (gpt-5.4), Embedding: text-embedding-3-large
- **Dataset**: 28 documents (JSON schemas, ~584 bytes to multi-MB), KB ID `774ce450198711f1b748d19e18b9e406`
- **Infinity config**: `connection_pool_size = 512` (bumped from default 128)
Actual behavior
When running dataset-scope GraphRAG with `resolution: true` and `community: true` across 28 documents, the graph resolution step fails with:
```
20:31:59 Resolved 5694 candidate pairs, 333 of them are selected to merge.
20:32:31 Graph resolution removed 294 nodes and 1808 edges.
20:32:31 Graph resolution updated pagerank.
20:32:31 [ERROR][Exception]: (<ErrorCode.TOO_MANY_CONNECTIONS: 5003>, 'Try 10 times, but still failed')
```
The error occurs consistently at the end of graph resolution, after the merge/delete/pagerank operations complete. The resolution itself succeeds (294 nodes removed, 1808 edges removed, pagerank updated), but the subsequent Infinity write operations (likely the community detection step or final graph persist) exhaust the connection pool.
Prometheus metrics from the Infinity pod during the failure:
| Time (CDT) | Infinity CPU (cores) | Notes |
|---|---|---|
| 18:02 | 8.0 | Peak CPU — 2x the 4-core limit at the time, severe CFS throttling |
| 20:12 | 5.8 | Still saturated during graph resolution |
| 20:27 | 2.7 | Resolution winding down |
| 20:32 | 0.001 | Infinity goes idle — RAGFlow client gave up after TOO_MANY_CONNECTIONS |
The root cause appears to be Kubernetes CFS throttling of the Infinity container compounding with the bursty nature of graph resolution writes. When Infinity is CPU-throttled, connections queue up server-side waiting for CPU time slices. The resolution burst (333 merges → 294 node deletes + 1808 edge deletes + pagerank update) causes queued connections to exceed the `connection_pool_size` limit.
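To confirm CFS throttling is in play, the cgroup v2 `cpu.stat` file inside the Infinity container exposes throttle counters. A small diagnostic parser (a sketch, assuming cgroup v2; not part of RAGFlow):

```python
def parse_cpu_stat(text):
    """Parse cgroup v2 cpu.stat contents into a dict of integers.

    Relevant keys: nr_throttled (number of CFS periods in which the
    cgroup was throttled) and throttled_usec (total time spent throttled).
    """
    stats = {}
    for line in text.splitlines():
        key, _, value = line.partition(" ")
        if value:
            stats[key] = int(value)
    return stats

# e.g. capture the file from inside the pod:
#   kubectl exec <infinity-pod> -- cat /sys/fs/cgroup/cpu.stat
```

A steadily climbing `nr_throttled` during graph resolution would corroborate the CPU-starvation hypothesis above.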
Key observations:
- The RAGFlow client-side connection pool (`infinity_conn_pool.py`) is hardcoded to `max_size=4` (initial) / `max_size=32` (on refresh) — there is no user-configurable option to increase this
- The server-side `connection_pool_size` of 512 was already 4x the default of 128, but was still insufficient
- The client retries 10 times before giving up, suggesting the connection backlog persists for an extended period
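For reference, the observed failure mode behaves like a fixed-count acquire loop (a hypothetical simplification of the client behavior, not RAGFlow's actual code):

```python
import time

class TooManyConnections(Exception):
    """Stand-in for Infinity's ErrorCode.TOO_MANY_CONNECTIONS (5003)."""

def acquire_with_retries(pool_acquire, retries=10, delay=1.0):
    # Sketch: try a fixed number of times with a constant delay, then
    # give up -- matching the observed "Try 10 times, but still failed".
    for attempt in range(retries):
        try:
            return pool_acquire()
        except TooManyConnections:
            time.sleep(delay)
    raise TooManyConnections(f"Try {retries} times, but still failed")
```

With a constant short delay, the whole retry window is only a few seconds — far less than the minutes the server stays saturated in the metrics above, which is why all 10 attempts fail.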
Expected behavior
Graph resolution and community detection should complete successfully without hitting connection limits, or should gracefully handle connection exhaustion (e.g., exponential backoff with longer retry windows, or serializing the write operations to reduce concurrent connection demand).
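One possible shape for the graceful-handling path (a sketch, not RAGFlow's implementation): capped exponential backoff with jitter, so the retry window spans minutes rather than seconds and a throttled server gets time to drain its backlog:

```python
import random
import time

def acquire_with_backoff(pool_acquire, is_exhausted, retries=10,
                         base=0.5, cap=60.0):
    """Retry with capped exponential backoff plus jitter.

    pool_acquire: callable returning a connection.
    is_exhausted: predicate deciding whether an exception means the
    server pool is exhausted (retryable) or something else (fatal).
    """
    for attempt in range(retries):
        try:
            return pool_acquire()
        except Exception as exc:
            if not is_exhausted(exc) or attempt == retries - 1:
                raise
            # Sleep 0.5s, 1s, 2s, ... capped at `cap`, with jitter to
            # avoid synchronized retry storms from parallel workers.
            delay = min(cap, base * (2 ** attempt))
            time.sleep(delay * random.uniform(0.5, 1.0))
```

With these defaults the cumulative window is on the order of minutes, which would have outlasted the ~5-minute saturation window seen in the metrics table.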
Steps to reproduce
1. Create a knowledge base with 28+ documents (JSON schema files work well to reproduce)
2. Enable GraphRAG with `resolution: true` and `community: true` at the dataset level
3. Enable RAPTOR with `scope: dataset`
4. Parse all documents
5. Trigger dataset-scope GraphRAG (resolution + community)
6. Observe the Knowledge Graph progress panel: resolution resolves ~5000+ candidate pairs, then fails with `TOO_MANY_CONNECTIONS: 5003`
7. The Infinity `connection_pool_size` is 512 (4x default)
Additional information
**Workaround**: Increase Infinity CPU limits to 8 cores and increase `connection_pool_size` to 2048. This gives Infinity enough CPU headroom to process connections without CFS throttling causing stale connection pile-up.
**Feature request**: Expose the RAGFlow-side Infinity client connection pool size (`max_size` in `ConnectionPool()`) as an environment variable so operators can tune it independently of the server-side pool. Currently it is hardcoded in `common/doc_store/infinity_conn_pool.py`:

```python
conn_pool = ConnectionPool(self.infinity_uri, max_size=4)        # initial
self.conn_pool = ConnectionPool(self.infinity_uri, max_size=32)  # on refresh
```
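A minimal sketch of the requested change (the env-var names are suggestions, not existing RAGFlow settings):

```python
import os

# Hypothetical env vars; RAGFlow does not currently read these.
def client_pool_sizes():
    """Return (initial, refresh) max_size values, env-overridable."""
    initial = int(os.environ.get("INFINITY_CLIENT_POOL_INITIAL", "4"))
    refresh = int(os.environ.get("INFINITY_CLIENT_POOL_REFRESH", "32"))
    return initial, refresh

# Usage inside infinity_conn_pool.py would then become:
#   initial, refresh = client_pool_sizes()
#   conn_pool = ConnectionPool(self.infinity_uri, max_size=initial)
#   ...
#   self.conn_pool = ConnectionPool(self.infinity_uri, max_size=refresh)
```

Defaults preserve today's hardcoded 4/32 behavior, so the change would be backward compatible.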
Related issues: #8706, #12006