
[Bug]: GraphRAG resolution fails with TOO_MANY_CONNECTIONS (Infinity ErrorCode 5003) during dataset-scope operations #14137

@prpercival

Description

Self Checks

  • I have searched for existing issues, including closed ones.
  • I confirm that I am using English to submit this report (Language Policy).
  • Non-English title submissions will be closed directly (Language Policy).
  • Please do not modify this template :) and fill in all the required fields.

RAGFlow workspace code commit ID

5923365ef987fb328c977d07c788bc775732bc64bc96140e643a588ddd68ff92

RAGFlow image version

v0.24.0-456-g52442c8eb

Other environment information

- **Hardware**: Kubernetes (bare-metal, Talos Linux), 8 CPU / 16 GiB RAM allocated to Infinity StatefulSet
- **OS type**: Talos Linux (Kubernetes nodes), containers on Debian-based images
- **Document engine**: Infinity v0.7.0-dev5 (`infiniflow/infinity:v0.7.0-dev5`)
- **LLM**: Azure OpenAI (gpt-5.4), Embedding: text-embedding-3-large
- **Dataset**: 28 documents (JSON schemas, ~584 bytes to multi-MB), KB ID `774ce450198711f1b748d19e18b9e406`
- **Infinity config**: `connection_pool_size = 512` (bumped from default 128)

Actual behavior

When running dataset-scope GraphRAG with `resolution: true` and `community: true` across 28 documents, the graph resolution step fails with:

20:31:59 Resolved 5694 candidate pairs, 333 of them are selected to merge.
20:32:31 Graph resolution removed 294 nodes and 1808 edges.
20:32:31 Graph resolution updated pagerank.
20:32:31 [ERROR][Exception]: (<ErrorCode.TOO_MANY_CONNECTIONS: 5003>, 'Try 10 times, but still failed')

The error occurs consistently at the end of graph resolution, after the merge/delete/pagerank operations complete. The resolution itself succeeds (294 nodes removed, 1808 edges removed, pagerank updated), but the subsequent Infinity write operations (likely the community detection step or final graph persist) exhaust the connection pool.

Prometheus metrics from the Infinity pod during the failure:

| Time (CDT) | Infinity CPU (cores) | Notes |
|------------|----------------------|-------|
| 18:02 | 8.0 | Peak CPU (2x the 4-core limit at the time); severe CFS throttling |
| 20:12 | 5.8 | Still saturated during graph resolution |
| 20:27 | 2.7 | Resolution winding down |
| 20:32 | 0.001 | Infinity goes idle; RAGFlow client gave up after TOO_MANY_CONNECTIONS |

The root cause appears to be Kubernetes CFS throttling of the Infinity container, compounded by the bursty nature of graph resolution writes. When Infinity is CPU-throttled, connections queue up server-side waiting for CPU time slices. The resolution burst (333 merges → 294 node deletes + 1808 edge deletes + pagerank update) pushes the number of queued connections past the connection_pool_size limit.
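As a rough illustration of this failure mode (the rates below are illustrative assumptions, not measured values): if a throttled server drains connections more slowly than a burst opens them, the backlog grows linearly with burst duration and eventually exceeds `connection_pool_size`.

```python
# Back-of-envelope model of connection backlog under CPU throttling.
# All rates here are illustrative assumptions, not measurements.

def backlog(arrival_rate: float, service_rate: float, seconds: float) -> float:
    """Connections still queued after a burst: arrivals minus completions, floored at 0."""
    return max(0.0, (arrival_rate - service_rate) * seconds)

# Suppose a throttled Infinity drains ~20 conn/s while the resolution
# burst opens ~60 conn/s for 15 seconds:
queued = backlog(arrival_rate=60, service_rate=20, seconds=15)
print(queued)  # 600.0, already past connection_pool_size = 512
```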

Key observations:

  • The RAGFlow client-side connection pool (infinity_conn_pool.py) is hardcoded to max_size=4 (initial) / max_size=32 (on refresh) — there is no user-configurable option to increase this
  • The server-side connection_pool_size of 512 was already 4x the default of 128, but was still insufficient
  • The error retries 10 times before giving up, suggesting the connection backlog persists for an extended period

Expected behavior

Graph resolution and community detection should complete successfully without hitting connection limits, or should gracefully handle connection exhaustion (e.g., exponential backoff with longer retry windows, or serializing the write operations to reduce concurrent connection demand).

Steps to reproduce

1. Create a knowledge base with 28+ documents (JSON schema files work well to reproduce)
2. Enable GraphRAG with `resolution: true` and `community: true` at the dataset level
3. Enable RAPTOR with `scope: dataset`
4. Parse all documents
5. Trigger dataset-scope GraphRAG (resolution + community)
6. Observe the Knowledge Graph progress panel: resolution resolves ~5000+ candidate pairs, then fails with `TOO_MANY_CONNECTIONS: 5003`
7. Note that in this setup the Infinity `connection_pool_size` is already 512 (4x the default)

Additional information

Workaround: Increase Infinity CPU limits to 8 cores and increase connection_pool_size to 2048. This gives Infinity enough CPU headroom to process connections without CFS throttling causing stale connection pile-up.
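For reference, the workaround corresponds to something like the following. Treat these as hedged sketches: the exact placement of the key in `infinity_conf.toml` may differ by Infinity version (check the sample config shipped with v0.7.0), and the StatefulSet layout is specific to this cluster.

```toml
# infinity_conf.toml: server-side connection pool
# (key placement may vary by Infinity version)
[network]
connection_pool_size = 2048
```

```yaml
# Kubernetes StatefulSet container resources: 8 cores of headroom
# so CFS throttling does not stall connection draining during bursts
resources:
  requests:
    cpu: "8"
    memory: 16Gi
  limits:
    cpu: "8"
    memory: 16Gi
```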

Feature request: Expose the RAGFlow-side Infinity client connection pool size (max_size in ConnectionPool()) as an environment variable so operators can tune it independently of the server-side pool. Currently it is hardcoded in common/doc_store/infinity_conn_pool.py:

conn_pool = ConnectionPool(self.infinity_uri, max_size=4)     # initial
self.conn_pool = ConnectionPool(self.infinity_uri, max_size=32)  # on refresh
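A minimal sketch of the requested knob. The environment variable name `INFINITY_CONN_POOL_SIZE` and the helper are assumptions for illustration, not existing RAGFlow configuration:

```python
import os

def pool_max_size(default: int) -> int:
    """Hypothetical env-var override for the client-side pool size;
    falls back to the current hardcoded defaults (4 initially, 32 on refresh)."""
    raw = os.environ.get("INFINITY_CONN_POOL_SIZE", "")
    try:
        return max(1, int(raw))
    except ValueError:
        return default

# Usage at the two call sites in infinity_conn_pool.py would then look like:
# conn_pool = ConnectionPool(self.infinity_uri, max_size=pool_max_size(4))
# self.conn_pool = ConnectionPool(self.infinity_uri, max_size=pool_max_size(32))
```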

Related issues: #8706, #12006

Metadata

Labels

  • ♾️ infinity (pull requests involved with Infinity DB)
  • 🐞 bug (something isn't working; pull requests that fix bugs)
