Huge collection and performance for qdrant cluster #7984
kimmy-github started this conversation in General
Replies: 2 comments 2 replies
What exactly do you call a backup? Are you using managed/hybrid cloud?
2 replies
Not knowing yet what environment you run in: on big deployments it's recommended to create a disk-level snapshot as the backup. That should be supported by all major cloud providers. Note that I do not mean a Qdrant snapshot, but a disk/filesystem-level snapshot.
0 replies
Recently, I ran into an issue in my Qdrant cluster. One of the collections has grown to nearly 400 GB. Every day at 2:00 a.m., I trigger a backup, and during that process the system reports warnings like these:
qdrant | 2026-01-25T18:03:54.811658Z WARN storage::content_manager::consensus_manager: Failed to apply collection meta operation entry with user error: Bad request: There is no transfer for shard 1 from 3903034538002768 to 7761821500842248
qdrant | 2026-01-25T18:04:03.460369Z WARN storage::content_manager::consensus_manager: Failed to send message to http://10.10.1.2:6335/ with error: Error in closure supplied to transport channel pool: status: Cancelled, message: "Timeout expired", details: [], metadata: MetadataMap { headers: {} }
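To see how often each kind of warning occurs during the backup window, the log lines can be tallied with a short script. This is a minimal sketch that assumes the `container | timestamp LEVEL module: message` layout seen in the two lines above; the category names (`missing_shard_transfer`, `peer_timeout`) are illustrative labels, not Qdrant terminology.

```python
import re
from collections import Counter

# Layout assumed from the quoted lines:
# "<container> | <timestamp> <LEVEL> <module path>: <message>"
LOG_LINE = re.compile(
    r"^(?P<container>\S+)\s*\|\s*"
    r"(?P<timestamp>\S+)\s+"
    r"(?P<level>[A-Z]+)\s+"
    r"(?P<module>[\w:]+):\s+"
    r"(?P<message>.*)$"
)


def parse_line(line):
    """Return the fields of one log line as a dict, or None if it does not match."""
    match = LOG_LINE.match(line.strip())
    return match.groupdict() if match else None


def classify(message):
    """Map a warning message to an illustrative (hypothetical) category label."""
    if "Timeout expired" in message:
        return "peer_timeout"
    if "There is no transfer for shard" in message:
        return "missing_shard_transfer"
    return "other"


def summarize(lines):
    """Count WARN lines per category."""
    counts = Counter()
    for line in lines:
        record = parse_line(line)
        if record and record["level"] == "WARN":
            counts[classify(record["message"])] += 1
    return dict(counts)


sample = [
    "qdrant | 2026-01-25T18:03:54.811658Z WARN storage::content_manager::consensus_manager: "
    "Failed to apply collection meta operation entry with user error: Bad request: "
    "There is no transfer for shard 1 from 3903034538002768 to 7761821500842248",
    "qdrant | 2026-01-25T18:04:03.460369Z WARN storage::content_manager::consensus_manager: "
    "Failed to send message to http://10.10.1.2:6335/ with error: Error in closure supplied "
    'to transport channel pool: status: Cancelled, message: "Timeout expired", details: [], '
    "metadata: MetadataMap { headers: {} }",
]

print(summarize(sample))
```

Comparing the timestamps of the `peer_timeout` entries with the disk metrics would show whether the timeouts line up with the I/O peaks.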
After checking the monitoring data, I noticed that during the backup window the combined disk read and write throughput exceeds 800 MB/s, and disk I/O wait time can peak around 13 ms.
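For a rough sense of scale, the two figures in this post can be combined into a back-of-envelope estimate of how long the disks stay saturated. This sketch assumes decimal units (1 GB = 1000 MB) and that the backup reads the collection once and writes one copy to the same disks; neither assumption is confirmed in the thread, and `estimate_backup_seconds` is a hypothetical helper.

```python
# Inputs taken from the post; the I/O model is an assumption.
COLLECTION_GB = 400        # approximate collection size
COMBINED_MB_PER_S = 800    # combined read + write throughput during the backup


def estimate_backup_seconds(size_gb, combined_mb_per_s, passes=2):
    """Estimate the duration if the data crosses the disk `passes` times.

    passes=2 models one full read plus one full write of the collection.
    """
    total_mb = size_gb * 1000 * passes
    return total_mb / combined_mb_per_s


seconds = estimate_backup_seconds(COLLECTION_GB, COMBINED_MB_PER_S)
print(f"roughly {seconds / 60:.1f} minutes of saturated disk I/O")
```

Under these assumptions the disks are busy for well over ten minutes, which is a long stretch for any cluster traffic that shares the same storage.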
So I have a few questions. First, where is the timeout for consensus_manager configured, and how can I inspect its current value? Second, is a 400 GB collection considered too large for Qdrant in practice? Finally, are there any good ways to tune the system so that backups don't trigger these errors or timeouts?