[Bug]: StreamingNode crashes in segcore retrieve during concurrent query and segment transfer #50366

@zhuwenxing

Existing Issue

  • I have searched the existing issues

Environment

  • Milvus version: unknown-20260604-825016cbd7
  • Git commit: 825016c
  • Build time: Thu Jun 4 23:15:39 UTC 2026
  • Deployment mode: cluster
  • MQ type: pulsar
  • SDK version: unknown
  • OS: Kubernetes / 4am cluster
  • CPU/Memory: GOMAXPROCS=4, TotalMem=17179869184
  • GPU: N/A
  • Others: namespace chaos-testing, instance pulsar-cluster-reinstall-3801

The K8s pod list at 2026-06-05 16:44:58 UTC showed that the target pod (first row) had restarted 3 times:

pulsar-cluster-reinstall-3801-milvus-streamingnode-59d6c4d4bp78   1/1   Running   3 (18m ago)   32m   10.104.30.192   4am-node38
pulsar-cluster-reinstall-3801-milvus-streamingnode-59d6c4dt52xj   1/1   Running   2 (32m ago)   32m   10.104.14.238   4am-node18

Current Behavior

A StreamingNode pod restarted after running for a while. Loki logs show that the first two restarts were startup-time failures caused by etcd not being ready:

2026/06/05 16:12:10 panic: failed to create etcd client: context deadline exceeded
2026/06/05 16:12:23 panic: failed to create etcd client: context deadline exceeded

The later runtime restart was different. At 2026-06-05 16:26:18 UTC, the process crashed in the C++ segcore retrieve path while filling query result target fields from a growing FloatVector segment.

No Loki evidence was found for OOMKilled, liveness/readiness probe failure, BackOff, or Kubernetes killing the container.

Expected Behavior

StreamingNode should not crash during concurrent query/retrieve and segment load/release/delete-snapshot transfer. It should either complete the query safely or return an error without terminating the process.
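To illustrate the "return an error without terminating the process" behavior, here is a minimal sketch of a bounds-checked gather over a row-major float-vector column. It is not the segcore implementation: `FloatVectorColumn`, `GatherResult`, and `gather_rows` are hypothetical names invented for this example, and the real `bulk_subscript_impl` may fail for a different reason than an out-of-range offset.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical, simplified stand-in for a float-vector column in a growing
// segment: `data` holds rows() * dim floats in row-major order.
struct FloatVectorColumn {
    std::size_t dim;
    std::vector<float> data;

    std::size_t rows() const { return dim == 0 ? 0 : data.size() / dim; }
};

// Hypothetical result type: either the gathered values or an error message.
struct GatherResult {
    bool ok;
    std::string error;
    std::vector<float> values;
};

// Copies the rows selected by `offsets` into a flat output buffer.
// Every offset is validated before it is dereferenced, so a stale or
// out-of-range offset produces an error instead of an invalid memory read.
GatherResult gather_rows(const FloatVectorColumn& col,
                         const std::vector<std::int64_t>& offsets) {
    GatherResult result{true, "", {}};
    result.values.reserve(offsets.size() * col.dim);
    const auto row_count = static_cast<std::int64_t>(col.rows());
    for (std::int64_t off : offsets) {
        if (off < 0 || off >= row_count) {
            return {false,
                    "offset " + std::to_string(off) + " out of range [0, " +
                        std::to_string(row_count) + ")",
                    {}};
        }
        const float* src =
            col.data.data() + static_cast<std::size_t>(off) * col.dim;
        result.values.insert(result.values.end(), src, src + col.dim);
    }
    return result;
}
```

With this shape, the caller can turn `ok == false` into a failed query response while the process keeps serving other requests.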

Steps To Reproduce

A minimal reproducer is not available yet. The observed CI/test workload had the following pattern:

  1. Deploy a Milvus cluster with Pulsar MQ using commit 825016cbd7.
  2. Run concurrent query/search workload while collections are actively receiving inserts/deletes/flushes.
  3. Trigger load/release segment transfer on StreamingNode.
  4. Observe StreamingNode crash during a retrieve/query path.

The relevant workload window involved collection 466793519583008866 and channel by-dev-rootcoord-dml_12_466793519583008866v1.

Milvus Log

Runtime crash context:

[2026/06/05 16:26:17.892 +00:00] [INFO] [querynodev2/services.go:443]
["received load segments request"] [traceID=0a357e11364b4ad94a2ec58fd54df871]
[collectionID=466793519583008866] [segmentID=466793519703471716]
[currentNodeID=2] [dstNodeID=9] [needTransfer=true] [loadScope=Full]

[2026/06/05 16:26:18.484 +00:00] [INFO] [delegator/delegator_data.go:855]
["forward delete to worker (phase 2: snapshot)..."]
[collectionID=466793519583008866] [segmentID=466793519703471716]
[tsHitDeleteRowNum=1497] [bfHitDeleteRowNum=1497]

[2026/06/05 16:26:18.487 +00:00] [INFO] [delegator/delegator_data.go:909]
["load stream delete done"] [collectionID=466793519583008866]

Native stack excerpt:

_ZNK6milvus7segcore18SegmentGrowingImpl19bulk_subscript_implINS_11FloatVectorEEEv...
../../../internal/core/src/segcore/SegmentGrowingImpl.cpp:1472

_ZNK6milvus7segcore18SegmentGrowingImpl14bulk_subscript...
../../../internal/core/src/segcore/SegmentGrowingImpl.cpp:1155

_ZNK6milvus7segcore24SegmentInternalInterface15FillTargetEntry...
../../../internal/core/src/segcore/SegmentInterface.cpp:239

_ZNK6milvus7segcore24SegmentInternalInterface8Retrieve...
../../../internal/core/src/segcore/SegmentInterface.cpp:163

AsyncRetrieve(...)
../../../internal/core/src/segcore/segment_c.cpp:374

Nearby query workload:

[2026/06/05 16:26:03.477 +00:00] ["received query request"]
[traceID=dd790501e2d1eb73d28a93ec42fd082c]
[collectionID=466793519583008866]
[outputFields="[115,118,112,109,127,100,102,101,110,129,117,106,120,111,124,119,107,125,113,104,105,126,108,121,123,128,122,103,1]"]
[segmentIDs="[]"]

[2026/06/05 16:26:05.554 +00:00] ["received query request"]
[traceID=70500c4304887481b03adcb9aaae2abb]
[collectionID=466793519583008866]
[outputFields="[129,100,1]"]

Anything else?

The crash stack points to SegmentGrowingImpl::bulk_subscript_impl<FloatVector> reading raw vector data by physical offset during FillTargetEntry. The surrounding logs suggest a concurrency/lifecycle issue around query/retrieve plus load/release segment transfer and delete snapshot loading, rather than an external Kubernetes restart.
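If the root cause is indeed a lifecycle race, one common class of guard is to have readers hold a shared lock on the segment while a release takes an exclusive lock. The sketch below shows that pattern only. `GuardedSegment` and its members are hypothetical names, and nothing here is taken from the Milvus code base or confirmed as the actual fix.

```cpp
#include <cstddef>
#include <mutex>
#include <optional>
#include <shared_mutex>
#include <utility>
#include <vector>

// Hypothetical segment wrapper: reads take a shared lock, release takes an
// exclusive lock, so a release cannot free data while a read is in flight.
class GuardedSegment {
 public:
    explicit GuardedSegment(std::vector<float> data) : data_(std::move(data)) {}

    // Returns the value at `offset`, or std::nullopt if the segment has been
    // released or the offset is out of range.
    std::optional<float> read(std::size_t offset) const {
        std::shared_lock<std::shared_mutex> lock(mutex_);
        if (released_ || offset >= data_.size()) {
            return std::nullopt;
        }
        return data_[offset];
    }

    // Waits for in-flight reads to finish, then drops the data.
    void release() {
        std::unique_lock<std::shared_mutex> lock(mutex_);
        released_ = true;
        data_.clear();
        data_.shrink_to_fit();
    }

 private:
    mutable std::shared_mutex mutex_;
    bool released_ = false;
    std::vector<float> data_;
};
```

A read that arrives after `release()` sees `released_ == true` and returns `std::nullopt`, which the caller can report as a query error instead of touching freed memory.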

Grafana links:

Labels

  • kind/bug
  • priority/critical-urgent
  • severity/critical
  • triage/accepted
