Existing Issue
Environment
- Milvus version: unknown-20260604-825016cbd7
- Git commit: 825016c
- Build time: Thu Jun 4 23:15:39 UTC 2026
- Deployment mode: cluster
- MQ type: pulsar
- SDK version: unknown
- OS: Kubernetes / 4am cluster
- CPU/Memory: GOMAXPROCS=4, TotalMem=17179869184
- GPU: N/A
- Others: namespace chaos-testing, instance pulsar-cluster-reinstall-3801
K8s pod list at 2026-06-05 16:44:58 UTC showed the target pod restarted 3 times:
pulsar-cluster-reinstall-3801-milvus-streamingnode-59d6c4d4bp78 1/1 Running 3 (18m ago) 32m 10.104.30.192 4am-node38
pulsar-cluster-reinstall-3801-milvus-streamingnode-59d6c4dt52xj 1/1 Running 2 (32m ago) 32m 10.104.14.238 4am-node18
Current Behavior
A StreamingNode pod restarted after running for a while. Loki logs show that the first two restarts were startup-time failures caused by etcd not being ready:
2026/06/05 16:12:10 panic: failed to create etcd client: context deadline exceeded
2026/06/05 16:12:23 panic: failed to create etcd client: context deadline exceeded
The later runtime restart was different. At 2026-06-05 16:26:18 UTC, the process crashed in the C++ segcore retrieve path while filling query result target fields from a growing FloatVector segment.
No Loki evidence was found for OOMKilled, liveness/readiness probe failure, BackOff, or Kubernetes killing the container.
Expected Behavior
StreamingNode should not crash during concurrent query/retrieve and segment load/release/delete-snapshot transfer. It should either complete the query safely or return an error without terminating the process.
Steps To Reproduce
The exact minimal reproducer is not available yet. The observed CI/test workload had the following pattern:
- Deploy a Milvus cluster with Pulsar MQ using commit 825016cbd7.
- Run concurrent query/search workload while collections are actively receiving inserts/deletes/flushes.
- Trigger load/release segment transfer on StreamingNode.
- Observe StreamingNode crash during a retrieve/query path.
The relevant workload window involved collection 466793519583008866 and channel by-dev-rootcoord-dml_12_466793519583008866v1.
Milvus Log
Runtime crash context:
[2026/06/05 16:26:17.892 +00:00] [INFO] [querynodev2/services.go:443]
["received load segments request"] [traceID=0a357e11364b4ad94a2ec58fd54df871]
[collectionID=466793519583008866] [segmentID=466793519703471716]
[currentNodeID=2] [dstNodeID=9] [needTransfer=true] [loadScope=Full]
[2026/06/05 16:26:18.484 +00:00] [INFO] [delegator/delegator_data.go:855]
["forward delete to worker (phase 2: snapshot)..."]
[collectionID=466793519583008866] [segmentID=466793519703471716]
[tsHitDeleteRowNum=1497] [bfHitDeleteRowNum=1497]
[2026/06/05 16:26:18.487 +00:00] [INFO] [delegator/delegator_data.go:909]
["load stream delete done"] [collectionID=466793519583008866]
Native stack excerpt:
_ZNK6milvus7segcore18SegmentGrowingImpl19bulk_subscript_implINS_11FloatVectorEEEv...
../../../internal/core/src/segcore/SegmentGrowingImpl.cpp:1472
_ZNK6milvus7segcore18SegmentGrowingImpl14bulk_subscript...
../../../internal/core/src/segcore/SegmentGrowingImpl.cpp:1155
_ZNK6milvus7segcore24SegmentInternalInterface15FillTargetEntry...
../../../internal/core/src/segcore/SegmentInterface.cpp:239
_ZNK6milvus7segcore24SegmentInternalInterface8Retrieve...
../../../internal/core/src/segcore/SegmentInterface.cpp:163
AsyncRetrieve(...)
../../../internal/core/src/segcore/segment_c.cpp:374
Nearby query workload:
[2026/06/05 16:26:03.477 +00:00] ["received query request"]
[traceID=dd790501e2d1eb73d28a93ec42fd082c]
[collectionID=466793519583008866]
[outputFields="[115,118,112,109,127,100,102,101,110,129,117,106,120,111,124,119,107,125,113,104,105,126,108,121,123,128,122,103,1]"]
[segmentIDs="[]"]
[2026/06/05 16:26:05.554 +00:00] ["received query request"]
[traceID=70500c4304887481b03adcb9aaae2abb]
[collectionID=466793519583008866]
[outputFields="[129,100,1]"]
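For anyone correlating these excerpts by segmentID or traceID, the following is a minimal triage sketch, not part of Milvus tooling. It assumes every record follows the bracketed layout seen above (a timestamp group, an optional quoted message, and key=value groups), and that quoted values may contain brackets, as outputFields does. The names FIELD and parse_record are hypothetical helpers introduced only for illustration.

```python
import re
from datetime import datetime

# One bracketed group. A quoted value may itself contain brackets,
# e.g. [outputFields="[129,100,1]"], so quoted runs are matched whole.
FIELD = re.compile(r'\[((?:[^\[\]"]|"[^"]*")*)\]')


def parse_record(text):
    """Parse one bracketed log record, possibly wrapped over several lines."""
    groups = FIELD.findall(" ".join(text.split()))
    record = {"ts": None, "msg": None, "fields": {}}
    for group in groups:
        if record["ts"] is None:
            try:
                # Assumed layout: 2026/06/05 16:26:17.892 +00:00
                record["ts"] = datetime.strptime(group, "%Y/%m/%d %H:%M:%S.%f %z")
                continue
            except ValueError:
                pass  # not a timestamp, fall through to the other checks
        if group.startswith('"') and group.endswith('"') and record["msg"] is None:
            record["msg"] = group[1:-1]
        elif "=" in group:
            key, value = group.split("=", 1)
            record["fields"][key] = value.strip('"')
        # Other groups (log level, source location) are ignored in this sketch.
    return record


sample = """[2026/06/05 16:26:17.892 +00:00] [INFO] [querynodev2/services.go:443]
["received load segments request"] [traceID=0a357e11364b4ad94a2ec58fd54df871]
[collectionID=466793519583008866] [segmentID=466793519703471716]
[currentNodeID=2] [dstNodeID=9] [needTransfer=true] [loadScope=Full]"""

rec = parse_record(sample)
print(rec["ts"].isoformat(), rec["msg"], rec["fields"]["segmentID"])
```

Records parsed this way can be filtered on fields such as collectionID or segmentID and sorted by ts to line up the load request, the delete snapshot forwarding, and the query requests around the crash time.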
Anything else?
The crash stack points to SegmentGrowingImpl::bulk_subscript_impl<FloatVector> reading raw vector data by physical offset during FillTargetEntry. The surrounding logs suggest a concurrency/lifecycle issue around query/retrieve plus load/release segment transfer and delete snapshot loading, rather than an external Kubernetes restart.
Grafana links: