test: reproduce metrics loss on early stream termination #341

gferrate · 2026-02-06T16:28:42Z

Adds a test reproducing the issue described in #187.

Since the test fails by design, it is marked as #[ignore].

The underlying issue is that metrics are attached to the last FlightData message of the last partition stream on the worker side (in do_get.rs). When a LIMIT is present, the client-side network boundary node (e.g. NetworkShuffleExec) drops the stream once it has enough rows, before the worker finishes all partitions. The final message carrying the metrics is never sent or never consumed, so metrics for that stage are lost.

The test runs a GROUP BY ... LIMIT 1 query against the flights_1m dataset (1M rows). The large dataset ensures the worker is still producing data when the LIMIT causes the client to drop the stream.

Note: This don't actually solve the issue, just showcases that it is really an issue!

gferrate · 2026-02-09T10:44:12Z

src/metrics/task_metrics_collector.rs

+                    panic!(
+                        "Missing metrics for stage key {expected_stage_key:?}. \
+                         The LIMIT caused the stream to be dropped before the worker \
+                         sent the last FlightData message with metrics."
+                    )


ℹ️ This is the bug. Uncomment #[ignore] to reproduce.

gabotechs

💯 nice

gabotechs · 2026-02-09T10:44:19Z

src/metrics/task_metrics_collector.rs

+    /// client drops the stream early. Un-ignore it to reproduce the issue or to verify a fix.
+    #[tokio::test]
+    #[ignore]
+    async fn test_metrics_collection_with_limit_causing_early_stream_termination() {


I've seen that another consequence of this is that calling rewrite_distributed_plan_with_metrics on the plan fails with an error that says something like "not enough metrics provided to rewrite task".

This is expected an another way of reproducing the issue. I wonder if rewrite_distributed_plan_with_metrics should be error proof instead?

cc @jayshrivastava

test: reproduce metrics loss on early stream termination

aa94f31

gferrate force-pushed the gferrate/investigate-187-limit-metrics branch from 9df7855 to aa94f31 Compare February 6, 2026 17:17

gferrate changed the title ~~Adds test for replicating~~ test: reproduce metrics loss on early stream termination Feb 6, 2026

gferrate marked this pull request as ready for review February 6, 2026 17:40

gferrate commented Feb 9, 2026

View reviewed changes

gabotechs approved these changes Feb 9, 2026

View reviewed changes

gabotechs merged commit 47e132e into datafusion-contrib:main Feb 9, 2026
7 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

test: reproduce metrics loss on early stream termination #341

test: reproduce metrics loss on early stream termination #341

gferrate commented Feb 6, 2026 •

edited

Loading

Uh oh!

gferrate Feb 9, 2026

Uh oh!

gabotechs left a comment

Uh oh!

gabotechs Feb 9, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

test: reproduce metrics loss on early stream termination #341

test: reproduce metrics loss on early stream termination #341

Conversation

gferrate commented Feb 6, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

gferrate Feb 9, 2026

Choose a reason for hiding this comment

Uh oh!

gabotechs left a comment

Choose a reason for hiding this comment

Uh oh!

gabotechs Feb 9, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

gferrate commented Feb 6, 2026 •

edited

Loading