Skip to content

Conversation

@gferrate
Copy link
Contributor

@gferrate gferrate commented Feb 6, 2026

Adds a test reproducing the issue described in #187.

Since the test fails by design, it is marked as #[ignore].

The underlying issue is that metrics are attached to the last FlightData message of the last partition stream on the worker side (in do_get.rs). When a LIMIT is present, the client-side network boundary node (e.g. NetworkShuffleExec) drops the stream once it has enough rows, before the worker finishes all partitions. The final message carrying the metrics is never sent or never consumed, so metrics for that stage are lost.

The test runs a GROUP BY ... LIMIT 1 query against the flights_1m dataset (1M rows). The large dataset ensures the worker is still producing data when the LIMIT causes the client to drop the stream.

Note: This don't actually solve the issue, just showcases that it is really an issue!

@gferrate gferrate force-pushed the gferrate/investigate-187-limit-metrics branch from 9df7855 to aa94f31 Compare February 6, 2026 17:17
@gferrate gferrate changed the title Adds test for replicating test: reproduce metrics loss on early stream termination Feb 6, 2026
@gferrate gferrate marked this pull request as ready for review February 6, 2026 17:40
Comment on lines +395 to +399
panic!(
"Missing metrics for stage key {expected_stage_key:?}. \
The LIMIT caused the stream to be dropped before the worker \
sent the last FlightData message with metrics."
)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ℹ️ This is the bug. Uncomment #[ignore] to reproduce.

Copy link
Collaborator

@gabotechs gabotechs left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💯 nice

/// client drops the stream early. Un-ignore it to reproduce the issue or to verify a fix.
#[tokio::test]
#[ignore]
async fn test_metrics_collection_with_limit_causing_early_stream_termination() {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've seen that another consequence of this is that calling rewrite_distributed_plan_with_metrics on the plan fails with an error that says something like "not enough metrics provided to rewrite task".

This is expected an another way of reproducing the issue. I wonder if rewrite_distributed_plan_with_metrics should be error proof instead?

cc @jayshrivastava

@gabotechs gabotechs merged commit 47e132e into datafusion-contrib:main Feb 9, 2026
7 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants