-
Notifications
You must be signed in to change notification settings - Fork 28.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SPARK-42746][SQL][FOLLOWUP] Fixing potential flakiness in ListAgg golden files #50357
Conversation
@@ -1,13 +1,13 @@ | |||
-- Test cases with utf8_binary | |||
SELECT listagg(c1) WITHIN GROUP (ORDER BY c1 COLLATE utf8_binary) FROM (VALUES ('a'), ('A'), ('b'), ('B')) AS t(c1); | |||
SELECT listagg(DISTINCT c1 COLLATE utf8_binary) FROM (VALUES ('a'), ('A'), ('b'), ('B')) AS t(c1); | |||
WITH t(c1) AS (SELECT listagg(DISTINCT col1 COLLATE utf8_binary) FROM (VALUES ('a'), ('A'), ('b'), ('B'))) SELECT len(c1), regexp_count(c1, 'a'), regexp_count(c1, 'b'), regexp_count(c1, 'A'), regexp_count(c1, 'B') FROM t; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you explain why you introduce this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In tests in general, there is no guarantee that ordering of the column will be the same as the query insertion time, unless we include an order by clause. This is potentially making tests flaky, as golden file tests need to be regenerated, so this PR is only making sure we do not have to rerun CIs for tests potentially flaky tests. Also, it is better to prevent flakiness first, then to leave it for detection later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we do not need this one if listagg
is deterministic.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Spark aggregate is not deterministic by nature, as the shuffle reader fetches shuffle blocks in a random order, and they can arrive in a random order.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Got it.
thanks, merging to master/4.0! |
…lden files ### What changes were proposed in this pull request? Original PR (#50338) fixed some of the flakiness, but there were more tests that could potentially be flaky. This PR is fixing these issues. ### Why are the changes needed? We should not rely in golden files tests on buffer ordering. This could lead to flakiness in tests and we need to fix it, so that we do not waste resources. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Test only change. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #50357 from mihailom-db/listaggfollowup2. Authored-by: Mihailo Milosevic <[email protected]> Signed-off-by: Wenchen Fan <[email protected]> (cherry picked from commit e4b6e9a) Signed-off-by: Wenchen Fan <[email protected]>
…lden files ### What changes were proposed in this pull request? Original PR (apache#50338) fixed some of the flakiness, but there were more tests that could potentially be flaky. This PR is fixing these issues. ### Why are the changes needed? We should not rely in golden files tests on buffer ordering. This could lead to flakiness in tests and we need to fix it, so that we do not waste resources. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Test only change. ### Was this patch authored or co-authored using generative AI tooling? No. Closes apache#50357 from mihailom-db/listaggfollowup2. Authored-by: Mihailo Milosevic <[email protected]> Signed-off-by: Wenchen Fan <[email protected]>
What changes were proposed in this pull request?
Original PR (#50338) fixed some of the flakiness, but there were more tests that could potentially be flaky. This PR is fixing these issues.
Why are the changes needed?
We should not rely in golden files tests on buffer ordering. This could lead to flakiness in tests and we need to fix it, so that we do not waste resources.
Does this PR introduce any user-facing change?
No.
How was this patch tested?
Test only change.
Was this patch authored or co-authored using generative AI tooling?
No.