fix: union all by name #15603

Open
wants to merge 3 commits into
base: main

Conversation

chenkovsky
Contributor

Which issue does this PR close?

Rationale for this change

Currently, the schema from the inner physical plan is returned.

What changes are included in this PR?

Update UnionExec and RecordBatchStreamAdapter to transform the schema.

I found that the logical plan's nullability for Projection is also incorrect after optimization, but this does not make the test fail, so I have not included that part in this PR. Do we need to correct the logical plan?

Are these changes tested?

Yes, with unit tests.

Are there any user-facing changes?

No

@chenkovsky chenkovsky marked this pull request as ready for review April 6, 2025 10:55
@github-actions github-actions bot added the sqllogictest SQL Logic Tests (.slt) label Apr 6, 2025
Contributor

@alamb alamb left a comment

Thank you for this @chenkovsky (and all the other PRs recently -- very much appreciated)

@@ -362,6 +362,8 @@ pin_project! {

#[pin]
stream: S,

transform_schema: bool,
Contributor

This seems to be fixing the symptom rather than the root cause.

I think it would be better to have the correct schema reflected in the plan in the first place 🤔

Contributor Author

Yes, correct nullability in the schema would be better. I tried to fix the logical plan earlier.

But nullability in the logical plan does not affect the physical plan; it is ignored.

LogicalPlan::Projection(Projection { input, expr, .. }) => self

In the physical plan, nullability is recomputed from bottom to top.

e.nullable(&input_schema)?,

But in this scenario, it seems that we need to pass nullability from top to bottom.

I would appreciate more suggestions.
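To make the bottom-up behaviour described above concrete, here is a minimal, std-only sketch. It is not DataFusion code: `Field`, `Expr`, `project_schema`, and `union_schema` are hypothetical stand-ins, and the rule that a union column is nullable when any branch is nullable is an assumption of this model. It shows how a projection that pads a missing column with a NULL literal derives a schema that differs from the schema of the union above it.

```rust
// Hypothetical, simplified model; none of these types are DataFusion's.
#[derive(Debug, Clone, PartialEq)]
struct Field {
    name: String,
    nullable: bool,
}

type Schema = Vec<Field>;

#[derive(Debug, Clone)]
enum Expr {
    /// Reference to an input column by index.
    Column(usize),
    /// A literal value; `None` models SQL NULL.
    Literal(Option<i64>),
}

impl Expr {
    /// Bottom-up rule: nullability depends only on the expression and its input.
    fn nullable(&self, input: &Schema) -> bool {
        match self {
            Expr::Column(i) => input[*i].nullable,
            Expr::Literal(v) => v.is_none(),
        }
    }
}

/// Schema that a projection derives from its own expressions.
fn project_schema(exprs: &[(&str, Expr)], input: &Schema) -> Schema {
    exprs
        .iter()
        .map(|(name, e)| Field {
            name: name.to_string(),
            nullable: e.nullable(input),
        })
        .collect()
}

/// Assumed union rule: a column is nullable if it is nullable in any branch.
fn union_schema(branches: &[Schema]) -> Schema {
    let mut out = branches[0].clone();
    for branch in &branches[1..] {
        for (f, g) in out.iter_mut().zip(branch) {
            f.nullable |= g.nullable;
        }
    }
    out
}
```

In this model, a branch that projects `x` from a non-nullable column and pads `y` with NULL reports `x` as non-nullable, while the union over both branches reports every column as nullable. Nothing flows back down to tell the branch that its parent expects a wider schema.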

Contributor Author

I wanted to see how Spark handles this.

For the logical plan, I haven't found any logic that handles this problem.

https://github.com/apache/spark/blob/75d80c7795ca71d24229010ab04ae740473126aa/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/basicLogicalOperators.scala#L475

For the physical plan, things are much easier in Spark: its InternalRow is schemaless, so it uses the schema of the physical plan by default. A RecordBatch, by contrast, carries its own schema.

https://github.com/apache/spark/blob/75d80c7795ca71d24229010ab04ae740473126aa/sql/core/src/main/scala/org/apache/spark/sql/execution/basicPhysicalOperators.scala#L688

I'm not 100% sure, but I think the current logical plan and physical plan schemas are correct. The root cause is that the RecordBatch's schema doesn't match the physical plan's, so adding an adapter is a reasonable approach.
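The adapter idea can be sketched without any Arrow or DataFusion types. In the std-only example below, `Field`, `Batch`, `accepts`, and `SchemaAdapter` are all hypothetical names chosen for illustration, and a plain iterator stands in for an async stream. The adapter re-stamps every batch with the schema the plan declares, and rejects a batch whose schema cannot be widened into it.

```rust
// Hypothetical stand-ins; not the Arrow or DataFusion API.
#[derive(Debug, Clone, PartialEq)]
struct Field {
    name: String,
    nullable: bool,
}

type Schema = Vec<Field>;

#[derive(Debug, Clone, PartialEq)]
struct Batch {
    schema: Schema,
    rows: usize,
}

/// True when `declared` can describe data produced under `actual`:
/// same column names in the same order, and nullability may only widen.
fn accepts(declared: &Schema, actual: &Schema) -> bool {
    declared.len() == actual.len()
        && declared
            .iter()
            .zip(actual)
            .all(|(d, a)| d.name == a.name && (d.nullable || !a.nullable))
}

/// Wraps a batch iterator and replaces each batch's schema with `schema`.
struct SchemaAdapter<I> {
    inner: I,
    schema: Schema,
}

impl<I: Iterator<Item = Result<Batch, String>>> Iterator for SchemaAdapter<I> {
    type Item = Result<Batch, String>;

    fn next(&mut self) -> Option<Self::Item> {
        // Pass through end-of-stream and upstream errors unchanged.
        let batch = match self.inner.next()? {
            Ok(b) => b,
            Err(e) => return Some(Err(e)),
        };
        if !accepts(&self.schema, &batch.schema) {
            return Some(Err(format!("schema mismatch: {:?}", batch.schema)));
        }
        Some(Ok(Batch {
            schema: self.schema.clone(),
            rows: batch.rows,
        }))
    }
}
```

The compatibility check runs on every batch in this sketch, which mirrors the per-batch cost that is raised as a concern later in the review.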

let ret = this.stream.poll_next(cx);
if transform_schema {
if let Poll::Ready(Some(Ok(batch))) = ret {
return Poll::Ready(Some(batch.with_schema(schema).map_err(|e| {
Contributor

I think this is one of the notorious problems where the logical schema doesn't match the physical one on nullability/metadata. But this change might have a performance impact: although the schema change itself just reassigns the value, it also calls schema_contains, which may be expensive.

Contributor Author

Yes, there is a performance concern now. If this approach is feasible, I can try to optimize it, maybe by using RecordBatch::try_new_with_options.
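One possible way to reduce the per-batch cost, shown here only as a std-only sketch and not as the approach proposed in this PR, is to validate a given input schema once and skip the check while later batches share the same schema pointer. `CachedCheck`, `restamp`, and `accepts` are hypothetical names, and `Rc` stands in for the shared schema reference a real stream would carry.

```rust
use std::rc::Rc;

// Hypothetical stand-ins; not the Arrow or DataFusion API.
#[derive(Debug, PartialEq)]
struct Field {
    name: String,
    nullable: bool,
}

type SchemaRef = Rc<Vec<Field>>;

/// Same names in the same order; nullability may only widen.
fn accepts(declared: &[Field], actual: &[Field]) -> bool {
    declared.len() == actual.len()
        && declared
            .iter()
            .zip(actual)
            .all(|(d, a)| d.name == a.name && (d.nullable || !a.nullable))
}

struct CachedCheck {
    declared: SchemaRef,
    /// Last input schema that passed validation.
    last_ok: Option<SchemaRef>,
    /// Number of full comparisons performed (for illustration only).
    checks: usize,
}

impl CachedCheck {
    fn new(declared: SchemaRef) -> Self {
        CachedCheck { declared, last_ok: None, checks: 0 }
    }

    /// Returns the declared schema to stamp onto a batch whose schema is `actual`.
    fn restamp(&mut self, actual: &SchemaRef) -> Result<SchemaRef, String> {
        // A pointer comparison is cheap; the field-by-field check is not.
        let seen = self.last_ok.as_ref().map_or(false, |s| Rc::ptr_eq(s, actual));
        if !seen {
            self.checks += 1;
            if !accepts(&self.declared, actual) {
                return Err("schema mismatch".to_string());
            }
            self.last_ok = Some(Rc::clone(actual));
        }
        Ok(Rc::clone(&self.declared))
    }
}
```

This relies on the assumption that batches from one input share a single schema allocation. If every batch carried a freshly allocated schema, the cache would never hit and the full check would still run each time.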

@Omega359
Contributor

Omega359 commented Apr 9, 2025

Thanks for looking into the nullable issue, it's been on my plate for a bit to look into some more. It's really the last blocker I know of for union by name to work correctly.

Labels
sqllogictest SQL Logic Tests (.slt)
Development

Successfully merging this pull request may close these issues.

physical-expr: Nullability of Literal is not determined by surrounding context
4 participants