[Data] Add UnresolvedExpr placeholder for future schema resolution #60794

Open
slfan1989 wants to merge 3 commits into ray-project:master from slfan1989:feature/add-unresolved-expr-placeholder

Conversation

@slfan1989
Contributor

Description

This PR introduces UnresolvedExpr as a new expression type to represent column references that have not yet been resolved against a concrete schema. This is a foundational change that prepares the expression system for future schema-aware resolution workflows.

Key changes:

  • Added UnresolvedExpr class with name property and data_type: None
  • Updated all expression visitors to handle UnresolvedExpr (PyArrow, Iceberg, evaluator, collectors, repr visitors)
  • Changed StarExpr.data_type from DataType(object) to None and widened the Expr.data_type annotation to DataType | None for semantic consistency
  • Added comprehensive unit tests and API documentation

Why this is needed:
Currently, the expression system cannot represent column references that haven't been resolved against a schema. This is needed for string-based expression parsing, lazy schema evaluation, and deferred type checking scenarios. UnresolvedExpr serves as an explicit placeholder that fails fast when accidentally evaluated or converted, preventing subtle bugs.

Note: This PR does not implement resolution logic. UnresolvedExpr currently cannot be evaluated or converted—it only serves as a placeholder. The resolver will be added in a follow-up PR.

Additional information

Implementation Details

New Expression Type:

  • UnresolvedExpr(name: str) - immutable dataclass with no defined data_type
  • Structural equality based on name comparison
  • Repr: UNRESOLVED('name') (tree) / unresolved('name') (inline)
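A minimal sketch of the shape described in this list, assuming a frozen dataclass. It is a stand-in written for illustration, not the code from this PR: the `DataType` class below is a hypothetical stub, and the real class presumably derives from Ray's `Expr` base class, which is omitted here.

```python
from dataclasses import FrozenInstanceError, dataclass
from typing import Optional


class DataType:
    """Hypothetical stub standing in for the real DataType class."""


@dataclass(frozen=True, repr=False)
class UnresolvedExpr:
    """Placeholder for a column reference not yet bound to a schema."""

    # The only field, so the generated __eq__ and __hash__ compare by name.
    name: str

    @property
    def data_type(self) -> Optional[DataType]:
        # No concrete type is known until the reference is resolved.
        return None

    def __repr__(self) -> str:
        # Inline form, following the repr convention listed above.
        return f"unresolved({self.name!r})"


a = UnresolvedExpr("price")
b = UnresolvedExpr("price")
print(a == b, a.data_type, repr(a))

try:
    a.name = "qty"  # frozen dataclasses reject attribute assignment
except FrozenInstanceError:
    print("UnresolvedExpr is immutable")
```

Exposing `data_type` as a read-only property rather than a field keeps it out of the equality check, so two placeholders compare equal exactly when their names match.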

Visitor Updates:
All expression visitors now handle UnresolvedExpr:

  • Conversion visitors (_PyArrowExpressionVisitor, _IcebergExpressionVisitor): Raise TypeError with clear error messages
  • Evaluator (NativeExpressionEvaluator): Rejects evaluation with helpful error
  • Column collectors (_ColumnReferenceCollector): Collects unresolved names like regular column references
  • Substitution visitor (_ColumnSubstitutionVisitor): Supports substituting unresolved expressions
  • Repr visitors: Displays unresolved state clearly in debug output
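The fail-fast and collection behaviors listed here can be sketched with a toy expression tree. Everything below is an assumption for illustration: the classes are simplified stand-ins, `AddExpr`, `evaluate`, and `collect_column_names` are hypothetical names, and plain `isinstance` dispatch replaces the visitor classes the PR actually touches.

```python
from dataclasses import dataclass
from typing import List, Union


@dataclass(frozen=True)
class ColumnExpr:
    name: str


@dataclass(frozen=True)
class UnresolvedExpr:
    name: str


@dataclass(frozen=True)
class AddExpr:  # hypothetical binary node, only here to build a tree
    left: "Node"
    right: "Node"


Node = Union[ColumnExpr, UnresolvedExpr, AddExpr]


def evaluate(expr: Node, row: dict) -> float:
    """Toy evaluator: rejects unresolved references instead of guessing."""
    if isinstance(expr, UnresolvedExpr):
        raise TypeError(
            f"Cannot evaluate unresolved column {expr.name!r}; "
            "resolve it against a schema first."
        )
    if isinstance(expr, ColumnExpr):
        return row[expr.name]
    if isinstance(expr, AddExpr):
        return evaluate(expr.left, row) + evaluate(expr.right, row)
    raise TypeError(f"Unsupported expression: {expr!r}")


def collect_column_names(expr: Node) -> List[str]:
    """Toy collector: treats unresolved names like regular column references."""
    if isinstance(expr, (ColumnExpr, UnresolvedExpr)):
        return [expr.name]
    if isinstance(expr, AddExpr):
        return collect_column_names(expr.left) + collect_column_names(expr.right)
    return []


expr = AddExpr(ColumnExpr("a"), UnresolvedExpr("b"))
print(collect_column_names(expr))

try:
    evaluate(expr, {"a": 1.0, "b": 2.0})
except TypeError as err:
    print(f"rejected: {err}")
```

The point of the split is that analysis passes (collecting names) can still work on a partially resolved tree, while anything that needs real data refuses to proceed.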

Type System Alignment:
StarExpr.data_type changes from DataType(object) to None, and the new UnresolvedExpr uses data_type: None as well, making it explicit that both are placeholders without concrete types.

Follow-up Work Plan

The resolution mechanism will be implemented in subsequent PRs:

  • Define creation entry points - Add string-to-expression parsing that produces UnresolvedExpr
  • Implement resolution/binding phase - Design resolver pass that converts UnresolvedExpr → ColumnExpr with schema validation
  • Add resolution strategy - Define resolution rules and error handling for missing/ambiguous columns
  • Comprehensive resolution tests - Cover success cases, schema mismatches, and edge cases
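To make the planned binding phase concrete, here is one way such a resolver pass could look. This is purely a sketch under stated assumptions: the PR does not implement resolution, the function name `resolve`, the dict-based schema, the string type names, and the `AddExpr` node are all hypothetical, and the eventual design may differ.

```python
from dataclasses import dataclass
from typing import Dict, Union


@dataclass(frozen=True)
class ColumnExpr:
    name: str
    data_type: str  # simplified: a type name instead of a DataType object


@dataclass(frozen=True)
class UnresolvedExpr:
    name: str


@dataclass(frozen=True)
class AddExpr:  # hypothetical binary node, only here to build a tree
    left: "Node"
    right: "Node"


Node = Union[ColumnExpr, UnresolvedExpr, AddExpr]


def resolve(expr: Node, schema: Dict[str, str]) -> Node:
    """Return a copy of expr with every UnresolvedExpr bound to the schema."""
    if isinstance(expr, UnresolvedExpr):
        if expr.name not in schema:
            # Schema validation: report the missing column and the options.
            raise KeyError(
                f"Column {expr.name!r} not found in schema; "
                f"available columns: {sorted(schema)}"
            )
        return ColumnExpr(expr.name, schema[expr.name])
    if isinstance(expr, AddExpr):
        return AddExpr(resolve(expr.left, schema), resolve(expr.right, schema))
    return expr  # already resolved, nothing to do


schema = {"price": "double", "qty": "int64"}
resolved = resolve(AddExpr(UnresolvedExpr("price"), UnresolvedExpr("qty")), schema)
print(resolved)
```

Because the nodes are immutable, the pass builds a new tree rather than mutating the input, which keeps the unresolved original available for error reporting.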

This commit introduces UnresolvedExpr as a first-class expression type
to represent column references that have not yet been resolved against
a concrete schema.

Changes:
- Add UnresolvedExpr class with name property and no defined data_type
- Update all expression visitors to handle UnresolvedExpr:
  * PyArrow/Iceberg visitors: fail with clear error messages
  * NativeExpressionEvaluator: reject evaluation with TypeError
  * ColumnReferenceCollector: collect unresolved names
  * ColumnSubstitutionVisitor: support substitution
  * Tree/inline repr visitors: display as UNRESOLVED(name)
- Change StarExpr.data_type from DataType(object) to None for consistency
- Change Expr.data_type type annotation to DataType | None
- Add comprehensive unit tests for UnresolvedExpr
- Update API documentation to include UnresolvedExpr

Rationale:
This prepares the expression system for schema-aware resolution while
ensuring unresolved expressions cannot be accidentally evaluated or
converted. Both StarExpr and UnresolvedExpr now share the same semantic:
placeholders without concrete types.

Note: This PR does not implement resolution logic. The resolver will be
added in a follow-up PR.

Signed-off-by: slfan1989 <slfan1989@apache.org>
@slfan1989 slfan1989 requested a review from a team as a code owner February 6, 2026 01:06

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request introduces UnresolvedExpr as a placeholder for column references that have not yet been resolved. This is a solid foundational change that will enable future schema-aware resolution workflows. The changes are consistently applied across the expression system, including all relevant visitors and the base Expr class. The modification of Expr.data_type to DataType | None is a good move for semantic consistency. The accompanying unit tests are thorough and cover the new functionality well. I have one minor suggestion to improve a docstring for clarity. Overall, this is an excellent contribution.

@ray-gardener ray-gardener bot added the community-contribution Contributed by the community label Feb 6, 2026
slfan1989 and others added 2 commits February 6, 2026 22:00
…valuator.py

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Signed-off-by: slfan1989 <55643692+slfan1989@users.noreply.github.com>
@iamjustinhsu
Contributor

Hi @slfan1989, thanks for the contribution! We are currently revamping Expressions to support Unresolved vs. Resolved Expressions. The team is still aligning on the exact design; in case you're curious, here is the prototype: #59117. Since this change and subsequent changes need to be foundational and extensible, I'm going to hold off on reviewing this PR until we can be sure this is the right approach. You're welcome to join that discussion through our public Slack channel.

@iamjustinhsu iamjustinhsu self-assigned this Feb 6, 2026
@slfan1989
Contributor Author

slfan1989 commented Feb 6, 2026

@iamjustinhsu Thank you for the thoughtful response and for the update on the ongoing revamp!

I’m really excited about the direction of supporting Unresolved vs. Resolved Expressions — it sounds like a clean and useful foundation for future extensions.

I’ll check out prototype #59117 to better understand the planned changes. In the meantime, if there are any parts of my current implementation that align with (or conflict with) the new design, I’d really appreciate any early feedback so I can refactor accordingly.

Also, I’d be happy to join the discussion in the public Slack channel to learn more and see if there’s any way I can help contribute.

Thanks again for your time and for keeping the project moving forward!

Labels

community-contribution Contributed by the community

2 participants