feat(variant): add Spark 4.0 VARIANT type and Parquet read support #355

Draft
oarap wants to merge 1 commit into bytedance:main from oarap:oarap_variant_spark

oarap (Collaborator) commented Mar 6, 2026

Introduce TypeKind::VARIANT as Spark-compatible STRUCT<value VARBINARY, metadata VARBINARY> and wire it through vectors, serde, and Spark SQL functions.

  • Add VARIANT type and VariantVector/VariantValue plumbing.
  • Implement Spark VARIANT dictionary parsing and decoding for both bit-coded and compact encodings (compact string length and unordered start-offset tables).
  • Support variant_get / parse_json over Spark VARIANT payloads.
  • Improve Parquet reader integration, including correcting the ScanSpec child-order mismatch for the (value, metadata) fields.
  • Add Spark-generated VARIANT Parquet fixtures and Parquet reader/unit test coverage.
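
For reviewers unfamiliar with the metadata layout, the dictionary decoding mentioned above can be sketched roughly as follows. This is a minimal Python illustration, not the C++ code in this PR. It assumes the layout described in the published Variant encoding spec: one header byte (version in bits 0-3, a sorted-strings flag in bit 4, offset size minus one in bits 6-7), followed by the dictionary size, the offset table, and the UTF-8 key bytes. The names `VariantMetadata` and `parse_variant_metadata` are hypothetical, and the sketch only accepts monotonically non-decreasing offsets, so it does not cover the unordered start-offset case.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class VariantMetadata:
    # Hypothetical container for illustration; not a type from this PR.
    version: int
    sorted_strings: bool
    keys: List[str]


def parse_variant_metadata(buf: bytes) -> VariantMetadata:
    """Decode the key dictionary from a Variant `metadata` buffer (assumed layout)."""
    if not buf:
        raise ValueError("empty metadata buffer")

    header = buf[0]
    version = header & 0x0F                    # bits 0-3
    sorted_strings = bool((header >> 4) & 1)   # bit 4
    offset_size = ((header >> 6) & 0x3) + 1    # bits 6-7 store (size - 1)

    def read_uint(pos: int) -> int:
        # All integers in the metadata are little-endian and offset_size bytes wide.
        if pos + offset_size > len(buf):
            raise ValueError("truncated metadata buffer")
        return int.from_bytes(buf[pos:pos + offset_size], "little")

    dict_size = read_uint(1)
    offsets_start = 1 + offset_size
    # dict_size + 1 offsets: the last one marks the end of the final string.
    offsets = [
        read_uint(offsets_start + i * offset_size) for i in range(dict_size + 1)
    ]
    bytes_start = offsets_start + (dict_size + 1) * offset_size

    keys = []
    for begin, end in zip(offsets, offsets[1:]):
        if begin > end or bytes_start + end > len(buf):
            raise ValueError("invalid string offsets")
        keys.append(buf[bytes_start + begin:bytes_start + end].decode("utf-8"))

    return VariantMetadata(version, sorted_strings, keys)
```

For example, the buffer `bytes([0x01, 2, 0, 1, 3]) + b"abc"` (version 1, one-byte offsets, two keys) decodes to the keys `a` and `bc`.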

What problem does this PR solve?

Issue Number: close #xxx

Type of Change

  • 🐛 Bug fix (non-breaking change which fixes an issue)
  • ✨ New feature (non-breaking change which adds functionality)
  • 🚀 Performance improvement (optimization)
  • ⚠️ Breaking change (fix or feature that would cause existing functionality to change)
  • 🔨 Refactoring (no logic changes)
  • 🔧 Build/CI or Infrastructure changes
  • 📝 Documentation only

Performance Impact

  • No Impact: This change does not affect the critical path (e.g., build system, doc, error handling).

  • Positive Impact: I have run benchmarks.

  • Negative Impact: Explained below (e.g., trade-off for correctness).

Checklist (For Author)

  • I have added/updated unit tests (ctest).
  • I have verified the code with local build (Release/Debug).
  • I have run clang-format / linters.
  • (Optional) I have run Sanitizers (ASAN/TSAN) locally for complex C++ changes.
  • No need to test or manual test.

Breaking Changes

  • No

  • Yes (Description: ...)


oarap force-pushed the oarap_variant_spark branch from a0c26eb to 84ece88 on March 18, 2026 16:35
oarap force-pushed the oarap_variant_spark branch from 84ece88 to d90040e on March 18, 2026 17:46