Add unified and configurable null handling #1101

Open
ryankert01 wants to merge 1 commit into apache:main from ryankert01:null-handling

Conversation

@ryankert01
Member

Related Issues

Closes #765

Changes

  • Bug fix
  • New feature
  • Refactoring
  • Documentation
  • Test
  • CI/CD pipeline
  • Other

Why

How

  • Unify null value handling between batch and streaming Parquet/Arrow readers by introducing a configurable NullHandling enum (FillZero | Reject)
  • Previously, batch mode silently coerced nulls to 0.0 while streaming mode threw a runtime error; both paths now use the same handle_float64_nulls() helper
  • Defaults to FillZero for backward compatibility; Reject returns a clear error with guidance
  • Threads the policy through PipelineConfig, PyO3 bindings, and the Python QuantumDataLoader builder (.null_handling("fill_zero" | "reject"))
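
The policy described above can be sketched roughly as follows. This is a dependency-free illustration, not the code in this PR: a slice of `Option<f64>` stands in for an Arrow `Float64Array`, and the `NullValueError` type and its message are hypothetical. Only the `NullHandling` variant names and the helper name `handle_float64_nulls` come from the description.

```rust
use std::fmt;

/// Policy for null entries in a Float64 column (variant names from the PR description).
#[derive(Debug, Clone, Copy, PartialEq, Eq, Default)]
pub enum NullHandling {
    /// Replace each null with 0.0 (the backward-compatible default).
    #[default]
    FillZero,
    /// Fail on the first null instead of silently coercing it.
    Reject,
}

/// Hypothetical error type for this sketch; the real crate has its own error enum.
#[derive(Debug, PartialEq)]
pub struct NullValueError {
    /// Row index of the first null that was encountered.
    pub index: usize,
}

impl fmt::Display for NullValueError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(
            f,
            "null value at row {}; use NullHandling::FillZero or clean the input",
            self.index
        )
    }
}

impl std::error::Error for NullValueError {}

/// Stand-in for the shared helper: both the batch and the streaming path
/// would call this so that they agree on what a null means.
/// Assumption: `&[Option<f64>]` models a nullable Float64 column.
pub fn handle_float64_nulls(
    values: &[Option<f64>],
    policy: NullHandling,
) -> Result<Vec<f64>, NullValueError> {
    match policy {
        // Nulls become 0.0; this branch can never fail.
        NullHandling::FillZero => Ok(values.iter().map(|v| v.unwrap_or(0.0)).collect()),
        // Collecting into Result stops at the first null and reports its row.
        NullHandling::Reject => values
            .iter()
            .enumerate()
            .map(|(index, v)| v.ok_or(NullValueError { index }))
            .collect(),
    }
}
```

With a column such as `[Some(1.5), None, Some(3.0)]`, `FillZero` yields the values with the null replaced by 0.0, while `Reject` returns an error pointing at row 1.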

Checklist

  • Added or updated unit tests for all changes
  • Added or updated documentation for all changes

Contributor

@viiccwen viiccwen left a comment

Overall looks good, but the docs need to be updated. I left some comments.

//! }
//! ```

use arrow::array::{Array, Float64Array};

Array is not used in this file.

/// Reads Float64 data from a Parquet file.
///
/// Expects a single Float64 column. For zero-copy access, use [`read_parquet_to_arrow`].
pub fn read_parquet<P: AsRef<Path>>(path: P) -> Result<Vec<f64>> {

Currently the read_* functions (e.g. read_parquet(...)) always use NullHandling::FillZero. Should we adopt the same strategy as the *Reader types (e.g. ParquetReader)?

Something like this?

pub fn read_parquet<P: AsRef<Path>>(path: P) -> Result<Vec<f64>>;
// to
pub fn read_parquet<P: AsRef<Path>>(
    path: P,
    null_handling: Option<NullHandling>,
) -> Result<Vec<f64>>;
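
A minimal sketch of how the suggested `Option<NullHandling>` parameter could behave, assuming `NullHandling` implements `Default` with `FillZero` as the default. The function `read_values` is a hypothetical stand-in for `read_parquet`: it takes an in-memory nullable column instead of a path so that the example has no I/O or Arrow dependency, and it uses `String` as a placeholder error type.

```rust
/// Same policy enum as in the PR description; the Default derive is an assumption.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Default)]
pub enum NullHandling {
    #[default]
    FillZero,
    Reject,
}

/// Hypothetical stand-in for `read_parquet(path, null_handling)`.
/// Passing `None` keeps today's behaviour (nulls become 0.0).
pub fn read_values(
    column: &[Option<f64>],
    null_handling: Option<NullHandling>,
) -> Result<Vec<f64>, String> {
    // None falls back to NullHandling::default(), i.e. FillZero.
    let policy = null_handling.unwrap_or_default();
    column
        .iter()
        .enumerate()
        .map(|(i, v)| match (v, policy) {
            (Some(x), _) => Ok(*x),
            (None, NullHandling::FillZero) => Ok(0.0),
            (None, NullHandling::Reject) => Err(format!("null value at row {i}")),
        })
        .collect()
}
```

One trade-off to weigh: adding a positional `Option` argument changes the signature for every existing caller, who would then have to pass `None` explicitly.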

Development

Successfully merging this pull request may close these issues.

[QDP] Unify Null Value Handling Strategy between Batch and Streaming Modes
