Question on per-file stats #5299

dentiny · 2025-10-05T06:15:12Z

According to https://github.com/delta-io/delta/blob/master/PROTOCOL.md#Per-file-Statistics, the per-file stats look like this when modeled as a Rust struct:

use std::collections::HashMap;

use serde::{Deserialize, Serialize};

#[derive(Debug, Clone, Serialize, Deserialize, Default)]
pub struct Stats {
    /// Total number of physical records in the file.
    #[serde(rename = "numRecords")]
    pub num_records: i64,

    /// Whether per-column min/max bounds are tight.
    #[serde(rename = "tightBounds", skip_serializing_if = "Option::is_none")]
    pub tight_bounds: Option<bool>,

    /// Minimum values per column (may be nested).
    #[serde(rename = "minValues", skip_serializing_if = "Option::is_none")]
    pub min_values: Option<HashMap<String, serde_json::Value>>,

    /// Maximum values per column (may be nested).
    #[serde(rename = "maxValues", skip_serializing_if = "Option::is_none")]
    pub max_values: Option<HashMap<String, serde_json::Value>>,

    /// Null counts per column.
    #[serde(rename = "nullCount", skip_serializing_if = "Option::is_none")]
    pub null_count: Option<HashMap<String, i64>>,
}

These maps are keyed by column name. That means that after a schema evolution, such as renaming a column or dropping a column and re-creating one with the same name, it's hard to tell which column a file's stats actually refer to.

I'm wondering why we don't key the stats by field ID, as Iceberg does?
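To make the concern concrete, here is a minimal sketch using only the standard library. The `Column` type, the helper functions, and the example column names and IDs are all hypothetical and exist only for illustration; they are not Delta's or Iceberg's actual data structures. It contrasts a null-count lookup keyed by column name with one keyed by a stable field ID, for an old file whose stats were written before the schema changed.

```rust
use std::collections::HashMap;

/// Hypothetical column descriptor: a stable numeric ID plus the current name.
#[derive(Debug, Clone, PartialEq)]
struct Column {
    field_id: u32,
    name: String,
}

/// Stats for an old file, written when the schema was { field_id 1: "price" }.
/// Returns the same null count keyed two ways: by name and by field ID.
fn old_file_stats() -> (HashMap<String, i64>, HashMap<u32, i64>) {
    let by_name = HashMap::from([("price".to_string(), 3)]);
    let by_id = HashMap::from([(1u32, 3)]);
    (by_name, by_id)
}

/// Look up a null count in stats keyed by column name.
fn null_count_by_name(stats: &HashMap<String, i64>, col: &Column) -> Option<i64> {
    stats.get(&col.name).copied()
}

/// Look up a null count in stats keyed by field ID.
fn null_count_by_id(stats: &HashMap<u32, i64>, col: &Column) -> Option<i64> {
    stats.get(&col.field_id).copied()
}

fn main() {
    let (by_name, by_id) = old_file_stats();

    // Case 1: the column was renamed from "price" to "cost" (same field ID).
    let renamed = Column { field_id: 1, name: "cost".to_string() };

    // Case 2: "price" was dropped and a new, unrelated "price" was created
    // (same name, new field ID).
    let recreated = Column { field_id: 2, name: "price".to_string() };

    // Name-keyed lookup loses the stats after a rename and picks up
    // stale stats for the re-created column.
    println!("by name, renamed:   {:?}", null_count_by_name(&by_name, &renamed));
    println!("by name, recreated: {:?}", null_count_by_name(&by_name, &recreated));

    // ID-keyed lookup follows the column across the rename and finds
    // nothing for the re-created column.
    println!("by id, renamed:     {:?}", null_count_by_id(&by_id, &renamed));
    println!("by id, recreated:   {:?}", null_count_by_id(&by_id, &recreated));
}
```

In this sketch the name-keyed map returns nothing for the renamed column and returns the old column's count for the re-created one, which is the ambiguity I'm asking about.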

Replies: 0 comments
