Skip to content

bug: AVG OVER WINDOW returns incorrect values #26065

@pomo-mondreganto

Description

@pomo-mondreganto

Checks

  • I have checked that this issue has not already been reported.
  • I have confirmed this bug exists on the latest version of Polars.

Reproducible example

#[test]
fn iss_avg_over_order_by_cumulative() {
    // Test that AVG() OVER (ORDER BY ... ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)
    // produces a running/cumulative average.
    //
    // Currently broken: AVG uses visit_unary(Expr::mean) which doesn't have a cumulative
    // equivalent like SUM/MIN/MAX do (cum_sum, cum_min, cum_max).
    //
    // The manual workaround using SUM()/ROW_NUMBER() works correctly.

    let df = df! {
        "foo" => [1, 2, 3, 4, 5],
        "bar" => [10.0, 20.0, 30.0, 40.0, 50.0]
    }
    .unwrap()
    .lazy();

    let mut ctx = SQLContext::new();
    ctx.register("df", df);

    // This should compute cumulative average: [10, 15, 20, 25, 30]
    let query_avg = r#"
        SELECT
            foo,
            bar,
            AVG(bar) OVER (
                ORDER BY foo
                ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
            ) AS mean_avg
        FROM df
        ORDER BY foo
    "#;

    // This workaround computes cumulative average correctly: [10, 15, 20, 25, 30]
    let query_manual = r#"
        SELECT
            foo,
            bar,
            SUM(bar) OVER (
                ORDER BY foo
                ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
            ) / ROW_NUMBER() OVER (
                ORDER BY foo
                ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
            ) AS mean_manual
        FROM df
        ORDER BY foo
    "#;

    let df_avg = ctx.execute(query_avg).unwrap().collect().unwrap();
    let df_manual = ctx.execute(query_manual).unwrap().collect().unwrap();

    // Expected cumulative averages: 10/1=10, 30/2=15, 60/3=20, 100/4=25, 150/5=30
    let expected_means = vec![10.0, 15.0, 20.0, 25.0, 30.0];

    // Check that manual calculation produces correct results
    let manual_means: Vec<f64> = df_manual
        .column("mean_manual")
        .unwrap()
        .f64()
        .unwrap()
        .into_no_null_iter()
        .collect();
    assert_eq!(
        manual_means, expected_means,
        "Manual SUM/ROW_NUMBER calculation should produce cumulative average"
    );

    // Check that AVG OVER produces the same results (currently fails)
    let avg_means: Vec<f64> = df_avg
        .column("mean_avg")
        .unwrap()
        .f64()
        .unwrap()
        .into_no_null_iter()
        .collect();
    assert_eq!(
        avg_means, expected_means,
        "AVG() OVER (ORDER BY ... ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) should produce cumulative average"
    );
}

Log output

Issue description

AVG function in SQL engine does not have a cumulative option and therefore does not work with windows (returning the average of all rows instead). There's a workaround with sum and row_number over the same window which works, but I believe fixing avg is worth too.

Expected behavior

Queries in the example return the same data.

Installed versions

Details working with a master version of polars directly in polars-sql crate

Metadata

Metadata

Assignees

No one assigned

    Labels

    A-sqlArea: Polars SQL functionalitybugSomething isn't workingneeds triageAwaiting prioritization by a maintainerrustRelated to Rust Polars

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions