Collaborator

@jorgenherje jorgenherje commented Dec 18, 2025

Remove usage of Pandas and replace it with Polars.

Note: this PR does not replace Pandas in the flow network access and service layers. That is handled in the following PR: #1395

Replace Pandas with Polars in:

  • Summary vector statistics calculation
  • Parameter utils
  • PVT converter

TODO:


Closes: #1401

Collaborator

@HansKallekleiv HansKallekleiv left a comment

👍
Approved, with a few comments after reading up on Polars.
Probably not important since everything is fairly fast, but it might be worth thinking about at some point.
I think we should try to avoid things like loops and lambdas when working with Polars, as they force serialization between Rust and Python.

@jorgenherje jorgenherje marked this pull request as draft December 19, 2025 08:14
Collaborator Author

jorgenherje commented Dec 19, 2025

During testing of the statistics calculations I found deviations between the results.

I have to verify the statistics calculations/aggregations in Polars against the old version with Pandas+numpy. The old version uses numpy.nanpercentile, numpy.nanmean, etc. to calculate statistics for the summary vectors. The new Polars version uses pl.col().percentile(), pl.col().mean(), etc. These give numerical differences, and in some cases the mean seems to vary a lot. Testing indicates that this is not due to a downcast from f64 to f32, but points towards a difference in the algorithms for mean and percentile?

@sigurdp recalls that the numpy calculation was used rather than Pandas' own statistics because of how the algorithms worked. Perhaps this holds for Polars as well? For example, are percentiles calculated from the entire array, or estimated from parts of the data?

@jorgenherje
Collaborator Author

Further testing shows that numpy performs the aggregation in the same format as the input data, i.e. if the input is float32, the mean and percentiles are computed in float32. This gives numerical inaccuracy compared to Polars, which seems to cast the data to float64 internally and then cast the result back to the input format.

Both Polars and numpy aggregation were tested with input tables in float32 and float64. Polars gives the same numerical result for both float precisions, whereas the numpy results differ between float32 and float64 input. This appears to be because Polars performs the actual mean calculation in float64 even if the input is float32.

During testing, I got the same results with Polars and Pandas+numpy if I cast the array to float64 before aggregating statistics in the Pandas algorithm.
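
A minimal sketch of the numpy side of this comparison, using synthetic data rather than the actual summary vectors. The helper `mean_and_p10` and the choice of the 10th percentile are hypothetical and only serve as an illustration.

```python
import numpy as np


def mean_and_p10(values: np.ndarray) -> tuple[float, float]:
    """NaN-aware mean and 10th percentile, computed in the dtype of `values`."""
    return float(np.nanmean(values)), float(np.nanpercentile(values, 10))


rng = np.random.default_rng(seed=0)

# Synthetic stand-in for one summary vector across realizations, stored as float32.
realizations_f32 = rng.normal(loc=1.0e5, scale=50.0, size=100_000).astype(np.float32)

# Aggregating directly: numpy keeps the float32 precision of the input.
stats_f32 = mean_and_p10(realizations_f32)

# Casting first: the same values are aggregated in float64.
stats_f64 = mean_and_p10(realizations_f32.astype(np.float64))

print("float32 input:", stats_f32)
print("float64 input:", stats_f64)
```

Any difference between the two printed means comes from the accumulation precision alone, since the input values are identical in both cases.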

Conclusion:
Use Polars, and do not cast to float64, as the aggregation methods seem to handle it internally.

@jorgenherje jorgenherje marked this pull request as ready for review January 6, 2026 12:04
Labels: 2025 EOY release, enhancement (New feature or request)

Development

Successfully merging this pull request may close these issues.

Remove usage of Pandas in back-end

2 participants