Python Polars 1.0.0-rc.2
Pre-release
💥 Breaking changes
- Make `hive_partitioning` parameter default to `None`, which is automatically enabled for single directory inputs, and disabled otherwise (#17106)
- Split `replace` functionality into two separate functions (#16921)
- Default to writing binview data to IPC (#17084)
- Remove re-export of type aliases (#17032)
- Add `strict` parameter to `DataFrame/LazyFrame.drop` and fix behavior to default to True (#17044)
- Rename `ModuleUpgradeRequired` and `PolarsPanicError` error, remove `InvalidAssert` error (#17033)
- Change data orientation inference logic for DataFrame construction and warn when row orientation is inferred (#16976)
- Properly apply `strict` parameter in Series constructor (#16939)
- Remove supertype definition of List and non-List types (#16918)
- Consistently convert to given time zone in Series constructor (#16828)
- Update `reshape` to return Array types instead of List types (#16825)
- Default to raising on out-of-bounds indices in all `get`/`gather` operations (#16841)
- Native `selector` XOR set operation, guarantee consistent selector column-order (#16833)
- Set `infer_schema_length` as keyword-only argument in `str.json_decode` (#16835)
- Update `set_sorted` to only accept a single column (#16800)
- Remove deprecated parameters in `Series.cut/qcut` and update struct field names (#16741)
- Expedited removal of certain deprecated functionality (#16754)
- Update some error types to more appropriate variants (#15030)
- Scheduled removal of deprecated functionality (#16715)
- Change default `offset` in `group_by_dynamic` from 'negative `every`' to 'zero' (#16658)
- Constrain access to globals from `DataFrame.sql` in favor of top-level `pl.sql` (#16598)
- Read 2D NumPy arrays as `Array` type instead of `List` (#16710)
- Update `clip` to no longer propagate nulls in the given bounds (#14413)
- Change `str.to_datetime` to default to microsecond precision for format specifiers `"%f"` and `"%.f"` (#13597)
- Update resulting column names in `pivot` when pivoting by multiple values (#16439)
- Preserve nulls in `ewm_mean`, `ewm_std`, and `ewm_var` (#15503)
- Restrict casting for temporal data types (#14142)
- Support Decimal types by default when converting from Arrow (#15324)
- Remove serde functionality from `pl.read_json` and `DataFrame.write_json` (#16550)
- Update function signature of `nth` to allow positional input of indices, remove `columns` parameter (#16510)
- Rename struct fields of `rle` output to `len`/`value` and update data type of `len` field (#15249)
- Remove class variables from some DataTypes (#16524)
- Add `check_names` parameter to `Series.equals` and default to `False` (#16610)
⚠️ Deprecations
- Deprecate `size` parameter in parametric testing strategies in favor of `min_size`/`max_size` (#17128)
- Split `replace` functionality into two separate functions (#16921)
- Rename `DataFrame.melt` to `unpivot` and make parameters consistent with `pivot` (#17095)
- Remove re-export of exceptions at top-level (#17059)
- Deprecate `dt.mean`/`dt.median` in favor of `mean`/`median` (#16888)
- Deprecate `LazyFrame.with_context` in favor of horizontal concatenation (#16860)
- Rename parameter `descending` to `reverse` in `top_k` methods (#16817)
- Rename `str.concat` to `str.join` and update default delimiter (#16790)
- Deprecate `arctan2d` in favor of `arctan2(...).degrees()` (#16786)
🚀 Performance improvements
- create UniqueKernel and improve bool implementation (#17160)
- parallel linearize in new streaming engine (#17050)
- Default to writing binview data to IPC (#17084)
- Parallelize arrow conversion if binview -> large_bin (#17083)
- GC buffers in if_then_else view kernel (#16993)
- Desugar `AND` filter into multiple nodes (#16992)
- Optimize generic argsort of row-encoding (#16894)
- Improve rle_id iteration perf and set sorted flags (#16893)
- Optimize string/binary sort (#16871)
- Use `split_at` in `split` (#16865)
- Use `split_at` instead of double slice in chunk splits. (#16856)
- Don't rechunk in `align_` if arrays are aligned (#16850)
- Don't create small chunks in parallel collect. (#16845)
- Add dedicated no-null branch in `arg_sort` (#16808)
- Speed up `dt.offset_by` 2x for constant durations (#16728)
- Toggle coalesce in `join` if non-coalesced key isn't projected (#16677)
- Make `dt.truncate` 1.5x faster when `every` is just a single duration (and not an expression) (#16666)
- Always prune unused columns in semi/anti join (#16665)
✨ Enhancements
- Support reading byte stream split encoded floats and doubles in parquet (#17099)
- Add `float_scientific` option to `write_csv/sink_csv` (#17111)
- Support `Struct` field selection in the SQL engine, `RENAME` and `REPLACE` select wildcard options (#17109)
- Update `DataFrame.pivot` to allow `index=None` when `values` is set (#17126)
- Make `hive_partitioning` parameter default to `None`, which is automatically enabled for single directory inputs, and disabled otherwise (#17106)
- Improve ipython autocomplete for LazyFrame and DataFrame (#17091)
- Split `replace` functionality into two separate functions (#16921)
- Improve schema inference for hive partitions (#17079)
- Rename `DataFrame.melt` to `unpivot` and make parameters consistent with `pivot` (#17095)
- print row index in explain + dot (#17074)
- Support top-level `pl.col` autocompletion for iPython (#17080)
- Remove re-export of exceptions at top-level (#17059)
- predicate + projection pushdown in NDJson (#17068)
- Allow (non-)coalescing in join_asof (#17066)
- Turn off coalescing and fix mutation of join on expressions (#17061)
- Expand NDJson glob into one SCAN (#17063)
- Do not parse hive partitions from user provided base directory path (#17055)
- Support directory paths in scans for Parquet, IPC and CSV (#17017)
- Implement general array equality checks (#17043)
- Add `strict` parameter to `DataFrame/LazyFrame.drop` and fix behavior to default to True (#17044)
- Rename `ModuleUpgradeRequired` and `PolarsPanicError` error, remove `InvalidAssert` error (#17033)
- Add `rechunk` parameter to `read_delta` (#16991)
- allow experimental metadata use on release (#17005)
- first working prototype of new streaming engine (#16970)
- Add simple version of `json_normalize` (#17015)
- Change data orientation inference logic for DataFrame construction and warn when row orientation is inferred (#16976)
- Desugar `AND` filter into multiple nodes (#16992)
- Handle textio even if not correct (#16971)
- Properly apply `strict` parameter in Series constructor (#16939)
- Add SQL support for `INTERSECT` and `EXCEPT` ops (#16960)
- Add `PerformanceWarning` to LazyFrame properties (#16964)
- Add `collect_schema` method to `LazyFrame` and `DataFrame` (#16929)
- Allow setting file cache TTL on a per-file basis (#16891)
- Support Decimal inputs for `lit` (#16950)
- Implement multiply and division for lhs duration (#16948)
- Raise on invalid temporal arithmetic (#16934)
- Always end with an in-memory sink on collect (#16928)
- add style namespace (which defers to Great Tables) (#16809)
- Add `Schema` class (#16873)
- Normalize `value_counts` (#16917)
- add `eq/ne` for more `FixedSizeList`s (#16902)
- setup skeleton (#16900)
- add fundamentals for new async-based streaming execution engine (#16884)
- Cache downloaded cloud IPC files (#16892)
- Consistently convert to given time zone in Series constructor (#16828)
- Improve `read_csv` SQL table reading function defaults (better handle dates) (#16866)
- Support SQL `VALUES` clause and inline renaming of columns in CTE & derived table definitions (#16851)
- Support Python `Enum` values in `lit` (#16858)
- convert to given time zone in `.str.to_datetime` when values are offset-aware (#16742)
- Update `reshape` to return Array types instead of List types (#16825)
- Default to raising on out-of-bounds indices in all `get`/`gather` operations (#16841)
- Support SQL "SELECT" with no tables, optimise registration of globals (#16836)
- Native `selector` XOR set operation, guarantee consistent selector column-order (#16833)
- Extend recognised `EXTRACT` and `DATE_PART` SQL part abbreviations (#16767)
- Improve error message when raising integers to negative integers, improve docs (#16827)
- Return datetime for mean/median of Date column (#16795)
- Update `set_sorted` to only accept a single column (#16800)
- Expose overflowing cast (#16805)
- Update `group_by` iteration and `partition_by` to always return tuple keys (#16793)
- Support array arithmetic for equally sized shapes (#16791)
- Expedited removal of certain deprecated functionality (2) (#16779)
- Removal of `read_database_uri` passthrough from `read_database` (#16783)
- Remove `pyxlsb` engine from `read_database` (#16784)
- Add `check_order` parameter to `assert_series_equal` (#16778)
- Enforce deprecation of keyword arguments as positional (#16755)
- Support cloud storage in `scan_csv` (#16674)
- Streamline SQL `INTERVAL` handling and improve related error messages, update `sqlparser-rs` lib (#16744)
- Support use of ordinal values in SQL `ORDER BY` clause (#16745)
- Support executing polars SQL against `pandas` and `pyarrow` objects (#16746)
- Remove deprecated parameters in `Series.cut/qcut` and update struct field names (#16741)
- Expedited removal of certain deprecated functionality (#16754)
- Remove deprecated functionality from rolling methods (#16750)
- Update `date_range` to no longer produce datetime ranges (#16734)
- Mark `min_periods` as keyword-only for `rolling` methods (#16738)
- Remove deprecated `top_k` parameters `nulls_last`, `maintain_order`, and `multithreaded` (#16599)
- Support order-by in window functions (#16743)
- Add SQL support for `NULLS FIRST/LAST` ordering (#16711)
- Update some error types to more appropriate variants (#15030)
- Initial SQL support for `INTERVAL` strings (#16732)
- Scheduled removal of deprecated functionality (2) (#16724)
- Scheduled removal of deprecated functionality (#16715)
- Enforce deprecation of `offset` arg in `truncate` and `round` (#16655)
- Change default `offset` in `group_by_dynamic` from 'negative `every`' to 'zero' (#16658)
- Constrain access to globals from `DataFrame.sql` in favor of top-level `pl.sql` (#16598)
- Read 2D NumPy arrays as `Array` type instead of `List` (#16710)
- Update `clip` to no longer propagate nulls in the given bounds (#14413)
- Change `str.to_datetime` to default to microsecond precision for format specifiers `"%f"` and `"%.f"` (#13597)
- Update resulting column names in `pivot` when pivoting by multiple values (#16439)
- Preserve nulls in `ewm_mean`, `ewm_std`, and `ewm_var` (#15503)
- Restrict casting for temporal data types (#14142)
- Add many more auto-inferable datetime formats for `str.to_datetime` (#16634)
- Support Decimal types by default when converting from Arrow (#15324)
- Remove serde functionality from `pl.read_json` and `DataFrame.write_json` (#16550)
- Update function signature of `nth` to allow positional input of indices, remove `columns` parameter (#16510)
- Rename struct fields of `rle` output to `len`/`value` and update data type of `len` field (#15249)
- Remove class variables from some DataTypes (#16524)
- Add `check_names` parameter to `Series.equals` and default to `False` (#16610)
- Dedicated `SQLInterface` and `SQLSyntax` errors (#16635)
- Add `DIV` function support to the SQL interface (#16678)
- Support non-coalescing streaming left join (#16672)
- Allow wildcard and exclude before struct expansions (#16671)
🐞 Bug fixes
- Use explicit turbofish to help rustc (#17159)
- Raise on invalid set dtypes (#17157)
- Fix corrupted reads for hive parts from cloud and projection pushdown failure on hive parts (#17152)
- Set intersection supertype (#17154)
- `ChainedWhen` should not inherit `Expr` (#17142)
- Fix decompress_impl for csv with n_rows set (#17118)
- adds "polars-ops/timezones" dependency for "timezones" feature (#17115)
- Fix incorrect window std for chunked series (#17110)
- make `GetOutput::get_field` fallible (#17114)
- Fix melt panic (#17088)
- Fix expression autocomplete in ipython (#17072)
- Exclude index from expansion in rolling/group_by_dynamic (#17086)
- Update some `Series` dunder method type signatures (#17053)
- Fix oob of join with literals and empty table (#17047)
- Don't silently accept multi-table FROM clauses (implicit JOIN syntax) (#17028)
- Don't split up ANDed filters that are group-aware (#17031)
- Harden "async" check for users with out-of-date `sqlalchemy` libraries (#17029)
- error when sort_by of unequal length (#17026)
- properly catch not found explode cols (#17020)
- Correctly convert data frames to NumPy for C index order (#17000)
- Raise on invalid arithmetic shapes (#16986)
- Don't pushdown predicates in cross join if they refer to both tables (#16983)
- Fix projection pushdown with literal joins (#16981)
- Fix edge case in DataFrame constructor data orientation inference (#16975)
- Raise on list of objects (#16959)
- Handle strictness for Decimal Series construction (#15309)
- Don't panic in object to anyvalue (#16957)
- properly set `FAST_EXPLODE_LIST` metadata (#16951)
- Raise informative error when writing object to file (#16954)
- Remove supertype definition of List and non-List types (#16918)
- Remove unwrap in `extend()` (#16890)
- Fix `should_rechunk` check (#16852)
- Ensure `read_excel` and `read_ods` return identical frames across all engines when given empty spreadsheet tables (#16802)
- Consistent behaviour when "infer_schema_length=0" for `read_excel` (#16840)
- Standardised additional SQL interface errors (#16829)
- Ensure that split ChunkedArray also flattens chunks (#16837)
- Reduce needless panics in comparisons (#16831)
- Reset if next caller clones inner series (#16812)
- Raise on non-positive json schema inference (#16770)
- Rewrite implementation of `top_k/bottom_k` and fix a variety of bugs (#16804)
- Fix comparison of UInt64 with zero (#16799)
- Fix incorrect parquet statistics written for UInt64 values > Int64::MAX (#16766)
- Fix boolean distinct (#16765)
- `DATE_PART` SQL syntax/parsing, improve some error messages (#16761)
- Include `pl.` qualifier for inner dtypes in `to_init_repr` (#16235)
- Column selection wasn't applied when reading CSV with no rows (#16739)
- Panic on empty df / null List(Categorical) (#16730)
- Only flush if operator can flush in streaming outer join (#16723)
- Raise unsupported cat array (#16717)
- Assert SQLInterfaceError is raised (#16713)
- Restrict casting for temporal data types (#14142)
- Handle nested categoricals in `assert_series_equal` when `categorical_as_str=True` (#16700)
- Improve `read_database` check for SQLAlchemy async Session objects (#16680)
- Reduce scope of multi-threaded numpy conversion (#16686)
- Full null on dyn int (#16679)
- Fix filter shape on empty null (#16670)
📖 Documentation
- Add doc examples to `concat_list` (#17127)
- Add "coming from pandas" note to `DataFrame.unique` docstring (#17119)
- Fix some warnings during doc build (#17077)
- Properly expose `InProcessQuery` in docs, mark as unstable (#17097)
- Add upgrade guide for Python Polars 1.0.0 (#16914)
- Lots of additions to the SQL reference docs (#16990)
- Minor doctest fixes (#17002)
- Include a doc entry for every exception type (#17001)
- fixup bullet points in write_parquet (#16909)
- Update version switcher for 1.0.0 prereleases (#16847)
- Update link from Python API reference to user guide (#16849)
- Update docstring/test/etc usage of `select` and `with_columns` to idiomatic form (#16801)
- Update versioning docs for 1.0.0 (#16757)
- Add docstring example for `DataFrame.limit` (#16753)
- Fix incorrectly stated value of `include_nulls` in `DataFrame.update` docstring (#16701)
- Update deprecation docs in the user guide (#14315)
- Add example for index count in `DataFrame.rolling` (#16600)
- Improve docstring of `Expr/Series.map_elements` (#16079)
- Add missing `polars.sql` docs entry and small docstring update (#16656)
📦 Build system
- Do not change environment on import (#17101)
- Fix config flag for Tracemalloc (#17098)
- Pin optional NumPy dependency to `< 2.0.0` for now (#17060)
🛠️ Other improvements
- Add missing spaces in `cargo.toml` (#17145)
- Update rustc 2024-06-23 (#17135)
- Minor test refactor for `concat_list` (#17120)
- Remove re-export of data type groups (#17073)
- Add pivot test #17081 (#17090)
- Minor cleanup to better define boundaries of public API (#17051)
- Support directory paths in scans for Parquet, IPC and CSV (#17017)
- Remove re-export of type aliases (#17032)
- Remove file cache test (#17038)
- Update exception imports in test suite (#17035)
- Point polars-stream to crates/ again (#17024)
- Fix failing file cache test in CI (#17014)
- Add some parametric tests for sort functionality (#17008)
- Pin NumPy to <2.0 for now (#16999)
- Use proper join type in test (#16994)
- Fix file cache verbose logging leakage during pytest (#16984)
- Skip another intermittently failing AWS test (#16980)
- Update test suite to explicitly use `orient="row"` in DataFrame constructor when applicable (#16977)
- Remove redundant projection attribute in IR::DataFrameScan (#16952)
- Factor out some apply calls in duration namespace (#16941)
- extend new streaming engine with some initial nodes (#16940)
- Skip intermittently failing AWS test (#16908)
- Refactor expression parsing utils (#16906)
- setup skeleton (#16900)
- Refactor parts of IR. (#16899)
- Move around some existing tests (#16877)
- Remove inner `Arc` from `FileCacheEntry` (#16870)
- Do not update stable API reference on prerelease (#16846)
- Update links to API references (#16843)
- Prepare update of API reference URLs (#16816)
- Rename allow_overflow to wrap_numerical (#16807)
- Set `infer_schema_length` as keyword-only argument in `str.json_decode` (#16835)
- Don't enter streaming engine for groupby-> agg mean/median … (#16810)
- Improve safety of amortized_iter (#16820)
- Remove needless inner type clone (#16718)
- Fix incorrect debug assertion in `ChunkedArray::from_chunks_and_dtype` (#16697)
- Update version resolver for `1.0.0` release (#16705)
- Avoid AWS pinning to outdated crc32c version (#16681)
Thank you to all our contributors for making this release possible!
@JulianCologne, @KDruzhkin, @Kylea650, @MarcoGorelli, @Mottl, @Object905, @adamreeve, @alexander-beedie, @bertiewooster, @borchero, @c-peters, @coastalwhite, @datapythonista, @datenzauberai, @dependabot, @dependabot[bot], @eitsupi, @henryharbeck, @itamarst, @lukeshingles, @machow, @marenwestermann, @mcrumiller, @montanarograziano, @nameexhaustion, @orlp, @p3i0t, @ritchie46, @sherlockbeard, @stinodego, @tkellogg, @universalmind303 and @wence-