
Releases: ranaroussi/pystore

1.0.1

22 Jul 10:25
80af5fe


What's Changed

Full Changelog: 1.0.0...1.0.1

1.0.0

20 Jul 23:34
b52428e


Change Log

1.0.0 (2025 Release)

Major Release with Performance Optimizations and Advanced Features

Modernization & Dependencies:

  • Updated to Python 3.8+ only (dropped Python 2.7/3.5/3.6/3.7 support)
  • Migrated from deprecated Fastparquet to PyArrow as the sole Parquet engine
  • Updated all dependencies to modern versions (Pandas 2.0+, PyArrow 10.0+, Dask 2023.1+)
  • Fixed all pandas deprecation warnings (replaced the deprecated is_*_dtype helpers with isinstance checks; see the sketch after this list)
  • Removed legacy code and deprecated features
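
For reference, a minimal sketch of the dtype-check pattern mentioned in the list above: pandas 2.x deprecates several is_*_dtype helpers, and the fix is to check the dtype object directly with isinstance. This is illustrative only; the actual call sites inside pystore may differ.

```python
import pandas as pd

def describe_dtype(series: pd.Series) -> str:
    """Classify a Series dtype without the deprecated pd.api.types.is_*_dtype helpers."""
    dtype = series.dtype
    if isinstance(dtype, pd.CategoricalDtype):    # was: is_categorical_dtype(series)
        return "categorical"
    if isinstance(dtype, pd.DatetimeTZDtype):     # was: is_datetime64tz_dtype(series)
        return "tz-aware datetime"
    if isinstance(dtype, pd.PeriodDtype):         # was: is_period_dtype(series)
        return "period"
    if isinstance(dtype, pd.IntervalDtype):       # was: is_interval_dtype(series)
        return "interval"
    return str(dtype)
```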

New Features - Data Types & Storage:

  • MultiIndex Support: Full support for storing and retrieving DataFrames with pandas MultiIndex (see the example after this list)
  • Complex Data Types: Support for Timedelta, Period, Interval, and Categorical dtypes
  • Nested Objects: Support for storing lists, dicts, and nested DataFrames as columns
  • Timezone-Aware Operations: Proper handling of timezone data with UTC storage
  • Schema Evolution: Flexible strategies for handling schema changes over time
  • Data Validation Framework: Extensible validation rules with built-in validators
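
As a hedged illustration of the MultiIndex and complex-dtype support, the sketch below writes and reads back a DataFrame with a MultiIndex plus Categorical and Timedelta columns using the existing store/collection/item API; the store, collection, and item names are placeholders.

```python
import pandas as pd
import pystore

pystore.set_path("~/pystore")            # where the datastore lives on disk
store = pystore.store("demo_datastore")
collection = store.collection("demo")

# DataFrame with a MultiIndex plus Categorical and Timedelta columns
idx = pd.MultiIndex.from_product(
    [pd.date_range("2025-01-01", periods=3, tz="UTC"), ["AAPL", "MSFT"]],
    names=["timestamp", "symbol"],
)
df = pd.DataFrame(
    {
        "price": [100.0, 250.0, 101.5, 251.2, 99.8, 249.9],
        "sector": pd.Categorical(["tech"] * 6),
        "holding_period": pd.to_timedelta([1, 2, 3, 4, 5, 6], unit="D"),
    },
    index=idx,
)

collection.write("PORTFOLIO", df, metadata={"source": "example"}, overwrite=True)
restored = collection.item("PORTFOLIO").to_pandas()
print(restored.index.names)   # expected: ['timestamp', 'symbol']
print(restored.dtypes)
```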

New Features - API & Operations:

  • Async/Await Support: Non-blocking I/O operations via async_pystore
  • Transaction Support: Atomic operations with rollback capabilities
  • Context Managers: Transaction and batch operation context managers (see the sketch after this list)
  • Validation Hooks: Set validators at collection level with custom rules
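
A hedged sketch of how the new transaction support might be used, based only on the names listed above; the import path and the argument passed to transaction() are assumptions and may differ from the released API.

```python
import pandas as pd
import pystore

store = pystore.store("demo_datastore")
collection = store.collection("demo")
df = pd.DataFrame(
    {"price": [1.0, 2.0]},
    index=pd.date_range("2025-01-01", periods=2, tz="UTC"),
)

# Assumed usage: transaction() groups writes so a failure rolls all of them back.
# Whether it takes the collection, the store, or no argument is an assumption here.
with pystore.transaction(collection):
    collection.write("ITEM_A", df, overwrite=True)
    collection.write("ITEM_B", df * 2, overwrite=True)
    # Any exception raised inside this block should undo both writes.
```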

Performance Optimizations:

  • Streaming Operations: Memory-efficient append for datasets larger than RAM (90% memory reduction)
  • Batch Operations: 5-10x faster parallel read/write for multiple items
  • Intelligent Partitioning: Automatic time-based and size-based partitioning
  • Memory Management: 70% memory reduction with automatic DataFrame optimization
  • Metadata Caching: 100x faster metadata access with TTL cache
  • Query Optimization: Column selection and predicate pushdown at storage level (see the sketch below)
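
To make the query-optimization point concrete, here is a sketch of reading only selected columns with a predicate pushed down to the Parquet layer; passing columns and PyArrow-style filters to collection.item() is an assumption about the exact released signature.

```python
import pystore

store = pystore.store("demo_datastore")
collection = store.collection("demo")

# Read only the 'price' column and push the symbol predicate down to storage,
# so row groups that do not match are never loaded into memory.
item = collection.item(
    "PORTFOLIO",
    columns=["price"],
    filters=[("symbol", "==", "AAPL")],  # PyArrow-style predicate (assumed format)
)
df = item.to_pandas()
```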

Bug Fixes & Improvements:

  • Fixed append method to properly handle duplicates and schema evolution
  • Fixed MultiIndex dtype preservation during storage operations
  • Fixed timezone handling to ensure consistency across operations
  • Fixed Period dtype frequency conversion issues (ME -> M)
  • Fixed nested object serialization with proper null handling
  • Improved error messages and validation throughout
  • Added comprehensive test coverage for all new features

API Additions:

  • collection.append_stream() - Streaming append for large datasets (see the usage sketch after this list)
  • collection.write_batch() - Parallel write of multiple items
  • collection.read_batch() - Efficient read of multiple items
  • collection.set_validator() - Set data validation rules
  • collection.enable_schema_evolution() - Enable flexible schema handling
  • async_pystore.store() - Async store context manager
  • transaction() - Single transaction context manager
  • batch_transaction() - Batch operation context manager
  • Memory management utilities in pystore.memory module
  • Partitioning utilities in pystore.partition module
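
A hedged sketch of how a few of these additions might fit together; the names come from the list above, but the exact signatures (argument names, return types) are assumptions.

```python
import pandas as pd
import pystore

store = pystore.store("demo_datastore")
collection = store.collection("demo")

frames = {
    "AAPL": pd.DataFrame({"close": [101.0, 102.5]},
                         index=pd.date_range("2025-01-01", periods=2, tz="UTC")),
    "MSFT": pd.DataFrame({"close": [250.0, 251.2]},
                         index=pd.date_range("2025-01-01", periods=2, tz="UTC")),
}

# Assumed signatures for the new batch helpers listed above.
collection.write_batch(frames, overwrite=True)      # parallel write of multiple items
items = collection.read_batch(["AAPL", "MSFT"])     # parallel read of multiple items

# Assumed validator hook: reject frames containing NaNs before they are written.
collection.set_validator(lambda df: not df.isnull().any().any())

# Assumed streaming append: feed an iterator of chunks without loading everything into RAM.
def chunks():
    for day in pd.date_range("2025-01-03", periods=5, tz="UTC"):
        yield pd.DataFrame({"close": [100.0]}, index=[day])

collection.append_stream("AAPL", chunks())
```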

Breaking Changes:

  • Removed Python 2.7 and Python < 3.8 support
  • Removed Fastparquet support (PyArrow only)
  • Changed some internal APIs for better consistency

Full Changelog: 0.1.24...1.0.0

0.1.24

10 Jul 10:12
db73c64


Thanks to @gnzsnz!

  • Deprecated Fastparquet
  • Enabled PyArrow as the default engine

NOTE: Existing stores created with fastparquet will require a manual update to metadata.json.
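
A minimal sketch of what such a manual update could look like, assuming each item's metadata.json carries an engine-related field; the key name ("engine" here), the expected values, and the on-disk layout are assumptions, so inspect an actual metadata.json before scripting anything like this.

```python
import json
from pathlib import Path

STORE_PATH = Path("~/pystore/demo_datastore").expanduser()   # hypothetical store location

# Walk every item's metadata.json and rewrite a hypothetical 'engine' field.
for meta_file in STORE_PATH.rglob("metadata.json"):
    meta = json.loads(meta_file.read_text())
    if meta.get("engine") == "fastparquet":      # assumed key/value pair
        meta["engine"] = "pyarrow"
        meta_file.write_text(json.dumps(meta))
```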

0.1.23

11 Feb 11:36
f3e94d4


  • Fixed deprecated 'in' operator usage for compatibility with pandas 1.2.0 onwards (PR #58)
  • Added an argument to append() to control duplicate handling (PR #57)