
Release 1.0.1


Released by github-actions on 16 May 13:09 · ce519eb

ParquetDB 1.0.1 Release Notes

We are thrilled to announce the 1.0.1 release of ParquetDB!

ParquetDB is a Python library designed to be a "middleware" solution, effectively bridging the gap between simple file-based storage (like CSV, JSON) and more complex, full-fledged database systems. It leverages the power and efficiency of Apache Parquet files while providing a user-friendly, database-like interface for managing and querying your data.

This release marks a significant milestone in providing a robust and streamlined solution for researchers and developers working with evolving, complex, and nested datasets, particularly in environments where traditional databases are overkill or impractical, such as HPC clusters with limited connectivity.

Why ParquetDB?

In many research and development workflows, data storage needs fall into a challenging middle ground:

  • Traditional file formats (CSV, JSON) are simple but inefficient for numerical data, lack querying capabilities, and struggle with schema evolution and complex data types.
  • Binary formats like HDF5 are more efficient for numerical data but act more like structured file containers, lacking rich querying APIs and easy management of data relationships.
  • Full database systems (SQL or NoSQL) offer robust features but can be overly complex to set up and manage, introduce rigidity in schema management (SQL), or present consistency challenges (some NoSQL). They often require server configurations, making them less suitable for lightweight experimentation or "classically serverless" deployments.
  • Directly using libraries like PyArrow with Parquet files provides efficiency but requires significant boilerplate for database-like operations (CRUD), schema consistency, and handling complex Python objects.

ParquetDB was born out of the need to address these limitations, specifically for iterative research workflows requiring:

  • Schema Evolvability: Seamlessly adapt your data schema over time without upfront rigidity.
  • Complex Nested Data Structures: Natively handle and manage intricate, evolving nested data.
  • Table and Field-Level Metadata: Easily manage metadata associated with your datasets.
  • "Classically Serverless" Operation: Ideal for environments like HPC clusters with no reliance on network-connected database servers.
  • Performance: Efficient data storage, retrieval, and querying.

Key Features in 1.0.1:

  • Simple, Database-like Interface: Intuitive methods for create, read, update, and delete operations.
  • Leverages Apache Parquet: Benefits from columnar storage for efficient compression and read performance.
  • Minimal Overhead: Achieves competitive read/write speeds without the complexity of traditional database setup.
  • Handles Complex Data Types: Natively supports nested structures, arrays, and even Python objects (via pickling).
  • Schema Evolution: Add new fields and update schemas without hassle.
  • Efficient Querying: Utilizes predicate pushdown for optimized data retrieval.
  • Normalization: Tools to balance data distribution across files for consistent performance.
  • Batching: Efficiently process large datasets.
  • Pandas DataFrame Integration: Easily add data from and read data into Pandas DataFrames.

Performance Highlights:

Our benchmarks demonstrate that ParquetDB offers competitive performance:

  • Write Performance: Competitive creation times, performing well against SQLite as dataset sizes increase.
  • Read Performance: While initial reads on very small datasets might be comparable, ParquetDB significantly outperforms competitors like SQLite and MongoDB on larger datasets for bulk read operations, showcasing the efficiency of the underlying columnar Parquet format.
  • Query Performance: Effectively uses predicate pushdown with Parquet's field-level statistics for efficient filtering, excelling when querying or returning substantial portions of wide datasets.

(For detailed benchmarks and comparisons, please refer to our documentation and forthcoming paper: Lang, ParquetDB: A Lightweight Python Library for Serverless Management of Complex, Evolving Datasets Using Apache Parquet, 2025).

When to Choose ParquetDB:

ParquetDB shines when you're dealing with:

  • Complex and deeply nested data.
  • Schemas that are expected to evolve over time.
  • The need for a serverless solution that manages collections of Parquet files as a coherent, updatable dataset.
  • Scenarios where full database systems are too heavy, but basic file I/O is insufficient.

If your data is simple, flat, and has a stable schema, tools like DuckDB or direct Parquet file management with Pandas/PyArrow might be sufficient. However, ParquetDB offers a streamlined approach for more intricate data management challenges.