PyDuck is a high-performance Python library for ML data preprocessing, built on DuckDB with a Pandas-like API. It provides fast, multi-threaded, out-of-core processing, so large datasets can be handled efficiently. PyDuck accelerates ML workflows by optimizing queries while integrating seamlessly with Pandas and other data tools.
To install:

```shell
pip install pyduck
```
To clone and install the development version:

```shell
git clone https://github.com/your-username/pyduck.git
cd pyduck
pip install -e .
pip install -r requirements.txt
```
To run the testing suite (from the outer PyDuck directory):

```shell
python3 -m pytest testing/
```
To get started with basic operations:

```python
import duckdb

from pyduck.quack import Quack

# Connect to DuckDB and load a table
q = Quack("customer", conn=duckdb.connect("tpch.duckdb"))

# Filter, group, and aggregate
result = (
    q.filter("c_acctbal > 1000")
    [["c_mktsegment", "c_acctbal"]]
    .groupby("c_mktsegment")
    .agg({"c_acctbal": "mean"})
    .to_df()
)
print(result)
```

For a visual walkthrough of PyDuck's system architecture and performance benchmarks, refer to the final presentation slides here: https://docs.google.com/presentation/d/1SlYmPqAVnjJ9Cac_rlO5bipi_EX7rtRQAzGWK086Duo/edit?usp=sharing
All user operations begin with the Quack class in quack.py. A Quack is a dataframe-like object that acts as a chainable, immutable wrapper over a DuckDB table; it can be thought of as a virtual table.
Each method (e.g., filter(), groupby(), agg()) appends a new operation to an internal list and returns a new Quack object.
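The chainable, immutable pattern can be sketched in a few lines. This is illustrative only, not PyDuck's actual implementation:

```python
# Minimal sketch of the chainable, immutable wrapper pattern described
# above. Every method copies the operation list and returns a new object.
class MiniQuack:
    def __init__(self, table, operations=None):
        self.table = table
        # Pending operations are stored, never executed eagerly.
        self.operations = list(operations or [])

    def _with_op(self, op, value):
        # Return a NEW object with the operation appended; self is untouched.
        return MiniQuack(self.table, self.operations + [(op, value)])

    def filter(self, condition):
        return self._with_op("filter", condition)

    def groupby(self, cols):
        return self._with_op("groupby", cols)


q1 = MiniQuack("customer")
q2 = q1.filter("c_acctbal > 1000")
print(q1.operations)  # [] -- the original object is unchanged
print(q2.operations)  # [('filter', 'c_acctbal > 1000')]
```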
Key methods in quack.py:

- `filter(condition)` → adds a WHERE clause
- `assign(**kwargs)` → adds computed columns
- `groupby(cols)` + `agg(dict)` → performs grouped aggregation
- `fillna(...)`, `dropna(...)`, `isna(...)` → handle missing values
- `sample(...)` → random sampling
- `merge(...)` → SQL joins between Quacks
- `to_df()` → triggers SQL compilation and returns a Pandas DataFrame
- `to_sql()` → generates SQL via SQLCompiler
- `debug()` → prints the current operation chain
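As one example of the SQL these methods map to, a merge between two Quacks can compile to a JOIN over the two subqueries. The helper below is a hypothetical sketch written only to show the shape of the generated SQL, not PyDuck's API:

```python
# Hypothetical helper showing how a merge between two query strings
# could be rendered as a SQL JOIN (illustration only).
def merge_sql(left, right, on, how="inner"):
    # Wrap both sides as subqueries and join on the shared key column.
    return (
        f"SELECT * FROM ({left}) AS l "
        f"{how.upper()} JOIN ({right}) AS r ON l.{on} = r.{on}"
    )


sql = merge_sql("SELECT * FROM customer", "SELECT * FROM orders", on="c_custkey")
print(sql)
```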
The SQLCompiler class translates the chain of Quack operations into a valid SQL query. It uses apply_operation(...) from the operations/ directory.
```python
# compiler.py
from .operations import apply_operation

class SQLCompiler:
    def compile(self):
        # Start from the base SELECT over the source table
        # (attribute name assumed here for illustration).
        query = self.base_query
        for op, val in self.operations:
            query = apply_operation(query, op, val, ...)
        return query
```
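A minimal sketch of how apply_operation(...) might rewrite the query string one operation at a time. The argument list is simplified and the dispatch is inlined here; in PyDuck the per-operation logic lives in the operations/ directory:

```python
# Illustrative per-operation SQL rewriting -- not PyDuck's actual code.
def apply_operation(query, op, value):
    if op == "filter":
        # Wrap the current query and attach a WHERE clause.
        return f"SELECT * FROM ({query}) WHERE {value}"
    if op == "groupby":
        # Group the wrapped query by the given columns.
        cols = ", ".join(value)
        return f"SELECT {cols} FROM ({query}) GROUP BY {cols}"
    raise ValueError(f"unsupported operation: {op}")


query = "SELECT * FROM customer"
query = apply_operation(query, "filter", "c_acctbal > 1000")
query = apply_operation(query, "groupby", ["c_mktsegment"])
print(query)
```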
Each operation is implemented in its own file inside the operations/ directory:
filter.py, dropna.py, fillna.py, groupby.py, etc.
Each contains an apply() function that defines how to transform the query string.
- filter.py → applies a WHERE clause
- groupby.py → injects GROUP BY SQL syntax
- drop_duplicates.py → applies a ROW_NUMBER() trick
- fillna.py → uses COALESCE() or CASE depending on the value
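For instance, a fillna over a single column can compile to a COALESCE expression. The helper below is a hypothetical sketch of that rewrite, not the real fillna.py:

```python
# Hypothetical sketch: replace NULLs in one column with a literal
# default via COALESCE (illustration of the rewrite, not PyDuck's code).
def fillna_sql(table, column, value):
    return f"SELECT COALESCE({column}, {value!r}) AS {column} FROM {table}"


sql = fillna_sql("customer", "c_comment", "unknown")
print(sql)
```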
- Lazy execution: nothing runs until .to_df() or .execute() is called
- Chainable & immutable: each operation returns a new Quack
- SQL transparency: the final SQL is always inspectable via .to_sql()