- Python 3.8+
- duckdb
- pandas
- pyarrow
Install dependencies with:
pip install -r requirements.txt

project_root/
├── benchmarks/ # Benchmark strategies
├── data/ # TPC-H tables in Parquet format
├── engine/
│ ├── duckdb_engine.py
│ └── custom_engine.py
├── results/
│ ├── benchmark/ # Custom engine query results (CSV)
│ └── target/ # DuckDB query results (CSV)
├── init.py # Data generation script
├── main.py # Main entry point
└── summary.csv
Run the init command to generate the TPC-H tables in Parquet format and initialize the result directories:

python main.py init

The following files will be created for each scale factor in [0.5, 1, 2, 5]:
data/sf{0.5, 1, 2, 5}/
├── customer.parquet
├── lineitem.parquet
├── nation.parquet
├── orders.parquet
├── part.parquet
├── partsupp.parquet
├── region.parquet
└── supplier.parquet
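
To confirm that `init` produced every table, the layout above can be checked with a few lines of Python. This is a minimal sketch, not part of the project: the helper names are hypothetical, and it assumes that `data/sf{0.5, 1, 2, 5}/` expands to directories named `data/sf0.5/`, `data/sf1/`, `data/sf2/`, and `data/sf5/`.

```python
from pathlib import Path

# Scale factors and table names as listed in this README.
# ASSUMPTION: each scale factor lives in a directory named "sf<factor>".
SCALE_FACTORS = [0.5, 1, 2, 5]
TPCH_TABLES = [
    "customer", "lineitem", "nation", "orders",
    "part", "partsupp", "region", "supplier",
]


def expected_parquet_files(data_root="data"):
    """Return every Parquet path that the init step is expected to create."""
    root = Path(data_root)
    return [
        root / f"sf{sf}" / f"{table}.parquet"
        for sf in SCALE_FACTORS
        for table in TPCH_TABLES
    ]


def missing_parquet_files(data_root="data"):
    """Return the expected paths that do not exist on disk."""
    return [p for p in expected_parquet_files(data_root) if not p.is_file()]
```

Calling `missing_parquet_files()` from the project root after `python main.py init` should return an empty list if the directory naming assumption holds.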
python main.py benchmark

- --out summary.csv: Output file for the benchmark results (default: summary.csv)
- --benchmark 5: Number of timed repetitions per scale factor after one warm-up run (default: 5)
- --strategy <strategy>: Benchmark execution strategy:
  - interweave (default)
  - duckdb_first
  - custom_engine_first
- --enable_profiling: Enable detailed profiling for the custom engine
This will:
- Clear previous results in results/benchmark and results/target
- Run both engines for scale factors 0.5, 1, 2, 5
- Run one warm-up iteration first, then average the next --benchmark timed runs
- Save query results as CSVs in the results folders
- Write benchmark results to summary.csv
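
The warm-up-then-average timing scheme described above can be sketched as follows. This is an illustration under assumptions, not the project's actual code: `time_query` and `run_query` are hypothetical names, and the real benchmark may measure time differently.

```python
import time
from statistics import mean


def time_query(run_query, repetitions=5):
    """Run one untimed warm-up, then return the mean of `repetitions` timed runs.

    `run_query` is a hypothetical zero-argument callable that executes a query
    on one engine; `repetitions` plays the role of the --benchmark option.
    """
    run_query()  # warm-up iteration: executed but excluded from the average

    timings = []
    for _ in range(repetitions):
        start = time.perf_counter()
        run_query()
        timings.append(time.perf_counter() - start)

    return mean(timings)
```

Discarding the first run keeps one-off costs, such as cold file caches, out of the reported average.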
python main.py check

This will compare all matching CSV files in results/benchmark and results/target and print any mismatches.
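
A comparison of this kind might look like the sketch below. It is an assumption about the behavior, not the implementation of `check`: the function names are hypothetical, files are matched by their path relative to each results folder, and rows are compared as exact strings, so a difference in float formatting would count as a mismatch.

```python
import csv
from pathlib import Path


def read_rows(path):
    """Load a CSV file as a list of rows (each row is a list of strings)."""
    with open(path, newline="") as f:
        return list(csv.reader(f))


def find_mismatches(benchmark_dir="results/benchmark", target_dir="results/target"):
    """Return relative paths of CSV files present in both folders whose rows differ."""
    benchmark_dir, target_dir = Path(benchmark_dir), Path(target_dir)
    mismatches = []
    for bench_file in sorted(benchmark_dir.rglob("*.csv")):
        relative = bench_file.relative_to(benchmark_dir)
        target_file = target_dir / relative
        if not target_file.is_file():
            continue  # only files that exist on both sides are compared
        if read_rows(bench_file) != read_rows(target_file):
            mismatches.append(relative.as_posix())
    return mismatches
```

An empty return value means every matching pair of files agreed row for row.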