kahhong/cs465-project


Specialised Query Executor for TPC-H Q18

Prerequisites

  • Python 3.8+
  • duckdb
  • pandas
  • pyarrow

Install dependencies with:

pip install -r requirements.txt

Directory Structure

project_root/
├── benchmarks/            # Benchmark strategies
├── data/                  # TPC-H tables in Parquet format
├── engine/
│   ├── duckdb_engine.py
│   └── custom_engine.py
├── results/
│   ├── benchmark/         # Custom engine query results (CSV)
│   └── target/            # DuckDB query results (CSV)
├── init.py                # Data generation script
├── main.py                # Main entry point
└── summary.csv

1. Initialize Project (Generate TPC-H Data)

Run the init command to generate the TPC-H tables in Parquet format and initialize the result directories:

python main.py init

The following files will be created for each of the scale factors 0.5, 1, 2, and 5:

data/sf{0.5, 1, 2, 5}/
├── customer.parquet
├── lineitem.parquet
├── nation.parquet
├── orders.parquet
├── part.parquet
├── partsupp.parquet
├── region.parquet
└── supplier.parquet
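
As a quick sanity check after init, the layout above can be verified with a short script. This is a minimal sketch, not part of the project: the helper names `expected_files` and `missing_files` are hypothetical, and the directory names (`sf0.5`, `sf1`, `sf2`, `sf5`) are an assumption read off the `data/sf{0.5, 1, 2, 5}/` pattern.

```python
from pathlib import Path

# Scale factors and table names as listed in this README.
SCALE_FACTORS = ["0.5", "1", "2", "5"]
TABLES = [
    "customer", "lineitem", "nation", "orders",
    "part", "partsupp", "region", "supplier",
]


def expected_files(data_root="data"):
    """List every Parquet path the init step is expected to create.

    Assumes directories are named sf<scale factor>, e.g. data/sf0.5.
    """
    root = Path(data_root)
    return [
        root / f"sf{sf}" / f"{table}.parquet"
        for sf in SCALE_FACTORS
        for table in TABLES
    ]


def missing_files(data_root="data"):
    """Return the expected Parquet paths that do not exist on disk."""
    return [path for path in expected_files(data_root) if not path.is_file()]
```

Calling `missing_files()` from the project root should return an empty list once `python main.py init` has finished, provided the naming assumption holds.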

2. Run the Benchmark

python main.py benchmark

Optional arguments:

  • --out summary.csv: Output file for the benchmark results (default: summary.csv)
  • --benchmark 5: Number of timed repetitions per scale factor after one warm-up run (default: 5)
  • --strategy <strategy>: Benchmark execution strategy:
    • interweave (default)
    • duckdb_first
    • custom_engine_first
  • --enable_profiling: Enable detailed profiling for the custom engine
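
The three strategy names suggest different orderings of the engine runs. The sketch below shows one plausible reading and is an assumption, not the project's actual code: `interweave` is taken to alternate the two engines on every repetition, while the `*_first` strategies finish all repetitions of one engine before starting the other. The function `build_schedule` and the engine labels `"duckdb"` and `"custom"` are hypothetical.

```python
def build_schedule(strategy, scale_factors, runs):
    """Return an ordered list of (engine, scale_factor, run_index) tuples.

    Assumed semantics (not confirmed by the project source):
      interweave          -> engines alternate on every repetition
      duckdb_first        -> all DuckDB runs, then all custom engine runs
      custom_engine_first -> all custom engine runs, then all DuckDB runs
    """
    grouped_order = {
        "duckdb_first": ["duckdb", "custom"],
        "custom_engine_first": ["custom", "duckdb"],
    }
    schedule = []
    for sf in scale_factors:
        if strategy == "interweave":
            for run in range(runs):
                schedule.append(("duckdb", sf, run))
                schedule.append(("custom", sf, run))
        elif strategy in grouped_order:
            for engine in grouped_order[strategy]:
                for run in range(runs):
                    schedule.append((engine, sf, run))
        else:
            raise ValueError(f"unknown strategy: {strategy}")
    return schedule
```

Under this reading, interleaving spreads cache and system-load effects evenly across both engines, whereas the grouped strategies make it easier to see whether run order changes the measured times.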

This will:

  • Clear previous results in results/benchmark and results/target
  • Run both engines for scale factors 0.5, 1, 2, 5
  • Run one warm-up iteration first, then average the next --benchmark timed runs
  • Save query results as CSVs in the results folders
  • Write benchmark results to summary.csv
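
The warm-up-then-average timing described above can be sketched as follows. This is an illustration under stated assumptions rather than the project's implementation: `time_query` is a hypothetical helper, and `run_query` stands for any zero-argument callable that executes the query on one engine.

```python
import time
from statistics import mean


def time_query(run_query, repetitions=5):
    """Run one untimed warm-up, then return the mean of the timed runs in seconds."""
    run_query()  # warm-up: fills caches and loads data; not included in the average

    timings = []
    for _ in range(repetitions):
        start = time.perf_counter()
        run_query()
        timings.append(time.perf_counter() - start)
    return mean(timings)
```

For example, `time_query(lambda: sum(range(1_000_000)), repetitions=5)` calls the lambda six times in total and averages only the last five.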

3. Data Correctness Check

python main.py check

This will compare all matching CSV files in results/benchmark and results/target and print any mismatches.
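
A comparison of this kind can be sketched with the standard library alone. The code below is an assumption about how such a check might work, not the project's own logic: `compare_results` is a hypothetical name, files are matched by identical file name, and cells are compared as exact strings, so two engines that format the same number differently would be reported as a mismatch.

```python
import csv
from pathlib import Path


def read_rows(path):
    """Read a CSV file into a list of rows, each row a list of strings."""
    with open(path, newline="") as handle:
        return list(csv.reader(handle))


def compare_results(benchmark_dir="results/benchmark", target_dir="results/target"):
    """Return the names of CSV files present in both directories whose rows differ."""
    benchmark_dir = Path(benchmark_dir)
    target_dir = Path(target_dir)
    mismatches = []
    for benchmark_file in sorted(benchmark_dir.glob("*.csv")):
        target_file = target_dir / benchmark_file.name
        if not target_file.is_file():
            continue  # only files that exist on both sides are compared
        if read_rows(benchmark_file) != read_rows(target_file):
            mismatches.append(benchmark_file.name)
    return mismatches
```

An empty list means every matching pair agreed row for row. A tolerance-based comparison of numeric columns would be a reasonable refinement if the engines round differently.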

About

A CS465 (Advanced Database) project that compares the performance of a custom query engine against DuckDB on TPC-H-like datasets, with benchmark scripts, engines, and results.
