This directory contains the unified evaluation suite for DAComp-DE (Data Engineering) tasks. It is designed to automatically score DuckDB databases produced by candidate data pipelines. The primary entry point is evaluate.py, with core logic implemented in utils.py.
- `evaluate.py`: The command-line entry point, supporting both single-task and batch evaluation.
- `utils.py`: Handles threshold validation, core consistency checks, and layered/mixed scoring logic.
- `evaluation_config_compare.yaml`: The evaluation configuration file defining database filenames, layer hierarchies (e.g., `staging`, `intermediate`, `marts`), table definitions, and scoring weights for each example.
- `gold/`: The directory for ground-truth (Gold) data. It stores the reference DuckDB database files organized by example ID (e.g., `gold/dacomp-de-impl-001/lever.duckdb`).
- `README.md`: This documentation.
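For orientation, one entry of the configuration file might look roughly like the sketch below. This is a hypothetical illustration only: apart from `examples.<example_id>.database_file` and the layer names mentioned above, every key is an assumption, so consult the shipped `evaluation_config_compare.yaml` for the real schema.

```yaml
# Hypothetical sketch -- key names other than database_file and the layer
# names are assumptions; see evaluation_config_compare.yaml for the real schema.
examples:
  dacomp-de-impl-001:
    database_file: lever.duckdb
    layers:
      - staging
      - intermediate
      - marts
```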
We recommend using Python 3.9+. Install the required dependencies using the following command:
```bash
pip install duckdb pandas numpy pyyaml
```

The evaluation script expects the Prediction Directory (`pred_dir`) to follow the structure below:
```
<pred_dir>/
├── dacomp-de-impl-001/
│   ├── run.py
│   ├── lever.duckdb   # Generated by run.py
│   ├── config/...
│   └── sql/...
├── dacomp-de-impl-002/
│   ├── run.py
│   └── pendo.duckdb
└── ...
```
Key Requirements:
- Each example must have its own subdirectory named after its `example_id` (e.g., `dacomp-de-impl-001`).
- Each subdirectory must contain an executable `run.py` script responsible for building the corresponding DuckDB database file (e.g., `lever.duckdb`, `pendo.duckdb`).
- The expected database filename is defined in `evaluation_config_compare.yaml` under `examples.<example_id>.database_file`.
Evaluation Process:
1. Execution: The script runs `pred_dir/<example_id>/run.py` to generate or update the DuckDB database. (Use `--force-rebuild` to delete existing databases and force a fresh run.)
2. Comparison: The script compares the generated database at `pred_dir/<example_id>/<database_file>` against the ground truth at `gold/<config_id>/<database_file>`.

Note: The `config_id` used in the Gold directory is derived from the configuration file. It is usually identical to the `example_id` in the prediction directory, but the two are associated via prefix matching.
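The association could be sketched as below. Note that this is an illustrative guess at the prefix-matching behavior, not the actual logic in `utils.py`:

```python
# Illustrative sketch of prefix matching between a prediction directory name
# (example_id) and a gold config_id -- an assumption about the behavior, not
# the actual implementation in utils.py.
def match_config_id(example_id, config_ids):
    # Prefer the longest config_id that is a prefix of the example_id.
    for cid in sorted(config_ids, key=len, reverse=True):
        if example_id.startswith(cid):
            return cid
    return None

print(match_config_id("dacomp-de-impl-001", ["dacomp-de-impl-001", "dacomp-de-arch-001"]))
```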
The `gold_dir` should be structured as follows:

```
<gold_dir>/
├── dacomp-de-impl-001/
│   └── lever.duckdb
├── dacomp-de-impl-002/
│   └── pendo.duckdb
└── ...
```
- Subdirectory names usually correspond to the example IDs in the configuration file.
- Each subdirectory contains the ground-truth DuckDB database file, as specified by the `database_file` field in the configuration.
- `cfs` (default): Evaluates the entire pipeline, layer by layer and table by table, on the generated DuckDB database. This checks the overall effectiveness of the pipeline.
- `cs`: Performs a step-by-step correctness check by running your SQL within the Gold environment, table by table. This is useful for debugging specific table errors.
  - Warning: This mode requires significant memory. We recommend allocating at least 16 GB of RAM.
All commands should be executed from the `dacomp-de/evaluation_suite` directory:

```bash
cd dacomp-de/evaluation_suite
```

To evaluate a specific example:
```bash
python evaluate.py single \
  --pred_dir /path/to/pred_dir \
  --example_id dacomp-de-impl-001 \
  --gold_dir ./gold \
  --config evaluation_config_compare.yaml \
  --mode cfs \
  --force-rebuild \
  --output result_impl_001.json
```

Parameters:
- `--pred_dir`: The root directory containing prediction subdirectories.
- `--gold_dir`: The root directory for ground-truth data (default: `./gold`).
- `--config`: The evaluation configuration file (default: `evaluation_config_compare.yaml`).
- `--example_id`: The specific example directory name within `pred_dir` to evaluate.
- `--output`: (Optional) File path to save the full JSON result. If omitted, key results are printed to stdout.
- `--force-rebuild`: Forces deletion of existing DuckDB files and re-runs `run.py`.
- `--mode`: `cfs` (default) or `cs`.
Output:
- Terminal: Displays `Final Score: xx.xx` and the `Evaluation Level` (e.g., `core_perfect_match`, `partial_evaluation`).
- JSON file (if `--output` is set): Contains detailed metrics including `threshold_evaluation` (schema checks), `core_accuracy_evaluation`, `partial_evaluation` (layer-wise scores), and the final summary.
Batch mode evaluates multiple examples within a `pred_dir` and generates a summary report. If `--examples` is not provided, the script scans `pred_dir` for all subdirectories matching the configuration IDs:
```bash
python evaluate.py batch \
  --pred_dir /path/to/pred_dir \
  --gold_dir ./gold \
  --config evaluation_config_compare.yaml \
  --mode cfs \
  --force-rebuild \
  --output_dir ./results
```

Results Structure:
The script generates a directory structure in `output_dir` based on parsed metadata:

```
./results/<model>/<task>/<param_tag>/<mode>/
├── summary.json
└── scores.csv
```
- `model` / `param_tag`: Parsed from the `pred_dir` path name (e.g., `.../my-model-temp-0.2`).
- `task`: The inferred task category (e.g., `impl`, `arch`, `evol`, `mixed`).
- `summary.json`: Comprehensive statistics and detailed results for every example.
- `scores.csv`: A simplified list of tasks and scores, including the overall average.
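Downstream analysis can read `scores.csv` directly. The sketch below assumes `example_id` and `score` column names, which may differ from the actual header, so check the generated file first:

```python
# Hypothetical post-processing of scores.csv. The column names example_id and
# score are assumptions -- check the header of the actual generated CSV.
import pandas as pd
from io import StringIO

# Stand-in for ./results/<model>/<task>/<param_tag>/<mode>/scores.csv
csv_text = "example_id,score\ndacomp-de-impl-001,92.0\ndacomp-de-impl-002,75.5\n"
df = pd.read_csv(StringIO(csv_text))
print(round(df["score"].mean(), 2))  # -> 83.75
```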
You can explicitly restrict which examples are evaluated using the `--examples` flag:
```bash
python evaluate.py batch \
  --pred_dir /path/to/pred_dir \
  --gold_dir ./gold \
  --config evaluation_config_compare.yaml \
  --mode cs \
  --output_dir ./results_cs \
  --examples dacomp-de-impl-001 dacomp-de-impl-002
```