This directory contains the unified evaluation suite for DAComp-DE (Data Engineering) tasks. It is designed to automatically score DuckDB databases produced by candidate data pipelines. The primary entry point is evaluate.py, with core logic implemented in utils.py.
- `evaluate.py`: The command-line entry point, supporting both single-task and batch evaluation.
- `utils.py`: Handles threshold validation, core consistency checks, and layered/mixed scoring logic.
- `evaluation_config_compare.yaml`: The evaluation configuration file defining database filenames, layer hierarchies (e.g., `staging`, `intermediate`, `marts`), table definitions, and scoring weights for each example.
- `gold/`: The directory for ground-truth (Gold) data. It stores the reference DuckDB database files organized by example ID (e.g., `gold/dacomp-de-impl-001/lever.duckdb`).
- `README.md`: This documentation.
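For orientation, one entry of the configuration file might look roughly like the sketch below. This is a hypothetical illustration only: apart from `examples.<example_id>.database_file` and the layer names mentioned above, every key is an assumption, so consult the shipped `evaluation_config_compare.yaml` for the real schema.

```yaml
# Hypothetical sketch -- key names other than database_file and the layer
# names are assumptions; see evaluation_config_compare.yaml for the real schema.
examples:
  dacomp-de-impl-001:
    database_file: lever.duckdb
    layers:
      - staging
      - intermediate
      - marts
```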
We recommend using Python 3.9+. Install the required dependencies using the following command:
```bash
pip install duckdb pandas numpy pyyaml
```

The evaluation script expects the Prediction Directory (`pred_dir`) to follow the structure below:
```
<pred_dir>/
├── dacomp-de-impl-001/
│   ├── run.py
│   ├── lever.duckdb   # Generated by run.py
│   ├── config/...
│   └── sql/...
├── dacomp-de-impl-002/
│   ├── run.py
│   └── pendo.duckdb
└── ...
```
Key Requirements:
- Each example must have its own subdirectory named after its `example_id` (e.g., `dacomp-de-impl-001`).
- Each subdirectory must contain an executable `run.py` script responsible for building the corresponding DuckDB database file (e.g., `lever.duckdb`, `pendo.duckdb`).
- The expected database filename is defined in `evaluation_config_compare.yaml` under `examples.<example_id>.database_file`.
Evaluation Process:
1. Execution: The script runs `pred_dir/<example_id>/run.py` to generate or update the DuckDB database. (Use `--force-rebuild` to delete existing databases and force a fresh run.)
2. Comparison: The script compares the generated database at `pred_dir/<example_id>/<database_file>` against the ground truth at `gold/<config_id>/<database_file>`.

Note: The `config_id` used in the Gold directory is derived from the configuration file. It is usually identical to the `example_id` in the prediction directory, but the two are associated via prefix matching.
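The association could be sketched as below. Note that this is an illustrative guess at the prefix-matching behavior, not the actual logic in `utils.py`:

```python
# Illustrative sketch of prefix matching between a prediction directory name
# (example_id) and a gold config_id -- an assumption about the behavior, not
# the actual implementation in utils.py.
def match_config_id(example_id, config_ids):
    # Prefer the longest config_id that is a prefix of the example_id.
    for cid in sorted(config_ids, key=len, reverse=True):
        if example_id.startswith(cid):
            return cid
    return None

print(match_config_id("dacomp-de-impl-001", ["dacomp-de-impl-001", "dacomp-de-arch-001"]))
```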
The `gold_dir` should be structured as follows:

```
<gold_dir>/
├── dacomp-de-impl-001/
│   └── lever.duckdb
├── dacomp-de-impl-002/
│   └── pendo.duckdb
└── ...
```
- Subdirectory names usually correspond to the example IDs in the configuration file.
- Each subdirectory contains the ground-truth DuckDB database file, as specified by the `database_file` field in the configuration.
- `cfs` (default): Evaluates the entire pipeline, layer by layer and table by table, on the generated DuckDB database. This checks the overall effectiveness of the pipeline.
- `cs`: Performs a step-by-step correctness check by running your SQL within the Gold environment, table by table. This is useful for debugging specific table errors.
  - Warning: This mode requires significant memory. We recommend allocating at least 16 GB of RAM.
All commands should be executed from the `dacomp-de/evaluation_suite` directory:

```bash
cd dacomp-de/evaluation_suite
```

To evaluate a specific example:
```bash
python evaluate.py single \
  --pred_dir /path/to/pred_dir \
  --example_id dacomp-de-impl-001 \
  --gold_dir ./gold \
  --config evaluation_config_compare.yaml \
  --mode cfs \
  --force-rebuild \
  --output result_impl_001.json
```

Parameters:
- `--pred_dir`: The root directory containing prediction subdirectories.
- `--gold_dir`: The root directory for ground-truth data (default: `./gold`).
- `--config`: The evaluation configuration file (default: `evaluation_config_compare.yaml`).
- `--example_id`: The specific example directory name within `pred_dir` to evaluate.
- `--output`: (Optional) File path to save the full JSON result. If omitted, key results are printed to stdout.
- `--force-rebuild`: Forces deletion of existing DuckDB files and re-runs `run.py`.
- `--mode`: `cfs` (default) or `cs`.
Output:
- Terminal: Displays `Final Score: xx.xx` and the `Evaluation Level` (e.g., `core_perfect_match`, `partial_evaluation`).
- JSON file (if `--output` is set): Contains detailed metrics including `threshold_evaluation` (schema checks), `core_accuracy_evaluation`, `partial_evaluation` (layer-wise scores), and the final summary.
Batch mode evaluates multiple examples within a `pred_dir` and generates a summary report. If `--examples` is not provided, the script scans `pred_dir` for all subdirectories matching the configuration IDs:
```bash
python evaluate.py batch \
  --pred_dir /path/to/pred_dir \
  --gold_dir ./gold \
  --config evaluation_config_compare.yaml \
  --mode cfs \
  --force-rebuild \
  --output_dir ./results
```

Results Structure:
The script generates a directory structure in `output_dir` based on parsed metadata:

```
./results/<model>/<task>/<param_tag>/<mode>/
├── summary.json
└── scores.csv
```
- `model` / `param_tag`: Parsed from the `pred_dir` path name (e.g., `.../my-model-temp-0.2`).
- `task`: The inferred task category (e.g., `impl`, `arch`, `evol`, `mixed`).
- `summary.json`: Comprehensive statistics and detailed results for every example.
- `scores.csv`: A simplified list of tasks and scores, including the overall average.
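Downstream analysis can read `scores.csv` directly. The sketch below assumes `example_id` and `score` column names, which may differ from the actual header, so check the generated file first:

```python
# Hypothetical post-processing of scores.csv. The column names example_id and
# score are assumptions -- check the header of the actual generated CSV.
import pandas as pd
from io import StringIO

# Stand-in for ./results/<model>/<task>/<param_tag>/<mode>/scores.csv
csv_text = "example_id,score\ndacomp-de-impl-001,92.0\ndacomp-de-impl-002,75.5\n"
df = pd.read_csv(StringIO(csv_text))
print(round(df["score"].mean(), 2))  # -> 83.75
```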
You can explicitly restrict which examples are evaluated using the `--examples` flag:
```bash
python evaluate.py batch \
  --pred_dir /path/to/pred_dir \
  --gold_dir ./gold \
  --config evaluation_config_compare.yaml \
  --mode cs \
  --output_dir ./results_cs \
  --examples dacomp-de-impl-001 dacomp-de-impl-002
```