DAComp-DE Evaluation Suite Usage Guide

This directory contains the unified evaluation suite for DAComp-DE (Data Engineering) tasks. It is designed to automatically score DuckDB databases produced by candidate data pipelines. The primary entry point is evaluate.py, with core logic implemented in utils.py.

Directory Structure

  • evaluate.py: The command-line entry point supporting both single-task and batch evaluation.
  • utils.py: Handles threshold validation, core consistency checks, and layered/mixed scoring logic.
  • evaluation_config_compare.yaml: The evaluation configuration file defining database filenames, layer hierarchies (e.g., staging, intermediate, marts), table definitions, and scoring weights for each example.
  • gold/: The directory for Ground Truth (Gold) data. It stores the standard DuckDB database files organized by example ID (e.g., gold/dacomp-de-impl-001/lever.duckdb).
  • README.md: This documentation.
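For orientation, a single entry in evaluation_config_compare.yaml might look like the sketch below. Only the examples.<example_id>.database_file key is documented in this guide; the layer and weight field names shown here are assumptions and may differ from the actual schema:

```yaml
# Illustrative only: field names other than database_file are assumptions.
examples:
  dacomp-de-impl-001:
    database_file: lever.duckdb        # DuckDB file the pipeline must produce
    layers:                            # hypothetical layer hierarchy
      staging: [stg_candidates, stg_postings]
      intermediate: [int_applications]
      marts: [fct_hiring_funnel]
    weights:                           # hypothetical per-layer scoring weights
      staging: 0.2
      intermediate: 0.3
      marts: 0.5
```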

Prerequisites

We recommend using Python 3.9+. Install the required dependencies using the following command:

pip install duckdb pandas numpy pyyaml

Prediction Directory (pred_dir) Convention

The evaluation script expects the Prediction Directory (pred_dir) to follow the structure below:

<pred_dir>/
  dacomp-de-impl-001/
    run.py
    lever.duckdb           # Generated by run.py
    config/...
    sql/...
  dacomp-de-impl-002/
    run.py
    pendo.duckdb
    ...

Key Requirements:

  • Each example must have its own subdirectory named after its example_id (e.g., dacomp-de-impl-001).
  • Each subdirectory must contain an executable run.py script responsible for building the corresponding DuckDB database file (e.g., lever.duckdb, pendo.duckdb).
  • The expected database filename is defined in evaluation_config_compare.yaml under examples.<example_id>.database_file.

Evaluation Process:

  1. Execution: The script runs pred_dir/<example_id>/run.py to generate or update the DuckDB database. (Use --force-rebuild to delete existing databases and enforce a fresh run).
  2. Comparison: The script compares the generated database at pred_dir/<example_id>/<database_file> against the ground truth at gold/<config_id>/<database_file>.

Note: The config_id used in the Gold directory is derived from the configuration file. While usually identical to the example_id in the prediction directory, they are associated via prefix matching.
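The prefix matching described above could be sketched as follows; the real logic lives in utils.py and may differ in detail:

```python
from typing import Iterable, Optional


def match_config_id(example_id: str, config_ids: Iterable[str]) -> Optional[str]:
    """Hypothetical sketch: pair a prediction directory name with a config ID
    when either string is a prefix of the other."""
    for config_id in config_ids:
        if example_id.startswith(config_id) or config_id.startswith(example_id):
            return config_id
    return None
```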

Gold Directory (gold_dir) Convention

The gold_dir should be structured as follows:

<gold_dir>/
  dacomp-de-impl-001/
    lever.duckdb
  dacomp-de-impl-002/
    pendo.duckdb
  ...

  • Subdirectory names usually correspond to the example IDs in the configuration file.
  • Each subdirectory contains the ground truth DuckDB database file, as specified by the database_file field in the configuration.

Evaluation Modes

  • cfs (Default): Evaluates the entire pipeline by layer and table on the generated DuckDB database. This checks the overall effectiveness of the pipeline.
  • cs: Performs a step-by-step correctness check by running your SQL within the Gold environment table by table. This is useful for debugging specific table errors.
    • Warning: This mode requires significant memory. We recommend allocating at least 16GB RAM.

Usage

All commands should be executed from the dacomp-de/evaluation_suite directory:

cd dacomp-de/evaluation_suite

1. Single Evaluation

To evaluate a specific example:

python evaluate.py single \
  --pred_dir /path/to/pred_dir \
  --example_id dacomp-de-impl-001 \
  --gold_dir ./gold \
  --config evaluation_config_compare.yaml \
  --mode cfs \
  --force-rebuild \
  --output result_impl_001.json

Parameters:

  • --pred_dir: The root directory containing prediction subdirectories.
  • --gold_dir: The root directory for ground truth data (default: ./gold).
  • --config: The evaluation configuration file (default: evaluation_config_compare.yaml).
  • --example_id: The specific example directory name within pred_dir to evaluate.
  • --output: (Optional) File path to save the full JSON result. If omitted, key results are printed to stdout.
  • --force-rebuild: Forces the deletion of existing DuckDB files and re-runs run.py.
  • --mode: cfs (default) or cs.

Output:

  • Terminal: Displays Final Score: xx.xx and Evaluation Level (e.g., core_perfect_match, partial_evaluation).
  • JSON File (if --output is set): Contains detailed metrics including threshold_evaluation (schema checks), core_accuracy_evaluation, partial_evaluation (layer-wise scores), and the final summary.
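As a rough illustration, the JSON result is shaped roughly like the sketch below; the top-level keys come from the description above, but the nested fields and all values are assumptions:

```json
{
  "threshold_evaluation": { "passed": true },
  "core_accuracy_evaluation": { "score": 100.0 },
  "partial_evaluation": { "staging": 95.0, "marts": 88.0 },
  "summary": { "final_score": 91.5, "evaluation_level": "partial_evaluation" }
}
```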

2. Batch Evaluation

Batch mode evaluates multiple examples within a pred_dir and generates a summary report.

A. Automatic Discovery

If --examples is not provided, the script scans pred_dir for all subdirectories matching the configuration IDs:

python evaluate.py batch \
  --pred_dir /path/to/pred_dir \
  --gold_dir ./gold \
  --config evaluation_config_compare.yaml \
  --mode cfs \
  --force-rebuild \
  --output_dir ./results

Results Structure: The script generates a directory structure in output_dir based on parsed metadata:

./results/<model>/<task>/<param_tag>/<mode>/
  summary.json
  scores.csv

  • model / param_tag: Parsed from the pred_dir path name (e.g., .../my-model-temp-0.2).
  • task: Inferred task category (e.g., impl, arch, evol, mixed).
  • summary.json: Comprehensive statistics and detailed results for every example.
  • scores.csv: A simplified list of tasks and scores, including the overall average.
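A scores.csv might look like the hypothetical fragment below; the column names, scores, and the label of the average row are illustrative assumptions:

```
example_id,score
dacomp-de-impl-001,92.31
dacomp-de-impl-002,75.00
average,83.66
```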

B. Explicit List

You can strictly define which examples to evaluate using the --examples flag:

python evaluate.py batch \
  --pred_dir /path/to/pred_dir \
  --gold_dir ./gold \
  --config evaluation_config_compare.yaml \
  --mode cs \
  --output_dir ./results_cs \
  --examples dacomp-de-impl-001 dacomp-de-impl-002