Standardize dataset preparation in validation proposal

ken6078 · ken6078 · commit c2acb8640bc4 · 2026-06-25T02:08:45.000+08:00
Signed-off-by: Kai-Wei Chou <contact@kaiwei.dev>
diff --git a/docs/proposals/scenarios/example-restoration-ci/example-restoration-ci.md b/docs/proposals/scenarios/example-restoration-ci/example-restoration-ci.md
@@ -283,14 +283,16 @@ The proposed framework adds a validation layer around existing Ianvs examples.
 Ianvs Repository
 ├── examples/
 │   ├── llm_simple_qa/
+│   │   └── scripts/
+│   │       └── prepare_dataset.py
 │   ├── example A/
 │   ├── example B/
 │   └── ...
 │
 ├── tools/
 │   └── example_validation/
 |       ├── data/
-|       |   └── example_inventory.json
+|       |   └── example_inventory.yaml
 │       ├── validate_examples.py
 │       ├── inventory.py
 │       ├── static_validator.py
@@ -315,12 +317,13 @@ The responsibilities of the proposed files are:
 | Path | Responsibility |
 |---|---|
 | `examples/` | Stores Ianvs example projects, including their runnable configurations, documentation, dependency references, dataset references, and algorithm-related files. These directories are the validation targets of the framework. |
-| `tools/example_validation/data/example_inventory.json` | Stores the example inventory and classification metadata, including each example's path, validation level, dataset requirements, dependency requirements, model requirements, hardware requirements, and current status. |
+| `examples/<example_name>/scripts/prepare_dataset.py` | Provides the standard dataset preparation entry point for examples that support automated dataset setup. It should download, generate, or normalize the required dataset into the documented directory structure from a clean environment. |
+| `tools/example_validation/data/example_inventory.yaml` | Stores the example inventory and classification metadata, including each example's path, validation level, dataset requirements, dependency requirements, model requirements, hardware requirements, current status, expected dataset structure, and whether the dataset is external when automated preparation is unavailable. |
 | `tools/example_validation/validate_examples.py` | Serves as the main entry point for local and CI validation. It should parse CLI arguments, load the inventory, select validation stages, invoke the validator modules, and coordinate report generation. |
 | `tools/example_validation/inventory.py` | Loads and manages the example inventory. It should provide structured metadata access, helper logic for selecting changed or affected examples, and shared inventory operations used by the validation pipeline. |
 | `tools/example_validation/static_validator.py` | Performs lightweight static checks without executing examples. It should detect problems such as missing files, invalid YAML, broken relative paths, hardcoded local paths, outdated repository layout references, README and configuration mismatches, local-only model paths, and CUDA-only assumptions. |
 | `tools/example_validation/dependency_validator.py` | Validates whether example dependencies are properly declared and installable. It should check dependency file presence, package installation behavior, Python version compatibility, and dependency-related failures that block clean-environment execution. |
-| `tools/example_validation/dataset_validator.py` | Validates dataset-related requirements and lightweight data structure correctness. It should check dataset path consistency, external dataset documentation, and format validity for files such as JSONL. |
+| `tools/example_validation/dataset_validator.py` | Validates dataset-related requirements and lightweight data structure correctness. It should check dataset path consistency, `prepare_dataset.py` availability when automation is supported, declared dataset structure in the inventory, `external` classification when automation is unavailable, and format validity for files such as JSONL. |
 | `tools/example_validation/smoke_test_runner.py` | Runs lightweight execution tests for selected examples to confirm that they can start and complete a minimal validation run in CI without requiring full benchmark workloads where possible. |
 | `tools/example_validation/report_generator.py` | Converts validation results into human-readable CI summaries and example health reports, including failure classifications, reproduction commands, and suggested next actions for contributors and maintainers. |
 | `docs/example_validation/validation_rules.md` | Documents the validation rules implemented by the framework, including what each validator checks, why the rule exists, and how maintainers should interpret its result. |
@@ -378,8 +381,15 @@ examples:
     path: examples/llm_simple_qa
     benchmark_config: benchmarkingjob.yaml
     requirement_file: examples/llm_simple_qa/requestment.txt
-    dataset_required: true
-    dataset_format: jsonl
+    dataset:
+      required: true
+      external: false
+      prepare_script: examples/llm_simple_qa/scripts/prepare_dataset.py
+      root: dataset/llm_simple_qa
+      structure:
+        - train_data/data.jsonl
+        - test_data/data.jsonl
+      format: jsonl
     model_required: true
     gpu_required: false
     validation_level: smoke
@@ -417,6 +427,7 @@ Checks:
 * Dataset format mismatch between README, YAML, and runtime code
 * README contains dependency installation instructions
 * README contains dataset preparation instructions
+* README references the standard `prepare_dataset.py` flow when the example supports automated dataset setup
 * README contains JSONL format when applicable
 * README contains model configuration instructions when applicable
 * README paths match YAML paths
@@ -479,6 +490,7 @@ Static validation should be lightweight enough to run across relevant examples o
 For `examples/llm_simple_qa`, static validation should also confirm:
 
 * The README explains the example overview, setup steps, dependency installation, dataset preparation, JSONL format, model configuration, run command, expected output, and troubleshooting.
+* Dataset preparation uses `prepare_dataset.py` when the example supports automated setup, and the documented dataset layout matches the structure declared in `example_inventory.yaml`.
 * Model loading uses a portable model ID or a documented override mechanism instead of local-only paths.
 * Device selection supports CUDA, MPS, and CPU fallback rather than assuming CUDA-only execution.
 * Metric handling avoids crashes when no valid prediction-answer pairs exist, for example by returning `0.0` and logging a warning instead of triggering `ZeroDivisionError`.
@@ -541,12 +553,15 @@ Purpose:
 
 Checks:
 
-* Dataset path exists or is documented as external
+* Dataset path exists or is declared in `example_inventory.yaml`
 * Dataset path matches README and YAML references
+* `prepare_dataset.py` exists for examples that support automated dataset setup
+* `example_inventory.yaml` declares the expected dataset directory structure
+* If automated dataset setup is unavailable, the example inventory marks the dataset as `external: true`
 * JSONL files are not empty
 * Each JSONL line is a complete JSON object
 * Required fields are present
-* Dataset generation script exists when data is not committed
+* `prepare_dataset.py` produces or documents the expected dataset layout when data is not committed
 
 For `examples/llm_simple_qa`, the expected dataset layout may be:
 
@@ -567,6 +582,7 @@ Each JSONL line should be one complete JSON object, for example:
 Example validation commands:
 
 ```bash
+python examples/llm_simple_qa/scripts/prepare_dataset.py
 python examples/llm_simple_qa/scripts/validate_jsonl.py dataset/llm_simple_qa/train_data/data.jsonl
 python examples/llm_simple_qa/scripts/validate_jsonl.py dataset/llm_simple_qa/test_data/data.jsonl
 ```
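
Review note: the dataset checks listed in this change (a declared structure, `prepare_dataset.py` availability, the `external` flag, and JSONL validity) could be sketched roughly as below. This is only an illustration and not the proposal's `dataset_validator.py`. It assumes the inventory entry has already been parsed into a dict shaped like the `dataset:` block in the diff. The function names `validate_dataset` and `check_jsonl` and the `required_fields` parameter are hypothetical.

```python
import json
from pathlib import Path


def check_jsonl(path, required_fields=()):
    """Return problems found in one JSONL file (hypothetical helper)."""
    problems = []
    seen_record = False
    with open(path, encoding="utf-8") as handle:
        for number, line in enumerate(handle, start=1):
            if not line.strip():
                continue  # blank lines are ignored in this sketch
            seen_record = True
            try:
                record = json.loads(line)
            except json.JSONDecodeError:
                problems.append(f"{path.name}:{number}: not valid JSON")
                continue
            if not isinstance(record, dict):
                problems.append(f"{path.name}:{number}: not a JSON object")
                continue
            missing = [name for name in required_fields if name not in record]
            if missing:
                problems.append(f"{path.name}:{number}: missing fields {missing}")
    if not seen_record:
        problems.append(f"{path.name}: file is empty")
    return problems


def validate_dataset(entry, repo_root, required_fields=()):
    """Check one inventory entry; an empty list means every check passed."""
    repo_root = Path(repo_root)
    dataset = entry.get("dataset", {})
    if not dataset.get("required", False):
        return []
    if dataset.get("external", False):
        # Assumption: external datasets are only documented, so file checks are skipped.
        return []

    problems = []
    script = dataset.get("prepare_script")
    if not script:
        problems.append("dataset is not external but declares no prepare_script")
    elif not (repo_root / script).is_file():
        problems.append(f"missing prepare script: {script}")

    root = dataset.get("root", "")
    for relative in dataset.get("structure", []):
        path = repo_root / root / relative
        if not path.is_file():
            problems.append(f"missing dataset file: {root}/{relative}")
        elif dataset.get("format") == "jsonl":
            problems.extend(check_jsonl(path, required_fields))
    return problems
```

Calling `validate_dataset(entry, ".", required_fields=("question", "answer"))` from the repository root would return a list of problem strings for the `llm_simple_qa` entry. The field names in that call are placeholders, since the required fields are defined elsewhere in the proposal. Returning a list instead of raising lets the report generator collect every failure in one pass.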