Commit c2acb86

Standardize dataset preparation in validation proposal
Signed-off-by: Kai-Wei Chou <contact@kaiwei.dev>
1 parent 5844ef4 commit c2acb86

1 file changed

Lines changed: 23 additions & 7 deletions

docs/proposals/scenarios/example-restoration-ci/example-restoration-ci.md

@@ -283,14 +283,16 @@ The proposed framework adds a validation layer around existing Ianvs examples.
 Ianvs Repository
 ├── examples/
 │   ├── llm_simple_qa/
+│   │   └── scripts/
+│   │       └── prepare_dataset.py
 │   ├── example A/
 │   ├── example B/
 │   └── ...
 
 ├── tools/
 │   └── example_validation/
 |       ├── data/
-|       |   └── example_inventory.json
+|       |   └── example_inventory.yaml
 │       ├── validate_examples.py
 │       ├── inventory.py
 │       ├── static_validator.py
@@ -315,12 +317,13 @@ The responsibilities of the proposed files are:
 | Path | Responsibility |
 |---|---|
 | `examples/` | Stores Ianvs example projects, including their runnable configurations, documentation, dependency references, dataset references, and algorithm-related files. These directories are the validation targets of the framework. |
-| `tools/example_validation/data/example_inventory.json` | Stores the example inventory and classification metadata, including each example's path, validation level, dataset requirements, dependency requirements, model requirements, hardware requirements, and current status. |
+| `examples/<example_name>/scripts/prepare_dataset.py` | Provides the standard dataset preparation entry point for examples that support automated dataset setup. It should download, generate, or normalize the required dataset into the documented directory structure from a clean environment. |
+| `tools/example_validation/data/example_inventory.yaml` | Stores the example inventory and classification metadata, including each example's path, validation level, dataset requirements, dependency requirements, model requirements, hardware requirements, current status, expected dataset structure, and whether the dataset is external when automated preparation is unavailable. |
 | `tools/example_validation/validate_examples.py` | Serves as the main entry point for local and CI validation. It should parse CLI arguments, load the inventory, select validation stages, invoke the validator modules, and coordinate report generation. |
 | `tools/example_validation/inventory.py` | Loads and manages the example inventory. It should provide structured metadata access, helper logic for selecting changed or affected examples, and shared inventory operations used by the validation pipeline. |
 | `tools/example_validation/static_validator.py` | Performs lightweight static checks without executing examples. It should detect problems such as missing files, invalid YAML, broken relative paths, hardcoded local paths, outdated repository layout references, README and configuration mismatches, local-only model paths, and CUDA-only assumptions. |
 | `tools/example_validation/dependency_validator.py` | Validates whether example dependencies are properly declared and installable. It should check dependency file presence, package installation behavior, Python version compatibility, and dependency-related failures that block clean-environment execution. |
-| `tools/example_validation/dataset_validator.py` | Validates dataset-related requirements and lightweight data structure correctness. It should check dataset path consistency, external dataset documentation, and format validity for files such as JSONL. |
+| `tools/example_validation/dataset_validator.py` | Validates dataset-related requirements and lightweight data structure correctness. It should check dataset path consistency, `prepare_dataset.py` availability when automation is supported, declared dataset structure in the inventory, `external` classification when automation is unavailable, and format validity for files such as JSONL. |
 | `tools/example_validation/smoke_test_runner.py` | Runs lightweight execution tests for selected examples to confirm that they can start and complete a minimal validation run in CI without requiring full benchmark workloads where possible. |
 | `tools/example_validation/report_generator.py` | Converts validation results into human-readable CI summaries and example health reports, including failure classifications, reproduction commands, and suggested next actions for contributors and maintainers. |
 | `docs/example_validation/validation_rules.md` | Documents the validation rules implemented by the framework, including what each validator checks, why the rule exists, and how maintainers should interpret its result. |
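
Reviewer sketch (not part of the commit): the `prepare_dataset.py` row above asks for an entry point that downloads, generates, or normalizes a dataset into the documented layout from a clean environment. A minimal illustration of the "generate" case could look like the following. The sample records, the `question`/`answer` field names, and the `--root` flag are assumptions made for illustration; the proposal does not specify them.

```python
"""Hypothetical sketch of examples/<example_name>/scripts/prepare_dataset.py."""
import argparse
import json
from pathlib import Path

# Placeholder records; a real script would download or normalize actual data.
# The "question"/"answer" field names are assumptions, not taken from the proposal.
SAMPLE_RECORDS = [
    {"question": "What is 1 + 1?", "answer": "2"},
    {"question": "What color is the sky on a clear day?", "answer": "blue"},
]

# Relative layout mirroring the `structure` list declared in the inventory.
LAYOUT = ("train_data/data.jsonl", "test_data/data.jsonl")


def prepare_dataset(root: Path) -> list[Path]:
    """Write each JSONL file under `root`, creating parent directories as needed."""
    written = []
    for relative in LAYOUT:
        target = root / relative
        target.parent.mkdir(parents=True, exist_ok=True)
        with target.open("w", encoding="utf-8") as handle:
            for record in SAMPLE_RECORDS:
                handle.write(json.dumps(record, ensure_ascii=False) + "\n")
        written.append(target)
    return written


def main(argv=None) -> int:
    parser = argparse.ArgumentParser(description="Prepare the example dataset.")
    # `--root` is a hypothetical flag; the default matches the inventory's `root`.
    parser.add_argument("--root", type=Path, default=Path("dataset/llm_simple_qa"))
    args = parser.parse_args(argv)
    for path in prepare_dataset(args.root):
        print(f"wrote {path}")
    return 0
```

Because the script only writes beneath the requested root and creates directories itself, it can be rerun in a clean CI workspace without manual setup.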
@@ -378,8 +381,15 @@ examples:
     path: examples/llm_simple_qa
     benchmark_config: benchmarkingjob.yaml
     requirement_file: examples/llm_simple_qa/requestment.txt
-    dataset_required: true
-    dataset_format: jsonl
+    dataset:
+      required: true
+      external: false
+      prepare_script: examples/llm_simple_qa/scripts/prepare_dataset.py
+      root: dataset/llm_simple_qa
+      structure:
+        - train_data/data.jsonl
+        - test_data/data.jsonl
+      format: jsonl
     model_required: true
     gpu_required: false
     validation_level: smoke
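
Reviewer sketch (not part of the commit): the nested `dataset` block above lets a validator reason about one entry at a time. Assuming the YAML has already been parsed into a plain dictionary, a check along these lines could enforce the relationship between `external`, `prepare_script`, `root`, and `structure`. The function name and the wording of the messages are hypothetical.

```python
from pathlib import Path


def check_dataset_entry(entry: dict, repo_root: Path) -> list[str]:
    """Return human-readable problems found in one entry's `dataset` block."""
    dataset = entry.get("dataset") or {}
    problems = []
    if not dataset.get("required", False):
        return problems  # nothing to check when no dataset is needed

    script = dataset.get("prepare_script")
    if not dataset.get("external", False):
        # Automated setup is expected, so the declared script must exist.
        if not script:
            problems.append("prepare_script missing and dataset not marked external")
        elif not (repo_root / script).is_file():
            problems.append(f"prepare_script not found: {script}")

    if not dataset.get("root"):
        problems.append("dataset.root is missing")
    if not dataset.get("structure"):
        problems.append("dataset.structure is empty or missing")
    return problems
```

Returning a list of messages rather than raising on the first failure lets a report generator show every problem for an example in one pass.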
@@ -417,6 +427,7 @@ Checks:
 * Dataset format mismatch between README, YAML, and runtime code
 * README contains dependency installation instructions
 * README contains dataset preparation instructions
+* README references the standard `prepare_dataset.py` flow when the example supports automated dataset setup
 * README contains JSONL format when applicable
 * README contains model configuration instructions when applicable
 * README paths match YAML paths
@@ -479,6 +490,7 @@ Static validation should be lightweight enough to run across relevant examples o
 For `examples/llm_simple_qa`, static validation should also confirm:
 
 * The README explains the example overview, setup steps, dependency installation, dataset preparation, JSONL format, model configuration, run command, expected output, and troubleshooting.
+* Dataset preparation uses `prepare_dataset.py` when the example supports automated setup, and the documented dataset layout matches the structure declared in `example_inventory.yaml`.
 * Model loading uses a portable model ID or a documented override mechanism instead of local-only paths.
 * Device selection supports CUDA, MPS, and CPU fallback rather than assuming CUDA-only execution.
 * Metric handling avoids crashes when no valid prediction-answer pairs exist, for example by returning `0.0` and logging a warning instead of triggering `ZeroDivisionError`.
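
Reviewer sketch (not part of the commit): the last two bullets describe behavior that is easy to show in code. The following is one possible shape, assuming an exact-match metric and a device chooser that receives availability flags from the caller. Both function names are hypothetical, and the example's real metric may differ.

```python
import logging

logger = logging.getLogger(__name__)


def select_device(cuda_available: bool, mps_available: bool) -> str:
    """Prefer CUDA, then MPS, and fall back to CPU when neither is available."""
    if cuda_available:
        return "cuda"
    if mps_available:
        return "mps"
    return "cpu"


def exact_match_accuracy(predictions, answers) -> float:
    """Share of predictions equal to their answer, skipping pairs with a missing side."""
    pairs = [
        (p.strip().lower(), a.strip().lower())
        for p, a in zip(predictions, answers)
        if isinstance(p, str) and isinstance(a, str) and p.strip() and a.strip()
    ]
    if not pairs:
        # Guard the division: report 0.0 and warn instead of raising ZeroDivisionError.
        logger.warning("No valid prediction-answer pairs; returning 0.0")
        return 0.0
    return sum(p == a for p, a in pairs) / len(pairs)
```

With PyTorch, the two flags would typically come from `torch.cuda.is_available()` and `torch.backends.mps.is_available()`; passing them in keeps the selection logic testable on machines without a GPU.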
@@ -541,12 +553,15 @@ Purpose:
 
 Checks:
 
-* Dataset path exists or is documented as external
+* Dataset path exists or is declared in `example_inventory.yaml`
 * Dataset path matches README and YAML references
+* `prepare_dataset.py` exists for examples that support automated dataset setup
+* `example_inventory.yaml` declares the expected dataset directory structure
+* If automated dataset setup is unavailable, the example inventory marks the dataset as `external: true`
 * JSONL files are not empty
 * Each JSONL line is a complete JSON object
 * Required fields are present
-* Dataset generation script exists when data is not committed
+* `prepare_dataset.py` produces or documents the expected dataset layout when data is not committed
 
 For `examples/llm_simple_qa`, the expected dataset layout may be:
 
@@ -567,6 +582,7 @@ Each JSONL line should be one complete JSON object, for example:
 Example validation commands:
 
 ```bash
+python examples/llm_simple_qa/scripts/prepare_dataset.py
 python examples/llm_simple_qa/scripts/validate_jsonl.py dataset/llm_simple_qa/train_data/data.jsonl
 python examples/llm_simple_qa/scripts/validate_jsonl.py dataset/llm_simple_qa/test_data/data.jsonl
 ```
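
Reviewer sketch (not part of the commit): the commands above call `validate_jsonl.py`, whose contents are not shown in this diff. A minimal version covering the three JSONL checks listed in the proposal (non-empty file, one complete JSON object per line, required fields present) might look like this. The function signature and the error wording are assumptions, and the required field names must be supplied by the caller.

```python
import json
from pathlib import Path


def validate_jsonl(path: Path, required_fields=()) -> list[str]:
    """Return a list of problems found in a JSONL file; an empty list means it passed."""
    errors = []
    records_seen = 0
    with path.open(encoding="utf-8") as handle:
        for number, line in enumerate(handle, start=1):
            if not line.strip():
                errors.append(f"line {number}: blank line")
                continue
            records_seen += 1
            try:
                record = json.loads(line)
            except json.JSONDecodeError as exc:
                errors.append(f"line {number}: invalid JSON ({exc.msg})")
                continue
            if not isinstance(record, dict):
                errors.append(f"line {number}: expected a JSON object")
                continue
            missing = [name for name in required_fields if name not in record]
            if missing:
                errors.append(f"line {number}: missing fields {missing}")
    if records_seen == 0:
        errors.append("file contains no records")
    return errors
```

Reporting line numbers makes it straightforward for a CI summary to point contributors at the exact record that needs fixing.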
