Support workload group ONNXPY (Python-generated ONNX)
Summary
Add a new workload group ONNXPY that behaves like ONNX except the ONNX file is produced at run time by running a Python script. The script is invoked with a temporary output path; on success, the workload is executed like ONNX using that file; on failure, the entire workload is skipped. The temporary file is removed after all instances of the workload have been processed.
Workload config YAML (differences from ONNX)
- **No `path` in instances**: Instance entries do not include a `path` attribute (unlike ONNX, where each instance has e.g. `path: 'onnx/resnet50.onnx'`).
- **`module` without `@` qualifier**: The `module` field is a single script path (e.g. `onnx/generate_resnet.py`), not a qualified name like `ResNet@basicresnet.py`. It identifies the Python script that will be run to generate the ONNX file.
- Otherwise the block looks like ONNX: `api: ONNXPY`, `name`, `basedir`, `instances` (and optionally `params` if desired).
Example (illustrative):

```yaml
- api: ONNXPY
  name: RESNET50
  basedir: workloads
  module: onnx/generate_resnet50.py
  instances:
    rn50_b1_hd: { img_height: 1024, img_width: 1024, bs: 1 }
```
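For reference, the script named by `module` only needs to accept `--output <path>`, write a valid ONNX file there, and signal failure via its exit status. A hypothetical `generate_resnet50.py` could look like the following minimal sketch; the torchvision-based body is purely illustrative and not part of this issue:

```python
# generate_resnet50.py -- hypothetical ONNXPY generator script (illustrative only)
import argparse

import torch
import torchvision

def main() -> None:
    parser = argparse.ArgumentParser(description="Generate a ResNet-50 ONNX file")
    parser.add_argument("--output", required=True, help="path to write the ONNX model")
    args = parser.parse_args()

    model = torchvision.models.resnet50().eval()
    dummy = torch.randn(1, 3, 224, 224)
    # Any uncaught exception here produces a non-zero exit code, which tells
    # the driver to skip the whole ONNXPY workload.
    torch.onnx.export(model, dummy, args.output)

if __name__ == "__main__":
    main()
```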
Execution flow
- Pre-step (once per ONNXPY workload, before running its instances; see the sketch after this list)
  - Generate a temporary file name (e.g. via `tempfile.NamedTemporaryFile(delete=False)` or equivalent).
  - Run the Python script given by `module` (resolved relative to `basedir`), passing the temp path via `--output <path>`.
  - If the script exits with a non-zero code: skip the entire workload (do not run any instance of this workload).
  - If the script succeeds: record the temp path for this `(wlgroup, wlname)` and use it as the ONNX path for all instances of this workload.
- Per-instance handling (same as ONNX)
  - For each instance of an ONNXPY workload that was not skipped, treat it like ONNX: use the generated temp path as the model path and call the same ONNX pipeline (e.g. `onnx2graph(wli, wpath)` as at polaris.py line 132).
  - Instance config (e.g. `bs`, `img_height`, `img_width`) is used as today for ONNX (e.g. in `wlcfg` and downstream stats).
- Cleanup
  - After successfully processing all instances of a given ONNXPY workload, remove the temporary file.
  - If the workload was skipped (script failed), there is no temp file to remove.
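A minimal sketch of the pre-step, using only the standard library; the function name `build_onnxpy_paths` and the shape of the workload records are hypothetical, and the real driver would keep the returned map around for path resolution and cleanup:

```python
import os
import subprocess
import sys
import tempfile

def build_onnxpy_paths(onnxpy_workloads):
    """Run each ONNXPY generator script once; return {(wlgroup, wlname): temp_path}.

    Workloads whose script fails (non-zero exit or any OS-level error) are
    left out of the map, i.e. they are skipped entirely.
    """
    temp_paths = {}
    for wl in onnxpy_workloads:  # each wl is assumed to carry group/name/basedir/module
        fd, temp_path = tempfile.mkstemp(suffix='.onnx')
        os.close(fd)  # the generator script will (re)write this file
        script = os.path.join(wl['basedir'], wl['module'])
        try:
            result = subprocess.run(
                [sys.executable, script, '--output', temp_path],
                check=False,
            )
            ok = result.returncode == 0
        except OSError:
            ok = False  # e.g. missing interpreter or unreadable script
        if ok:
            temp_paths[(wl['group'], wl['name'])] = temp_path
        else:
            os.unlink(temp_path)  # nothing to keep; the workload is skipped
    return temp_paths
```

Skipped workloads never appear in the map, which also matches the cleanup rule above: there is no temp file left to remove for them.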
Code / config touchpoints
- Workload spec / validation (a validator sketch follows this list)
  - Add an ONNXPY workload model (e.g. `PYDWorkloadONNXPYModelValidator` in ttsim/config/validators.py) with `api: Literal['ONNXPY']`, no `path` in instance config, and `module` as a plain string (no `@`).
  - Extend `AnyWorkload` (and simconfig's workload class table) so ONNXPY is a recognized API and `get_instances()` for ONNXPY returns instance configs without a `path` key (path will be supplied at run time).
- Polaris driver (polaris.py; driver-side helpers are sketched below)
  - Pre-phase: Before the main experiment loop (e.g. before or at the start of `execute_wl_on_dev`), for each unique `(wlgroup, wlname)` where `wlgroup == 'ONNXPY'`: run the script with `--output <tempfile>`, and build a map `(wlgroup, wlname) -> temp_path` on success, or mark that workload as skipped on non-zero exit. Optionally filter the workload list so skipped ONNXPY workloads are not iterated.
  - Path resolution: In the loop where `wlpath = wlins_cfg['path']` is used (polaris.py around line 321), for ONNXPY use the precomputed temp path from the map instead of reading `path` from `wlins_cfg`.
  - Graph construction: In `get_wlgraph` (polaris.py around line 130), add a branch for `wlg == 'ONNXPY'` that mirrors the ONNX branch: same `onnx2graph(wli, wpath)` and same perf/count handling.
  - Cleanup: After the main loop over experiments, for each ONNXPY workload that was run (has an entry in the temp-path map), remove the corresponding temporary file (e.g. `os.remove` or `Path.unlink`), and handle errors if the file was already removed.
- Script invocation
  - Resolve `module` relative to `basedir` (and optionally set `cwd` to the repo root or `basedir` when running the script; document the chosen behavior).
  - Invoke as: `python <module_path> --output <temp_file_path>`.
  - No other CLI args are specified in this issue; an optional follow-up could add passing workload/instance params if needed.
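A minimal sketch of the validator, assuming the existing models in ttsim/config/validators.py are Pydantic-based; field names other than `api`, `name`, `basedir`, `module`, and `instances` (and the exact instance-config typing) are assumptions:

```python
from typing import Any, Literal

from pydantic import BaseModel, field_validator

class PYDWorkloadONNXPYModelValidator(BaseModel):
    """ONNXPY workload block: like ONNX, but the model file is generated at run time."""
    api: Literal['ONNXPY']
    name: str
    basedir: str
    module: str                            # plain script path, no '@' qualifier
    instances: dict[str, dict[str, Any]]   # instance configs carry no 'path' key

    @field_validator('module')
    @classmethod
    def module_is_plain_script(cls, v: str) -> str:
        if '@' in v:
            raise ValueError("ONNXPY 'module' must be a plain script path (no '@' qualifier)")
        return v
```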
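On the driver side, path resolution and cleanup could be factored as small helpers like the ones below; the helper names and the `onnxpy_paths` map are hypothetical, and in polaris.py this logic would more likely be inlined at the points noted above (around lines 321 and 130, plus after the main experiment loop). The `get_wlgraph` change itself is then just letting `'ONNXPY'` reach the same `onnx2graph(wli, wpath)` call as ONNX.

```python
from pathlib import Path
from typing import Any

def resolve_wl_path(wlg: str, wlname: str, wlins_cfg: dict[str, Any],
                    onnxpy_paths: dict[tuple[str, str], str]) -> str:
    """Pick the model path for one instance: ONNXPY uses the generated temp file,
    every other API keeps reading 'path' from the instance config as today."""
    if wlg == 'ONNXPY':
        return onnxpy_paths[(wlg, wlname)]
    return wlins_cfg['path']

def cleanup_onnxpy_paths(onnxpy_paths: dict[tuple[str, str], str]) -> None:
    """Remove generated ONNX files once all instances of each workload have run."""
    for temp_path in onnxpy_paths.values():
        Path(temp_path).unlink(missing_ok=True)  # tolerate already-removed files
```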
Edge cases / notes
- Dry run: For `--dryrun`, do not run the script; either skip ONNXPY entries or show them as "would run script and then ONNX" without creating temp files.
- Filtering: If `--filterwlg` / `--filterwl` / `--filterwli` exclude some instances, cleanup should still remove the temp file once all in-scope instances of that ONNXPY workload have been processed.
- Failure handling: Only the script exit code is used to decide "skip workload"; any exception during script execution (e.g. a missing interpreter) can be treated as a failure, and the workload is skipped.
Acceptance criteria
- Config schema allows `api: ONNXPY` with `module` (no `@`) and instances without `path`.
- Polaris runs the module script once per ONNXPY workload with `--output <tempfile>`; on non-zero exit, that workload's instances are skipped.
- Each instance of a successful ONNXPY workload is executed like ONNX (same graph and stats path as polaris.py line 130).
- The temporary file is removed after all instances of that ONNXPY workload have been processed.
- Dry run does not create or leave temp files.
Optional clarifications (for implementer or follow-up)
- Working directory: When running the script, should `cwd` be the repo root, `basedir`, or the directory containing the script?
- Extra script arguments: Should the script receive only `--output`, or also workload/instance params (e.g. from `params` or instance config)?
- Params: Should ONNXPY support an optional top-level `params` block like ONNX/TTSIM for merging into instance config?