Support workload group ONNXPY (Python-generated ONNX)
Summary
Add a new workload group ONNXPY that behaves like ONNX except the ONNX file is produced at run time by running a Python script. The script is invoked with a temporary output path; on success, the workload is executed like ONNX using that file; on failure, the entire workload is skipped. The temporary file is removed after all instances of the workload have been processed.
Workload config YAML (differences from ONNX)
- **No `path` in instances**: Instance entries do not include a `path` attribute (unlike ONNX, where each instance has e.g. `path: 'onnx/resnet50.onnx'`).
- **`module` without `@` qualifier**: The `module` field is a single script path (e.g. `onnx/generate_resnet.py`), not a qualified name like `ResNet@basicresnet.py`. It identifies the Python script that will be run to generate the ONNX file.
- Otherwise the block looks like ONNX: `api: ONNXPY`, `name`, `basedir`, `instances` (and optionally `params` if desired).
Example (illustrative):

```yaml
- api: ONNXPY
  name: RESNET50
  basedir: workloads
  module: onnx/generate_resnet50.py
  instances:
    rn50_b1_hd: { img_height: 1024, img_width: 1024, bs: 1 }
```
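For reference, the script named by `module` only needs to accept `--output <path>`, write a valid ONNX file there, and signal failure via its exit status. A hypothetical `generate_resnet50.py` could look like the following minimal sketch; the torchvision-based body is purely illustrative and not part of this issue:

```python
# generate_resnet50.py -- hypothetical ONNXPY generator script (illustrative only)
import argparse

import torch
import torchvision

def main() -> None:
    parser = argparse.ArgumentParser(description="Generate a ResNet-50 ONNX file")
    parser.add_argument("--output", required=True, help="path to write the ONNX model")
    args = parser.parse_args()

    model = torchvision.models.resnet50().eval()
    dummy = torch.randn(1, 3, 224, 224)
    # Any uncaught exception here produces a non-zero exit code, which tells
    # the driver to skip the whole ONNXPY workload.
    torch.onnx.export(model, dummy, args.output)

if __name__ == "__main__":
    main()
```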
Execution flow
- Pre-step (once per ONNXPY workload, before running its instances; see the sketch after this list)
  - Generate a temporary file name (e.g. via `tempfile.NamedTemporaryFile(delete=False)` or equivalent).
  - Run the Python script given by `module` (resolved relative to `basedir`), passing the temp path via `--output <path>`.
  - If the script exits with a non-zero code: skip the entire workload (do not run any instance of this workload).
  - If the script succeeds: record the temp path for this `(wlgroup, wlname)` and use it as the ONNX path for all instances of this workload.
- Per-instance handling (same as ONNX)
  - For each instance of an ONNXPY workload that was not skipped, treat it like ONNX: use the generated temp path as the model path and call the same ONNX pipeline (e.g. `onnx2graph(wli, wpath)` as at polaris.py line 132).
  - Instance config (e.g. `bs`, `img_height`, `img_width`) is used as today for ONNX (e.g. in `wlcfg` and downstream stats).
- Cleanup
  - After successfully processing all instances of a given ONNXPY workload, remove the temporary file.
  - If the workload was skipped (script failed), there is no temp file to remove.
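A minimal sketch of the pre-step, using only the standard library; the function name `build_onnxpy_paths` and the shape of the workload records are hypothetical, and the real driver would keep the returned map around for path resolution and cleanup:

```python
import os
import subprocess
import sys
import tempfile

def build_onnxpy_paths(onnxpy_workloads):
    """Run each ONNXPY generator script once; return {(wlgroup, wlname): temp_path}.

    Workloads whose script fails (non-zero exit or any OS-level error) are
    left out of the map, i.e. they are skipped entirely.
    """
    temp_paths = {}
    for wl in onnxpy_workloads:  # each wl is assumed to carry group/name/basedir/module
        fd, temp_path = tempfile.mkstemp(suffix='.onnx')
        os.close(fd)  # the generator script will (re)write this file
        script = os.path.join(wl['basedir'], wl['module'])
        try:
            result = subprocess.run(
                [sys.executable, script, '--output', temp_path],
                check=False,
            )
            ok = result.returncode == 0
        except OSError:
            ok = False  # e.g. missing interpreter or unreadable script
        if ok:
            temp_paths[(wl['group'], wl['name'])] = temp_path
        else:
            os.unlink(temp_path)  # nothing to keep; the workload is skipped
    return temp_paths
```

Skipped workloads never appear in the map, which also matches the cleanup rule above: there is no temp file left to remove for them.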
Code / config touchpoints
- Workload spec / validation (a validator sketch follows this list)
  - Add an ONNXPY workload model (e.g. `PYDWorkloadONNXPYModelValidator` in ttsim/config/validators.py) with `api: Literal['ONNXPY']`, no `path` in instance config, and `module` as a plain string (no `@`).
  - Extend `AnyWorkload` (and simconfig's workload class table) so ONNXPY is a recognized API and `get_instances()` for ONNXPY returns instance configs without a `path` key (path will be supplied at run time).
- Polaris driver (polaris.py; driver-side helpers are sketched below)
  - Pre-phase: Before the main experiment loop (e.g. before or at the start of `execute_wl_on_dev`), for each unique `(wlgroup, wlname)` where `wlgroup == 'ONNXPY'`: run the script with `--output <tempfile>`, and build a map `(wlgroup, wlname) -> temp_path` on success, or mark that workload as skipped on non-zero exit. Optionally filter the workload list so skipped ONNXPY workloads are not iterated.
  - Path resolution: In the loop where `wlpath = wlins_cfg['path']` is used (polaris.py around line 321), for ONNXPY use the precomputed temp path from the map instead of reading `path` from `wlins_cfg`.
  - Graph construction: In `get_wlgraph` (polaris.py around line 130), add a branch for `wlg == 'ONNXPY'` that mirrors the ONNX branch: same `onnx2graph(wli, wpath)` and same perf/count handling.
  - Cleanup: After the main loop over experiments, for each ONNXPY workload that was run (has an entry in the temp-path map), remove the corresponding temporary file (e.g. `os.remove` or `Path.unlink`), and handle errors if the file was already removed.
- Script invocation
  - Resolve `module` relative to `basedir` (and optionally set `cwd` to the repo root or `basedir` when running the script; document the chosen behavior).
  - Invoke as: `python <module_path> --output <temp_file_path>`.
  - No other CLI args are specified in this issue; an optional follow-up could add passing workload/instance params if needed.
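A minimal sketch of the validator, assuming the existing models in ttsim/config/validators.py are Pydantic-based; field names other than `api`, `name`, `basedir`, `module`, and `instances` (and the exact instance-config typing) are assumptions:

```python
from typing import Any, Literal

from pydantic import BaseModel, field_validator

class PYDWorkloadONNXPYModelValidator(BaseModel):
    """ONNXPY workload block: like ONNX, but the model file is generated at run time."""
    api: Literal['ONNXPY']
    name: str
    basedir: str
    module: str                            # plain script path, no '@' qualifier
    instances: dict[str, dict[str, Any]]   # instance configs carry no 'path' key

    @field_validator('module')
    @classmethod
    def module_is_plain_script(cls, v: str) -> str:
        if '@' in v:
            raise ValueError("ONNXPY 'module' must be a plain script path (no '@' qualifier)")
        return v
```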
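On the driver side, path resolution and cleanup could be factored as small helpers like the ones below; the helper names and the `onnxpy_paths` map are hypothetical, and in polaris.py this logic would more likely be inlined at the points noted above (around lines 321 and 130, plus after the main experiment loop). The `get_wlgraph` change itself is then just letting `'ONNXPY'` reach the same `onnx2graph(wli, wpath)` call as ONNX.

```python
from pathlib import Path
from typing import Any

def resolve_wl_path(wlg: str, wlname: str, wlins_cfg: dict[str, Any],
                    onnxpy_paths: dict[tuple[str, str], str]) -> str:
    """Pick the model path for one instance: ONNXPY uses the generated temp file,
    every other API keeps reading 'path' from the instance config as today."""
    if wlg == 'ONNXPY':
        return onnxpy_paths[(wlg, wlname)]
    return wlins_cfg['path']

def cleanup_onnxpy_paths(onnxpy_paths: dict[tuple[str, str], str]) -> None:
    """Remove generated ONNX files once all instances of each workload have run."""
    for temp_path in onnxpy_paths.values():
        Path(temp_path).unlink(missing_ok=True)  # tolerate already-removed files
```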
Edge cases / notes
- Dry run: For `--dryrun`, do not run the script; either skip ONNXPY entries or show them as "would run script and then ONNX" without creating temp files.
- Filtering: If `--filterwlg` / `--filterwl` / `--filterwli` exclude some instances, cleanup should still remove the temp file once all in-scope instances of that ONNXPY workload have been processed.
- Failure handling: Only the script exit code is used to decide "skip workload"; any exception during script execution (e.g. a missing interpreter) can be treated as a failure, and the workload is skipped.
Acceptance criteria
- Config schema allows `api: ONNXPY` with `module` (no `@`) and instances without `path`.
- Polaris runs the module script once per ONNXPY workload with `--output <tempfile>`; on non-zero exit, that workload's instances are skipped.
- Each instance of a successful ONNXPY workload is executed like ONNX (same graph and stats path as polaris.py line 130).
- The temporary file is removed after all instances of that ONNXPY workload have been processed.
- Dry run does not create or leave temp files.
Optional clarifications (for implementer or follow-up)
- Working directory: When running the script, should `cwd` be the repo root, `basedir`, or the directory containing the script?
- Extra script arguments: Should the script receive only `--output`, or also workload/instance params (e.g. from `params` or instance config)?
- Params: Should ONNXPY support an optional top-level `params` block like ONNX/TTSIM for merging into instance config?