
Adds inference script (incl. force calculation) to Unified External Aero Recipe #1706

Open
peterdsharpe wants to merge 25 commits into NVIDIA:main from peterdsharpe:psharpe/add-unified-recipe-inference-script

Conversation

@peterdsharpe

@peterdsharpe peterdsharpe commented Jun 8, 2026

Collaborator

PhysicsNeMo Pull Request

Description

This PR modifies the Unified External Aero Recipe (examples/cfd/external_aerodynamics/unified_external_aero_recipe) to:

  • Add an inference script (src/infer.py and conf/infer.yaml).
  • As part of this, add integrated force/moment coefficients to the recipe.

In order to pull this off cleanly (i.e., without massive duplication between the pre-existing train.py script and infer.py), some tooling was refactored within the recipe. In general this means:

  • train.py got a lot shorter.
  • utils.py and datasets.py got a lot longer, mostly with stuff that was pulled out of train.py.
  • infer.py can now re-use most of that shared tooling :)

So, in net, for code reviewers, I would recommend:

  • Check out the modified README.md for an overview.
  • Review that the refactor looks sensible, namely train.py -> utils.py and datasets.py.
  • Review the new infer.py, along with the new YAMLs, and let me know if it's understandable.
  • Review the new forces.py, which is used for force calculation.

All other changes are pretty peripheral and/or derived from this.


Some more detailed notes on how the inference script works:

infer.py is a companion to train.py that loads a trained checkpoint, runs it over a split, and writes predictions back to disk:

  • Model/dataset-agnostic. It keys off the same input_type / output_type / forward_kwargs / targets contract as training, so it works for every model in the recipe (GeoTransolver, Transolver, FLARE, GLOBE, ...) with no per-model code. It reuses build_dataloaders, the collate, normalize_output_to_tensordict, and MetricCalculator directly (which now live in datasets.py).
  • Output. One native .pdmsh DomainMesh per sample under ${output_dir}/${run_id}/predictions/, with pred_<field> and true_<field> on the interior, plus a metrics.jsonl.
  • Units. Metrics are reported in training space (so they line up with validation numbers); written fields are re-dimensionalized to physical units by inverting normalization then non-dimensionalization. An optional rescale_geometry restores physical-scale coordinates.
  • conf/infer.yaml shares conf/base.yaml with the trainer and swaps the training schedule for checkpoint/output knobs (run_id, checkpoint_dir/checkpoint_path, infer_split, redimensionalize, ...).
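The unit handling described above can be sketched roughly as follows. This is a minimal illustration, not the recipe's code: it assumes z-score normalization and a pressure coefficient non-dimensionalized by the freestream dynamic pressure, and the helper name `redimensionalize_pressure` along with `mean`, `std`, `rho_inf`, `u_inf`, and `p_inf` are hypothetical.

```python
import numpy as np


def redimensionalize_pressure(pred_norm, mean, std, rho_inf, u_inf, p_inf=0.0):
    """Map a normalized Cp prediction back to physical pressure (hypothetical helper)."""
    # Step 1: invert the (assumed) z-score normalization to recover Cp.
    cp = pred_norm * std + mean
    # Step 2: invert the non-dimensionalization Cp = (p - p_inf) / q_inf.
    q_inf = 0.5 * rho_inf * u_inf**2
    return cp * q_inf + p_inf


# Illustrative values only; real statistics would come from the dataset config.
pred_norm = np.array([-1.0, 0.0, 2.0])
pressure = redimensionalize_pressure(
    pred_norm, mean=0.1, std=0.5, rho_inf=1.2, u_inf=30.0
)
```

The order matters: normalization is applied last during training, so it is undone first here. Metrics would still be computed on `pred_norm` so that they stay comparable with validation numbers.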

Checklist

Dependencies

Review Process

All PRs are reviewed by the PhysicsNeMo team before merging.

Depending on which files are changed, GitHub may automatically assign a maintainer for review.

We are also testing AI-based code review tools (e.g., Greptile), which may add automated comments with a confidence score.
This score reflects the AI’s assessment of merge readiness and is not a qualitative judgment of your work, nor is
it an indication that the PR will be accepted / rejected.

AI-generated feedback should be reviewed critically for usefulness.
You are not required to respond to every AI comment, but they are intended to help both authors and reviewers.
Please react to Greptile comments with 👍 or 👎 to provide feedback on their accuracy.

- Introduced `infer.py`, a model-agnostic inference script that loads trained checkpoints, runs inference over specified dataset splits, and outputs predictions in the native `.pdmsh` format.
- Added `infer.yaml` configuration file to define inference parameters, including dataset selection, checkpoint locations, and output settings.
- Ensured compatibility with existing training configurations by sharing base settings, allowing for consistent precision and data loading across training and inference phases.
- Moved the import of `Float` from `jaxtyping` to improve code organization.
- Removed unnecessary blank lines to enhance readability.
…ynamics

- Updated the README to clarify that users can both train and run inference with models.
- Added detailed instructions for running inference on trained checkpoints, including command examples and output specifications.
- Introduced force and moment coefficient integration in the inference configuration, enabling users to obtain physical coefficients from surface cases.
- Improved the `infer.yaml` configuration to support force coefficient calculations and added relevant parameters for reference areas and moment centers.
- Refactored the `datasets.py` and `infer.py` scripts to streamline data loading and inference processes, ensuring consistency across training and inference phases.
@peterdsharpe peterdsharpe requested a review from coreyjadams as a code owner June 8, 2026 16:12
@copy-pr-bot

copy-pr-bot Bot commented Jun 8, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@greptile-apps

greptile-apps Bot commented Jun 8, 2026

Contributor

Greptile Summary

This PR adds an inference script (src/infer.py + conf/infer.yaml) and integrated aerodynamic force/moment coefficient support (src/forces.py) to the Unified External Aero Recipe, and refactors the previous train.py monolith to share dataloader assembly, device movement, autocast, and JSONL logging with the new companion script.

  • forces.py: new module that integrates per-cell Cp/Cf surface fields into CD/CL/CS/CMR/CMP/CMY via Mesh.integrate; physics derivation and sign conventions are well-documented and backed by closed-body analytical tests.
  • infer.py: model/dataset-agnostic inference loop (reuses build_dataloaders, collate, MetricCalculator, and ForceContext); outputs one .pdmsh per sample with pred_<field> / true_<field>, metrics in training space, and optional force summaries.
  • Refactor (datasets.py, utils.py): build_dataloaders, recursive_to_device, get_autocast_context, make_jsonl_logger, and resolve_dict moved from train.py to shared modules; train.py delegates to these with no logic change.
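For readers unfamiliar with the force integration summarized above, here is a minimal NumPy sketch of the idea. It does not use `Mesh.integrate` or the actual `forces.py` API; the function name `integrate_force_coefficients` and its signature are hypothetical. It assumes triangular cells with outward normals (counter-clockwise vertex order), a per-cell scalar Cp, and a per-cell vector Cf.

```python
import numpy as np


def integrate_force_coefficients(points, cells, cp, cf, a_ref):
    """Sum (-Cp * n + Cf) * dA over triangles and divide by the reference area.

    Assumes outward normals n, cp of shape (n_cells,), cf of shape (n_cells, 3).
    """
    p0, p1, p2 = (points[cells[:, i]] for i in range(3))
    # Half the edge cross product is the area-weighted normal, n * dA.
    area_vec = 0.5 * np.cross(p1 - p0, p2 - p0)
    area = np.linalg.norm(area_vec, axis=1)
    # Pressure pushes against the outward normal; shear acts along Cf.
    traction = -cp[:, None] * area_vec + cf * area[:, None]
    return traction.sum(axis=0) / a_ref


# Unit tetrahedron with outward-facing (counter-clockwise) triangles.
points = np.array([[0.0, 0, 0], [1, 0, 0], [0, 1, 0], [0, 0, 1]])
cells = np.array([[0, 2, 1], [0, 1, 3], [0, 3, 2], [1, 2, 3]])

# Closed-body check: uniform Cp with zero shear should give no net force.
closed_body = integrate_force_coefficients(
    points, cells, cp=np.ones(4), cf=np.zeros((4, 3)), a_ref=1.0
)
```

The closed-body case works because the area-weighted normals of any closed surface sum to zero, which is the same kind of analytical identity the summary says the tests rely on. Drag, lift, and side-force coefficients would then be projections of the returned vector onto the chosen frame axes.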

Important Files Changed

| Filename | Overview |
| --- | --- |
| examples/cfd/external_aerodynamics/unified_external_aero_recipe/src/forces.py | New module implementing integrated aerodynamic force/moment coefficients (CD/CL/CS/CMR/CMP/CMY) via surface traction integration; physics looks correct, well-documented with unit tests, minor missing guard for U_inf key. |
| examples/cfd/external_aerodynamics/unified_external_aero_recipe/src/infer.py | New inference companion to train.py; reuses build_dataloaders, collate, MetricCalculator, and ForceContext cleanly; distributed all-reduce for metrics is correct but the force accumulator all-reduce has a latent tensor-size-mismatch risk in edge cases. |
| examples/cfd/external_aerodynamics/unified_external_aero_recipe/src/datasets.py | build_dataloaders and related helpers moved from train.py to datasets.py to enable shared use by infer.py; logic is identical to the original, refactor looks clean. |
| examples/cfd/external_aerodynamics/unified_external_aero_recipe/src/train.py | Significantly simplified by delegating dataloader construction, recursive device movement, autocast, and JSONL logging to shared utilities; no logic changes, only removals of code now in utils.py/datasets.py. |
| examples/cfd/external_aerodynamics/unified_external_aero_recipe/src/utils.py | Added shared utilities extracted from train.py: get_autocast_context, recursive_to_device, make_jsonl_logger, resolve_dict, and the Precision type alias; implementations are identical to the originals. |
| examples/cfd/external_aerodynamics/unified_external_aero_recipe/tests/test_forces.py | Good coverage: closed-surface pressure identity, uniform shear drag, reference-area scaling, orthonormal frame, degenerate-up handling, field identification, ForceContext config variants, and accumulator means/MAE. |
| examples/cfd/external_aerodynamics/unified_external_aero_recipe/tests/test_infer.py | Tests cover field-type resolution, redimensionalization, _to_pointwise, sample-id derivation, checkpoint-path resolution, and the DomainMesh write/round-trip without needing a real model or dataset. |
| examples/cfd/external_aerodynamics/unified_external_aero_recipe/conf/infer.yaml | Well-documented YAML with sensible defaults; shares base.yaml, aliases infer_split to val_split cleanly, and documents the moment_center / CenterMesh caveat. |
| examples/cfd/external_aerodynamics/unified_external_aero_recipe/src/metrics.py | Added resolve_metrics helper shared by train.py and infer.py; return type annotation says list[MetricName] but OmegaConf.to_container returns Any, so invalid names aren't caught until MetricCalculator lookup. |
| examples/cfd/external_aerodynamics/unified_external_aero_recipe/tests/test_train_helpers.py | Minor updates to reflect functions moved from train.py to datasets.py/utils.py; test logic unchanged. |
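On the metrics.py note above, one possible way to surface invalid metric names at resolution time is sketched below. This is only an illustration: the `MetricName` values shown are placeholders rather than the recipe's actual metric names, `validate_metric_names` is a hypothetical helper, and it takes a plain list (for example, the result of `OmegaConf.to_container`) so the sketch has no OmegaConf dependency.

```python
from typing import Literal, get_args

# Placeholder names; the recipe's real MetricName alias may differ.
MetricName = Literal["l1", "l2", "mae"]


def validate_metric_names(names: list) -> list[MetricName]:
    """Reject unknown metric names up front rather than at calculator lookup."""
    valid = set(get_args(MetricName))
    unknown = [name for name in names if name not in valid]
    if unknown:
        raise ValueError(
            f"Unknown metrics {unknown}; expected a subset of {sorted(valid)}"
        )
    return list(names)


resolved = validate_metric_names(["l1", "mae"])
```

With a check like this, a typo in the YAML would fail when the config is resolved instead of partway into a run.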

Comments Outside Diff (2)

  1. examples/cfd/external_aerodynamics/unified_external_aero_recipe/src/infer.py, line 936-940 (link)

    P2 Distributed all-reduce tensor size mismatch when force accumulator is empty on some ranks

    _allreduce_sums packs {key: running_sum} entries into a single tensor whose length equals len(totals) + 1. If force_ctx is active (the dataset YAML declares Cp/Cf fields) but a rank's sampler shard happens to contain only non-surface samples (e.g., the per-rank shape check interior.points.shape[0] != vehicle.n_cells returns None for every sample), that rank's force_acc.totals remains an empty dict (len == 0) while other ranks carry 18 keys. dist.all_reduce requires all participating tensors to have the same size, so this would produce a NCCL error or hang. A simple fix is to pre-populate force_acc.totals with zeros for all 18 keys at construction time so the size is always consistent.
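The suggested fix could look something like the sketch below. It is written in plain Python without `torch.distributed`, so it only demonstrates that every rank packs a buffer of the same length. The class name `ForceAccumulator`, the key naming scheme, and the 6 x 3 key layout are assumptions made for illustration; the review only states that there are 18 keys.

```python
# Hypothetical key layout: 6 coefficients x 3 quantities = 18 keys.
COEFFS = ("CD", "CL", "CS", "CMR", "CMP", "CMY")
KINDS = ("pred", "true", "abs_err")
FORCE_KEYS = tuple(f"{kind}_{coeff}" for kind in KINDS for coeff in COEFFS)


class ForceAccumulator:
    """Running sums with a fixed key set, so packed buffers match across ranks."""

    def __init__(self):
        # Pre-populate with zeros: an idle rank still contributes 18 slots.
        self.totals = {key: 0.0 for key in FORCE_KEYS}
        self.count = 0

    def update(self, values):
        for key, value in values.items():
            self.totals[key] += float(value)
        self.count += 1

    def pack(self):
        """Fixed-order buffer: one slot per key plus the sample count."""
        return [self.totals[key] for key in FORCE_KEYS] + [float(self.count)]


empty_rank = ForceAccumulator()  # a shard with no surface samples
busy_rank = ForceAccumulator()
busy_rank.update({"pred_CD": 0.3, "true_CD": 0.28, "abs_err_CD": 0.02})
```

Because `pack()` iterates over the fixed `FORCE_KEYS` tuple rather than over whatever keys happen to be present, both ranks produce equal-length buffers, and the zero-count rank simply contributes nothing to the reduced sums.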

  2. examples/cfd/external_aerodynamics/unified_external_aero_recipe/src/infer.py, line 900-912 (link)

    P2 Silent data loss in distributed inference — no runtime warning

    attach_and_save is guarded by if is_rank0, so in multi-rank execution (e.g., torchrun --nproc_per_node=N) only rank 0's sampler shard ever gets written to disk. The other N−1 shards are processed (metrics and force coefficients are all-reduced correctly) but their predictions are silently dropped. The module docstring mentions this limitation, but there is no runtime logger.warning to alert users who accidentally run with torchrun expecting a full prediction set.
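A runtime warning along these lines might address the concern. This is a sketch using only the standard `logging` module; the function name `warn_if_predictions_dropped` and the logger name are hypothetical, and in the real script `world_size` and `is_rank0` would come from whatever distributed manager infer.py already uses.

```python
import logging

logger = logging.getLogger("infer_sketch")


def warn_if_predictions_dropped(world_size: int, is_rank0: bool) -> bool:
    """Warn once, on rank 0, that other ranks' predictions are not written."""
    if world_size > 1 and is_rank0:
        logger.warning(
            "Running with %d ranks: only rank 0's sampler shard is written to "
            "disk; metrics and force coefficients are still reduced across "
            "all ranks.",
            world_size,
        )
        return True
    return False


# Example: a 4-rank launch triggers the warning on rank 0 only.
warned = warn_if_predictions_dropped(world_size=4, is_rank0=True)
```

Calling this once before the inference loop keeps the existing rank-0 write guard unchanged while making the limitation visible in the logs.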

Reviews (1): Last reviewed commit: "interrogate fixes"

…dynamics

- Modified the file patterns in `drivaer_ml_surface.yaml`, `drivaer_ml_volume.yaml`, and `highlift_volume.yaml` to specify directory structures, ensuring correct data loading for the respective readers.
- Adjusted patterns to include specific prefixes for better organization and clarity in dataset management.
…stency

- Changed file patterns in `README.md`, `shift_suv_estate_surface.yaml`, and `shift_suv_fastback_surface.yaml` to include a `run_*` prefix, ensuring uniformity in data loading paths across the unified external aerodynamics recipe.
- Adjusted the pattern in `merge_global_data.py` to match the updated structure for better integration with the data pipeline.
@peterdsharpe

Collaborator Author

/blossom-ci

@peterdsharpe

Collaborator Author

/ok to test 0101ef0

Comment thread examples/cfd/external_aerodynamics/unified_external_aero_recipe/conf/infer.yaml Outdated
@peterdsharpe

Collaborator Author

/blossom-ci

@peterdsharpe

Collaborator Author

/ok to test b290b3a

@peterdsharpe

Collaborator Author

/blossom-ci


2 participants