A new X-MeshGraphNet example for reservoir simulation. #1186

tonishi-nv · 2025-10-26T22:40:18Z

PhysicsNeMo Pull Request

Description

Checklist

I am familiar with the Contributing Guidelines.
New or existing tests cover these changes.
The documentation is up to date with these changes.
The CHANGELOG.md is up to date with these changes.
An issue is linked to this pull request.

Dependencies

Signed-off-by: Tsubasa Onishi <[email protected]>

greptile-apps

Greptile Overview

Greptile Summary

This PR adds a new reservoir simulation example to PhysicsNeMo using X-MeshGraphNet (XMGN), located in examples/reservoir_simulation/. The implementation provides a complete workflow for training graph neural networks on subsurface flow data from ECLIPSE/OPM/IX simulators. The example includes: (1) sim_utils/ package for reading binary simulation files and constructing graph representations from reservoir grids, (2) preprocessing pipeline that converts raw simulator outputs to partitioned PyTorch Geometric graphs with train/val/test splits, (3) training infrastructure with distributed support, MLflow tracking, and early stopping, and (4) autoregressive inference that predicts multiple timesteps and exports results in both HDF5 and GRDECL formats. The code integrates with PhysicsNeMo's distributed training patterns and checkpoint management while adding domain-specific features like transmissibility-based edge weights, well completion handling, and nonlinear scaling for physical variables. The architecture mirrors existing PhysicsNeMo examples (Hydra configuration, modular utilities) while addressing reservoir simulation requirements such as handling inactive cells, temporal sequences, and industry-standard file formats.

Important Files Changed

Filename	Score	Overview
`examples/reservoir_simulation/xmgn/src/train.py`	2/5	Adds complete training pipeline but has critical issues: MLflow run never closed, gradient accumulation logic inside partition loop instead of outside, and potential race conditions in distributed trash cleanup
`examples/reservoir_simulation/xmgn/requirements.txt`	2/5	Adds dependencies but contains version mismatch between torch (>=2.4.0) and PyG wheel instructions (torch-2.9.0), plus unused torchvision/torchaudio dependencies
`examples/reservoir_simulation/xmgn/src/data/__init__.py`	2/5	Creates data package API but uses absolute import `from sim_utils import ...` that will fail since sim_utils is a sibling directory, should use relative import `from ...sim_utils import ...`
`examples/reservoir_simulation/xmgn/src/data/dataloader.py`	3/5	Implements graph dataset and partitioning but missing null checks for edge_attr indexing and empty tensor validation before shape access could cause runtime crashes
`examples/reservoir_simulation/sim_utils/grid.py`	3/5	Adds Grid class for reservoir connectivity but remapping logic uses slow Python loops, creates invalid (0,0) edges in edge cases, and has debug prints that should be converted to logging
`examples/reservoir_simulation/xmgn/src/preprocessor.py`	3.5/5	Implements METIS-based graph partitioning with fallback, but swallows exceptions during partition organization and shadows loop variable `i` in nested loops
`examples/reservoir_simulation/xmgn/src/utils/mlflow_utils.py`	3/5	Provides MLflow logging utilities but contains bare except clause catching system exits, unsafe attribute access leaving dataset_name unbound, and uses deprecated datetime.utcnow()
`examples/reservoir_simulation/sim_utils/ecl_reader.py`	4/5	Adds ECLIPSE binary file parser with extensive functionality but loads entire files into memory and uses bare except clauses that could mask critical errors
`examples/reservoir_simulation/sim_utils/well.py`	4/5	Introduces well/completion data structures with minor documentation inconsistencies (missing `stat` parameter, references nonexistent `set_out_flow_rate` method)
`examples/reservoir_simulation/xmgn/src/data/graph_builder.py`	4/5	Comprehensive graph builder converting simulator outputs to PyTorch Geometric format with extensive validation, minor unused imports and potentially strict 20% failure threshold
`examples/reservoir_simulation/xmgn/src/inference.py`	4/5	Autoregressive inference script with checkpoint loading and dual output formats (HDF5/GRDECL), contains bare except clause catching all exceptions including keyboard interrupts
`examples/reservoir_simulation/xmgn/src/utils/path_utils.py`	4/5	Path management utilities with inconsistent validation - checks runspec.job_name but not dataset.sim_dir before access which could raise AttributeError
`examples/reservoir_simulation/xmgn/conf/config.yaml`	4/5	Main Hydra configuration for XMGN with comprehensive settings, minor trailing whitespace issue and should validate array length constraints at runtime
`examples/reservoir_simulation/xmgn/conf/config_norne.yaml`	4/5	Norne dataset-specific configuration with relative paths assuming specific directory structure and skip_graphs flag requiring pre-existing files
`examples/reservoir_simulation/xmgn/README.md`	4/5	Documents complete workflow from data prep to inference with minor config path discrepancy (conf/config.yaml vs config/config.yaml) and incomplete Option 2 section
`examples/reservoir_simulation/sim_utils/__init__.py`	4/5	Creates sim_utils package API with clean exports, assumes existence of sibling modules (ecl_reader, grid, well)
`examples/reservoir_simulation/sim_utils/README.md`	5/5	Well-structured documentation for sim_utils package with practical code examples and clear usage patterns
`CHANGELOG.md`	5/5	Adds single entry documenting new reservoir simulation example in correct format and location
`examples/reservoir_simulation/xmgn/.gitignore`	5/5	Standard MLflow mlruns/ exclusion, correct and necessary
`examples/reservoir_simulation/xmgn/src/__init__.py`	5/5	Empty package marker with proper license header, standard Python packaging
`examples/reservoir_simulation/xmgn/src/utils/__init__.py`	5/5	Empty utils package initialization with license header and empty `__all__` list

Confidence score: 2/5

This PR requires careful review due to multiple critical issues in core training and data loading logic that could cause runtime failures or incorrect behavior
Score reflects critical bugs in train.py gradient accumulation, import path errors in data/init.py, version mismatches in requirements.txt, missing null checks in dataloader, and bare except clauses that could mask errors; additionally concerns about memory usage patterns (loading entire files), potential race conditions in distributed cleanup, and deprecated API usage
Pay close attention to examples/reservoir_simulation/xmgn/src/train.py (gradient accumulation logic, MLflow lifecycle), examples/reservoir_simulation/xmgn/src/data/__init__.py (import paths), examples/reservoir_simulation/xmgn/requirements.txt (version alignment), examples/reservoir_simulation/xmgn/src/data/dataloader.py (null safety), and examples/reservoir_simulation/xmgn/src/utils/mlflow_utils.py (exception handling)

Sequence Diagram

sequenceDiagram
    participant User
    participant Preprocessor
    participant EclReader
    participant Grid
    participant GraphBuilder
    participant Trainer
    participant Model
    participant Inference

    User->>Preprocessor: python src/preprocessor.py
    Preprocessor->>Preprocessor: validate_config()
    Preprocessor->>Preprocessor: check_existing_data()
    
    Preprocessor->>GraphBuilder: ReservoirGraphBuilder(cfg)
    GraphBuilder->>EclReader: find simulation files (.DATA/.AFI)
    
    loop For each simulation case
        GraphBuilder->>EclReader: read_init(static properties)
        GraphBuilder->>EclReader: read_egrid(grid geometry)
        GraphBuilder->>Grid: Grid(init_data, egrid_data)
        Grid->>Grid: compute cell coordinates
        Grid->>Grid: compute connections & transmissibilities
        GraphBuilder->>EclReader: read_restart(dynamic properties)
        GraphBuilder->>GraphBuilder: build_graph_from_simulation_data()
        GraphBuilder->>GraphBuilder: apply_nonlinear_scaling()
        GraphBuilder->>Preprocessor: return graph
    end
    
    Preprocessor->>Preprocessor: save_graphs()
    Preprocessor->>Preprocessor: create_partitions_from_graphs()
    Preprocessor->>Preprocessor: split_samples_by_case()
    Preprocessor->>Preprocessor: organize_partitions_by_split()
    Preprocessor->>Preprocessor: compute_global_statistics()
    Preprocessor->>Preprocessor: save_dataset_metadata()
    
    User->>Trainer: torchrun src/train.py
    Trainer->>Trainer: InitializeLoggers(cfg)
    Trainer->>Trainer: initialize_dataloaders()
    Trainer->>Trainer: initialize_model(MeshGraphNet)
    Trainer->>Trainer: initialize_optimizer(AdamW)
    
    loop For each epoch
        loop For each batch
            Trainer->>Model: forward(x, edge_attr, graph)
            Model->>Trainer: predictions
            Trainer->>Trainer: compute_weighted_loss()
            Trainer->>Trainer: backward()
            Trainer->>Trainer: optimizer.step()
        end
        
        alt Validation epoch
            Trainer->>Trainer: validate_epoch()
            Trainer->>Trainer: denormalize_predictions()
            Trainer->>Trainer: compute_per_variable_losses()
            Trainer->>Trainer: check_early_stopping()
            Trainer->>Trainer: save_checkpoint()
        end
    end
    
    User->>Inference: python src/inference.py
    Inference->>Inference: load best checkpoint
    Inference->>Inference: load test dataset
    
    loop For each test case
        loop Autoregressive timesteps
            Inference->>Model: forward(x, edge_attr, graph)
            Model->>Inference: predictions
            Inference->>Inference: update_node_features_with_predictions()
            Inference->>Inference: denormalize_predictions()
        end
    end
    
    Inference->>Inference: save_case_results_hdf5()
    Inference->>Inference: run_post()
    Inference->>Inference: generate GRDECL files
    Inference->>User: inference results & visualizations

_{21 files reviewed, 39 comments}

_{Edit Code Review Agent Settings | Greptile}

examples/reservoir_simulation/sim_utils/well.py

examples/reservoir_simulation/xmgn/src/data/__init__.py

examples/reservoir_simulation/xmgn/src/utils.py

examples/reservoir_simulation/xmgn/src/train.py

Signed-off-by: Tsubasa Onishi <[email protected]>

greptile-apps

Greptile Overview

Greptile Summary

This review covers only the changes made since the last review, not the entire PR. The current diff contains documentation-only fixes to the well.py file within the reservoir simulation utilities. Three docstring errors were corrected: (1) removed a reference to a non-existent id attribute in the Completion class, (2) fixed a method name typo from set_out_flow_rate to the correct set_flow_rate, and (3) added the missing stat parameter to the Well.__init__ docstring. These changes align the API documentation with the actual implementation and address the three syntax issues flagged in the previous review (lines 33-42, 106-110).

Important Files Changed

Filename	Score	Overview
examples/reservoir_simulation/sim_utils/well.py	5/5	Fixed three docstring inconsistencies: removed non-existent `id` reference, corrected method name typo, and added missing `stat` parameter documentation

Confidence score: 5/5

This PR is safe to merge with minimal risk
Score reflects documentation-only changes with no functional code modifications—these fixes correct previously flagged syntax issues in docstrings without touching any logic or introducing new code paths
No files require special attention

_{1 file reviewed, no comments}

_{Edit Code Review Agent Settings | Greptile}

Signed-off-by: Tsubasa Onishi <[email protected]>

greptile-apps

Greptile Overview

Greptile Summary

This review covers only the changes made since the last review, not the entire PR. The latest update addresses two documentation and defensive programming issues identified in previous reviews. The first change corrects a path inconsistency in the README where the configuration file path was referenced as both config/config.yaml and conf/config.yaml in different locations—the fix standardizes all references to conf/config.yaml to match the actual project directory structure. The second change adds validation for the cfg.dataset.sim_dir configuration attribute in path_utils.py, mirroring the existing pattern for cfg.runspec.job_name validation. This prevents potential AttributeErrors when the configuration is incomplete, ensuring the utility function fails fast with a clear error message rather than propagating an obscure exception during path construction.

Important Files Changed

Filename	Score	Overview
examples/reservoir_simulation/xmgn/README.md	5/5	Documentation fix correcting configuration file path from `config/config.yaml` to `conf/config.yaml` for internal consistency
examples/reservoir_simulation/xmgn/src/utils/path_utils.py	5/5	Added defensive validation check for `cfg.dataset.sim_dir` before access, preventing AttributeError on incomplete configuration

Confidence score: 5/5

This PR is safe to merge with minimal risk—both changes are minor defensive improvements with no functional side effects
Score reflects that these are documentation and validation fixes addressing specific review feedback, with no new logic, API changes, or complex interactions that could introduce bugs
No files require special attention—both changes are straightforward corrections following established patterns in the codebase

_{2 files reviewed, no comments}

_{Edit Code Review Agent Settings | Greptile}

Signed-off-by: Tsubasa Onishi <[email protected]>

greptile-apps

Greptile Overview

Greptile Summary

This review covers only the changes made since the last review. The current change is a trivial formatting fix in examples/reservoir_simulation/xmgn/conf/config.yaml, removing trailing whitespace after the silu activation function specification on line 88. This addresses a linting violation previously flagged and ensures compliance with the repository's pre-commit hooks (which include ruff format and markdownlint). The change has zero functional impact—it simply removes invisible whitespace characters that violated the project's formatting standards.

Important Files Changed

Filename	Score	Overview
examples/reservoir_simulation/xmgn/conf/config.yaml	5/5	Removed trailing whitespace after 'silu' on line 88 (pure formatting fix)

Confidence score: 5/5

This PR is safe to merge with minimal risk—it contains only a trivial whitespace removal
Score reflects the non-functional nature of the change: removing invisible trailing spaces cannot introduce bugs or break existing functionality
No files require special attention; this is a single-line formatting correction with no code logic changes

_{1 file reviewed, no comments}

_{Edit Code Review Agent Settings | Greptile}

Signed-off-by: Tsubasa Onishi <[email protected]>

greptile-apps

Greptile Overview

Greptile Summary

This review covers only the changes made since the last review, not the entire PR. The latest change updates the requirements.txt file to adjust PyTorch and PyTorch Geometric (PyG) extension versions and versioning strategy.

What changed: The torch version constraint was modified from an implicit dependency on torch 2.9.0+cu128 to an explicit minimum version torch>=2.4.0. The PyG extension URL comment was updated from torch 2.9.0+cu128 to torch 2.4.0+cu121 wheels, and minimum version constraints (>=) were added to torch-scatter, torch-sparse, torch-cluster, and torch-spline-conv packages. This change aligns the CUDA toolkit version from CUDA 12.8 to CUDA 12.1.

Why this matters: The developer is standardizing on torch 2.4.0+cu121 as the tested baseline, likely because torch 2.9.0 is not yet widely available or because PyG wheel compatibility is more reliable for 2.4.0. The change from exact pinning to minimum version constraints (>=) allows patch and minor updates while ensuring a tested baseline. However, this introduces a version coordination risk between the torch requirement and the PyG wheel URL.

How it fits: This requirements file defines the environment for the new X-MeshGraphNet reservoir simulation example (examples/reservoir_simulation/xmgn/). The dependency versions must align with the PhysicsNeMo core library's torch requirements (defined in the root pyproject.toml) and with available pre-built PyG wheels to ensure smooth installation.

Important Files Changed

Filename	Score	Overview
examples/reservoir_simulation/xmgn/requirements.txt	3/5	Updated torch from 2.9.0+cu128 to >=2.4.0, changed PyG wheel URL to torch-2.4.0+cu121, and added minimum version constraints to PyG extensions

Confidence score: 3/5

This change reduces installation brittleness by moving from torch2.9.0 (potentially unavailable) to the more stable 2.4.0+cu121 baseline, but introduces version coordination risks between the torch constraint and PyG wheel URL.
Score reflects a critical version mismatch issue: the torch requirement torch>=2.4.0 (line 5) permits any version ≥2.4.0, but the PyG install URL comment (line 17) points to wheels built specifically for torch 2.4.0+cu121. If users install torch 2.5+ or 2.9+, pip will fail to find matching PyG wheels and attempt error-prone source builds. Additionally, the >= constraints on PyG extensions lack upper bounds, risking future incompatibilities.
Pay close attention to examples/reservoir_simulation/xmgn/requirements.txt lines 5 and 17-21: Consider pinning torch to ==2.4.0 or explicitly documenting that users must install torch 2.4.0 before running the PyG install command. The version mismatch between the flexible torch constraint and the fixed PyG wheel URL will cause installation failures in many environments.

_{1 file reviewed, no comments}

_{Edit Code Review Agent Settings | Greptile}

Signed-off-by: Tsubasa Onishi <[email protected]>

greptile-apps

Greptile Overview

Greptile Summary

This review covers only the changes made since the last review, not the entire PR. Two files were modified: a minor exception handling improvement and a dependency version update. In examples/reservoir_simulation/xmgn/src/utils/mlflow_utils.py, a bare except: clause was changed to except Exception: on line 80, following Python best practices to avoid catching system-level signals like KeyboardInterrupt. This utility flattens Hydra configurations for MLflow logging in the X-MeshGraphNet reservoir simulation example. In examples/reservoir_simulation/xmgn/requirements.txt, minimum version constraints were added to four PyG extension packages (torch-scatter>=2.1.2, torch-sparse>=0.6.18, torch-cluster>=1.6.3, torch-spline-conv>=1.2.2) that previously had no version pins. These packages depend on PyG 2.6.0 and torch 2.4.0+cu121, and the version constraints improve reproducibility by preventing installation of older incompatible versions. Both changes are small incremental improvements to code quality and dependency management.

Important Files Changed

Filename	Score	Overview
examples/reservoir_simulation/xmgn/src/utils/mlflow_utils.py	5/5	Changed bare except to `except Exception:` on line 80 to allow system signals to propagate
examples/reservoir_simulation/xmgn/requirements.txt	4/5	Added minimum version constraints to four PyG extension packages (torch-scatter, torch-sparse, torch-cluster, torch-spline-conv)

Confidence score: 4/5

This PR contains low-risk quality improvements with one potential installation issue that requires attention
Score reflects that the exception handling fix is safe and correct, but the requirements.txt change creates a potential version mismatch between the hardcoded PyG wheel URL (torch-2.4.0+cu121.html) and the newly specified minimum versions which may not exist at that URL, potentially causing pip to fall back to source builds
Pay close attention to examples/reservoir_simulation/xmgn/requirements.txt - verify that the minimum versions specified (torch-scatter>=2.1.2, torch-sparse>=0.6.18, torch-cluster>=1.6.3, torch-spline-conv>=1.2.2) are actually available at the PyG wheel repository URL mentioned in the installation comment, or update the comment to reflect the correct installation procedure

_{2 files reviewed, no comments}

_{Edit Code Review Agent Settings | Greptile}

Signed-off-by: Tsubasa Onishi <[email protected]>

greptile-apps

Greptile Overview

Greptile Summary

This review covers only the changes made since the last review, not the entire PR. The most recent update improves error handling in mlflow_utils.py by addressing variable scoping issues and deprecation warnings that were identified in previous reviews.

The changes pre-initialize job_name and dataset_name with fallback values before the try block, preventing UnboundLocalError exceptions when the exception handler needs to access these variables. Additionally, nested attribute access patterns are replaced with safer getattr() calls that provide defaults when config sections are missing. The deprecated datetime.utcnow() call is updated to datetime.now(timezone.utc) for Python 3.12+ compatibility. These defensive programming patterns align with PhysicsNeMo's approach of graceful degradation—the MLflow logging should capture as much metadata as possible without failing the entire training run if configuration fields are malformed or absent.

Important Files Changed

Filename	Score	Overview
examples/reservoir_simulation/xmgn/src/utils/mlflow_utils.py	5/5	Pre-initializes fallback variables and replaces direct attribute access with safer `getattr` patterns to prevent `UnboundLocalError` and `AttributeError` exceptions; updates deprecated datetime API

Confidence score: 5/5

This PR is safe to merge with minimal risk
All previous critical issues in this file have been addressed: variable scoping prevents UnboundLocalError, defensive attribute access prevents AttributeError, and the deprecated datetime API is updated for forward compatibility
No files require special attention

_{1 file reviewed, no comments}

_{Edit Code Review Agent Settings | Greptile}

Signed-off-by: Tsubasa Onishi <[email protected]>

greptile-apps

Greptile Overview

Greptile Summary

This review covers only the changes made since the last review, not the entire PR. Two minor fixes were applied based on previous code review feedback: one addresses a Python 3.12 deprecation in MLflow timestamp logging, and the other replaces a bare except: clause with a more specific exception handler in the inference script.

Changes:

examples/reservoir_simulation/xmgn/src/utils/mlflow_utils.py: Replaced deprecated datetime.utcnow() with datetime.now(timezone.utc) for timezone-aware timestamp generation in MLflow logging
examples/reservoir_simulation/xmgn/src/inference.py: Changed bare except: on line 145 to except Exception: to prevent catching system-level exceptions like KeyboardInterrupt and SystemExit

Both changes follow Python best practices and align with patterns established elsewhere in the physicsnemo codebase. The MLflow fix future-proofs the code against Python 3.12+ where utcnow() will be removed, while the inference fix ensures proper exception handling without blocking graceful program termination.

Important Files Changed

Filename	Score	Overview
examples/reservoir_simulation/xmgn/src/utils/mlflow_utils.py	5/5	Replaced deprecated `datetime.utcnow()` with timezone-aware `datetime.now(timezone.utc)` for MLflow timestamp logging
examples/reservoir_simulation/xmgn/src/inference.py	5/5	Fixed bare `except:` clause to `except Exception:` to allow system interrupts to propagate correctly

Confidence score: 5/5

This PR is safe to merge with minimal risk
Score reflects simple, well-understood deprecation and best-practice fixes with no functional impact or complex logic changes
No files require special attention

_{2 files reviewed, no comments}

_{Edit Code Review Agent Settings | Greptile}

Signed-off-by: Tsubasa Onishi <[email protected]>

greptile-apps

Greptile Overview

Greptile Summary

This review covers only the changes made since the last review, not the entire PR. The most recent commit removes two unused return values from helper methods in examples/reservoir_simulation/xmgn/src/inference.py: the target_vars variable that was being discarded by all callers, and a redundant case_name extraction that was immediately re-computed at the call site. This is a straightforward code cleanup that simplifies tuple unpacking without changing behavior.

Important Files Changed

Filename	Score	Overview
examples/reservoir_simulation/xmgn/src/inference.py	5/5	Removed unused return values (`target_vars` and `case_name`) from two private helper methods to eliminate dead code

Confidence score: 5/5

This PR is safe to merge with minimal risk
The changes are pure refactoring—removing return values that were systematically ignored by all callers (evidenced by _ unpacking at lines 271 and 791), with no logic changes or new functionality
No files require special attention; this is straightforward dead-code elimination

_{1 file reviewed, no comments}

_{Edit Code Review Agent Settings | Greptile}

Signed-off-by: Tsubasa Onishi <[email protected]>

greptile-apps

Greptile Overview

Greptile Summary

This review covers only the changes made since the last review, not the entire PR.

A single targeted fix in the ECLIPSE binary file reader replaces a bare except: clause with explicit exception handling. The change on line 584 of ecl_reader.py now catches only struct.error and IndexError - the two specific exceptions that can occur when unpacking binary record headers during UNRST file parsing. This follows up on a previous review comment about exception-handling hygiene.

The _read_unrst method reads ECLIPSE reservoir simulation restart files by parsing binary records. Each record header is validated using struct.unpack to read a 4-byte integer. The new exception specification ensures that if malformed data or buffer boundary issues occur, the code silently continues pattern matching (as intended), but critical exceptions like KeyboardInterrupt and SystemExit will now propagate correctly. This is a standard practice for production code that processes potentially corrupted binary data streams.

Important Files Changed

Filename	Score	Overview
examples/reservoir_simulation/sim_utils/ecl_reader.py	5/5	Replaced bare except with specific `struct.error` and `IndexError` handling in binary file parser

Confidence score: 5/5

This is a safe, localized fix that improves exception handling without changing functionality
The change is minimal (single line), well-scoped, and follows Python best practices for exception handling
No files require special attention - this is a straightforward code quality improvement

_{1 file reviewed, no comments}

_{Edit Code Review Agent Settings | Greptile}

Signed-off-by: Tsubasa Onishi <[email protected]>

greptile-apps

Greptile Overview

Greptile Summary

This review covers only the changes made since the last review, not the entire PR. The latest change applies explicit log-level guards to all logging calls in the _read_unrst method of examples/reservoir_simulation/sim_utils/ecl_reader.py. Each logging.info() and logging.debug() statement is now wrapped in an if logging.getEffectiveLevel() <= logging.INFO/DEBUG: check to prevent string formatting overhead when the logger is set to WARNING or ERROR levels. This addresses a performance concern identified in earlier reviews for a data-loading hotpath that processes thousands of timesteps in ECLIPSE reservoir simulation restart files. While Python's logging module performs internal level checks, the explicit guards prevent f-string evaluation before the check, which can be meaningful when processing large datasets in production environments where verbose logging is disabled.

Important Files Changed

Filename	Score	Overview
examples/reservoir_simulation/sim_utils/ecl_reader.py	4/5	Added explicit log-level checks around every logging statement in `_read_unrst` to avoid string formatting overhead when verbose logging is disabled

Confidence score: 4/5

This PR is safe to merge with minimal risk; the changes are purely performance optimizations that do not alter functionality
Score reflects that while the pattern introduces code duplication and reduces readability, it does not introduce bugs or breaking changes; the optimization is valid though the benefit may be marginal since Python's logging module already performs level checks internally
Pay close attention to examples/reservoir_simulation/sim_utils/ecl_reader.py to verify the readability trade-off is acceptable for your team's coding standards, as the repeated guards make the method significantly more verbose

_{1 file reviewed, 2 comments}

_{Edit Code Review Agent Settings | Greptile}

examples/reservoir_simulation/sim_utils/ecl_reader.py

Signed-off-by: Tsubasa Onishi <[email protected]>

greptile-apps

Greptile Overview

Greptile Summary

This review covers only the changes made since the last review, not the entire PR. The latest change is a minor documentation refinement to the README file for the X-MeshGraphNet reservoir simulation example. The update removes the word "wells" from the visualization section caption (since the images don't actually show wells) and makes the permeability label more specific by adding "(PERMX)" to clarify that the visualization shows permeability in the X-direction. This matches petroleum engineering conventions where permeability is directional (PERMX, PERMY, PERMZ) and improves consistency between the caption, the image filename, and the table cell description. The change is purely cosmetic and does not affect any code, configurations, or functionality.

Important Files Changed

Filename	Score	Overview
examples/reservoir_simulation/xmgn/README.md	5/5	Documentation clarification removing misleading "wells" reference and adding "(PERMX)" to permeability label for precision

Confidence score: 5/5

This PR is safe to merge with minimal risk; it contains only documentation improvements with no functional changes
Score reflects the purely cosmetic nature of the changes—only caption text was modified to improve accuracy and consistency
No files require special attention; this is a straightforward documentation correction

_{1 file reviewed, no comments}

_{Edit Code Review Agent Settings | Greptile}

Signed-off-by: Tsubasa Onishi <[email protected]>

greptile-apps

Greptile Overview

Greptile Summary

This review covers only the changes made since the last review, not the entire PR. The current update addresses previously identified issues with minimal code changes across four files. The copyright year is updated to 2025 in three files (__init__.py, grid.py, train.py). The sim_utils package receives proper initialization with explicit imports, an __all__ declaration, version definition, and comprehensive documentation. A critical MLflow resource leak is fixed by removing redundant mlflow.start_run() call that conflicted with PhysicsNeMo's initialize_mlflow(). The fix_layernorm_compatibility() function is refactored to replace a lambda with a named function for better debuggability. These changes improve code maintainability and resolve integration issues with PhysicsNeMo's logging framework, while the bulk of the X-MeshGraphNet reservoir simulation example code remains unchanged from the previous review.

Important Files Changed

Filename	Score	Overview
examples/reservoir_simulation/sim_utils/init.py	5/5	Formalized package structure with explicit imports, `__all__` declaration, version definition, and comprehensive documentation; copyright updated to 2025
examples/reservoir_simulation/sim_utils/grid.py	5/5	Copyright year updated from2024 to 2025; no functional changes
examples/reservoir_simulation/xmgn/src/utils.py	3/5	Refactored lambda to named function `_register_load_state_dict_pre_hook` for improved debugging; potential method signature issue remains
examples/reservoir_simulation/xmgn/src/train.py	5/5	Fixed MLflow resource leak by removing redundant `mlflow.start_run()` call; copyright updated to2025

Confidence score: 4/5

This PR is mostly safe to merge but requires careful review of one file due to a potential runtime issue
Score reflects that three files are straightforward improvements (copyright updates, package initialization, resource leak fix) but utils.py still has a potentially problematic method signature in _register_load_state_dict_pre_hook that may cause runtime errors when loading checkpoints. The function takes (self, hook) parameters but PyTorch's register_load_state_dict_pre_hook typically expects different arguments. If checkpoint loading is not tested in this PR's scope, this could surface as a runtime error later.
Pay close attention to examples/reservoir_simulation/xmgn/src/utils.py line 38-44. Test checkpoint loading functionality to verify the compatibility shim works correctly, or confirm this is intentionally a no-op stub. The other three files are safe to merge without additional scrutiny.

_{4 files reviewed, no comments}

_{Edit Code Review Agent Settings | Greptile}

Signed-off-by: Tsubasa Onishi <[email protected]>

greptile-apps

Greptile Overview

Greptile Summary

This review covers only the changes made since the last review, not the entire PR. This update is a documentation-only change that improves the README file for the X-MeshGraphNet reservoir simulation example. The changes add a comprehensive "Overview" section that provides domain context (explaining reservoir simulation, subsurface flow, and why X-MGN is suitable for this application), enhance the setup and configuration instructions with clearer guidance on config file usage across different stages (preprocessing, training, inference), and standardize placeholder naming conventions from <your_experiment_name> to <your-experiment-name> throughout command examples. These improvements help users understand the example's purpose and reduce confusion during setup, particularly around which configuration file to use at each stage of the workflow. The changes align with the repository's documentation standards and improve accessibility for users unfamiliar with reservoir simulation domains.

Important Files Changed

Filename	Score	Overview
examples/reservoir_simulation/xmgn/README.md	5/5	Added overview section explaining reservoir simulation context, clarified config file usage across workflow stages, and standardized placeholder syntax from underscores to hyphens

Confidence score: 5/5

This PR is safe to merge with minimal risk as it contains only documentation improvements with no code or configuration changes
Score reflects that all changes are purely informational and cannot introduce functional bugs, runtime errors, or behavioral changes to the codebase
No files require special attention; the README improvements enhance clarity without any technical concerns

_{1 file reviewed, no comments}

_{Edit Code Review Agent Settings | Greptile}

Signed-off-by: Tsubasa Onishi <[email protected]>

greptile-apps

Greptile Overview

Greptile Summary

This review covers only the changes made since the last review, not the entire PR. This update modifies the requirements.txt for the X-MeshGraphNet reservoir simulation example by consolidating PyG extension installation and pinning core dependencies to exact versions. The --find-links directive is now placed at the top of the file, streamlining the installation process by eliminating the need for a separate two-step pip install command with the -f flag. Core dependencies (torch, torchaudio, torchvision) are now pinned to exact versions (2.4.0/0.19.0) instead of using minimum version constraints (>=), and nvidia-physicsnemo is pinned to version 1.2.0. Installation instruction comments have been removed, and a note about scipy being auto-installed has been added. This change fits within the broader PhysicsNeMo examples structure (see /examples directory documentation), which provides domain-specific demonstrations with Hydra configuration and turnkey setup processes, aiming to reduce friction for users setting up the reservoir simulation example.

Important Files Changed

Filename	Score	Overview
examples/reservoir_simulation/xmgn/requirements.txt	3/5	Consolidated PyG installation with `--find-links` directive and pinned all core dependencies to exact versions, removing flexible version constraints

Confidence score: 3/5

This PR is relatively safe to merge but has potential issues that should be addressed to improve flexibility and avoid future compatibility problems.
Score reflects concerns about strict version pinning preventing security patches and potential version mismatch between the wheel repository URL (torch-2.4.0+cu121) and the declared torch version (==2.4.0), which could cause installation failures if the wheel repository expects exactly torch 2.4.0+cu121 but the requirements install torch 2.4.0 without the CUDA suffix.
Pay close attention to examples/reservoir_simulation/xmgn/requirements.txt - verify that the --find-links URL (https://data.pyg.org/whl/torch-2.4.0+cu121.html) is compatible with the exact torch version (==2.4.0) specified, and consider whether strict pinning of all dependencies will cause issues with security patches and minor version updates.

_{1 file reviewed, no comments}

_{Edit Code Review Agent Settings | Greptile}

Signed-off-by: Tsubasa Onishi <[email protected]>

examples/reservoir_simulation/xmgn/.gitignore

examples/reservoir_simulation/sim_utils/README.md

examples/reservoir_simulation/xmgn/README.md

examples/reservoir_simulation/xmgn/conf/config.yaml

mnabian · 2025-10-30T00:58:31Z

docs/img/reservoir_simulation/xmgn/Norne/static/PORO.png

Is this used in README?

docs/img/reservoir_simulation/xmgn/2D_Q5SP/sample_103_timestep_003_PRESSURE.png

docs/img/reservoir_simulation/xmgn/2D_Q5SP/sample_103_timestep_003_SWAT.png

examples/reservoir_simulation/xmgn/requirements.txt

mnabian · 2025-10-30T01:01:26Z

examples/reservoir_simulation/sim_utils/__init__.py

+from .grid import Grid
+from .well import Well, Completion
+
+__all__ = ["EclReader", "Grid", "Well", "Completion"]


Please verify the code for all these utilities is written by NVIDIA. If the code has been taken from somewhere else, even if partially, we need to modify the license header.

I reused part of an open source code (my repository with Apache license: https://github.com/GEG-ETHZ/pyflowdiagnostics/tree/v0.1.0) and modified a bit. Do we need to add something to the license header?

Yes, in that case, please include the original license header first, followed by the NVIDIA license header. Add #ignore_header_test at the beginning of the file to skip the ci header test.

docs/img/reservoir_simulation/xmgn/Norne/static/PERMX.png

Signed-off-by: Tsubasa Onishi <[email protected]>

greptile-apps

Greptile Overview

Greptile Summary

This PR adds a comprehensive X-MeshGraphNet example for reservoir simulation surrogate modeling. The implementation includes Eclipse/IX binary file readers (sim_utils), parallel graph construction from simulation data, METIS-based graph partitioning for distributed training, and complete training/inference pipelines with MLflow tracking.

Key additions:

sim_utils package for reading Eclipse binary formats (INIT, EGRID, UNRST) with proper endianness handling
Parallel graph builder processing multiple simulation cases using multiprocessing with progress tracking
Preprocessor orchestrating graph generation, partitioning, data splits, and global statistics
Training pipeline with distributed support, gradient checkpointing, and early stopping
Inference with autoregressive rollout and GRDECL export for visualization
Comprehensive documentation with examples for 2D waterflood and Norne field datasets

The code demonstrates good software engineering practices with proper error handling, configuration validation, and memory-efficient processing. Previous review comments have identified several style and documentation issues that should be addressed, but no critical blocking issues were found in this review round.

Confidence Score: 4/5

This PR is safe to merge with minor style improvements recommended
The implementation is well-structured with comprehensive error handling, proper validation, and good documentation. Previous review rounds identified numerous style issues (bare excepts, logging inconsistencies, unused imports) and some logic concerns (import paths, potential race conditions, monkey-patching) that should be addressed but are not critical blockers. The core functionality appears sound with appropriate use of multiprocessing, memory management, and distributed training patterns. No new critical issues were identified in this review.
Previous comments highlight style improvements needed in train.py, inference.py, utils.py, grid.py, and well.py - primarily bare except clauses, logging consistency, and documentation fixes

Important Files Changed

File Analysis

Filename	Score	Overview
examples/reservoir_simulation/xmgn/src/data/graph_builder.py	4/5	New file implementing parallel graph construction from reservoir simulation data with multiprocessing, proper error handling, and efficient memory usage through immediate file saves
examples/reservoir_simulation/xmgn/src/preprocessor.py	4/5	New preprocessor orchestrating graph generation, partitioning, train/val/test splits, and global statistics computation with proper validation and user interaction
examples/reservoir_simulation/sim_utils/ecl_reader.py	4/5	New Eclipse binary file reader supporting INIT, EGRID, UNRST formats with endianness detection and comprehensive error handling

Sequence Diagram

sequenceDiagram
    participant User
    participant Preprocessor
    participant GraphBuilder
    participant EclReader
    participant Train
    participant Inference
    
    User->>Preprocessor: Run preprocessor.py
    Preprocessor->>Preprocessor: Validate config
    Preprocessor->>GraphBuilder: Create raw graphs
    
    loop For each simulation case
        GraphBuilder->>EclReader: Read INIT/EGRID/UNRST files
        EclReader-->>GraphBuilder: Grid data + dynamic vars
        GraphBuilder->>GraphBuilder: Build PyG graph per timestep
        GraphBuilder->>GraphBuilder: Save graph to disk
    end
    
    GraphBuilder-->>Preprocessor: List of graph files
    Preprocessor->>Preprocessor: Partition graphs (METIS)
    Preprocessor->>Preprocessor: Split train/val/test
    Preprocessor->>Preprocessor: Compute global statistics
    Preprocessor-->>User: Preprocessing complete
    
    User->>Train: Run train.py (distributed)
    Train->>Train: Load partitioned graphs
    Train->>Train: Train X-MGN model
    Train->>Train: Validate & checkpoint
    Train-->>User: Best model saved
    
    User->>Inference: Run inference.py
    Inference->>Inference: Load best checkpoint
    
    loop Autoregressive rollout
        Inference->>Inference: Predict next timestep
        Inference->>Inference: Use prediction as input
    end
    
    Inference->>Inference: Export HDF5 + GRDECL
    Inference-->>User: Results for visualization

_{6 files reviewed, no comments}

_{Edit Code Review Agent Settings | Greptile}

Signed-off-by: Tsubasa Onishi <[email protected]>

greptile-apps

Greptile Overview

Greptile Summary

This PR adds a comprehensive new example implementing X-MeshGraphNet (X-MGN) for reservoir simulation surrogate modeling, targeting the petroleum engineering domain. The implementation includes ~5000 lines of new code across data preprocessing, training, inference, and visualization components.

Key Components Added:

sim_utils module: Binary ECLIPSE/IX file readers (ecl_reader.py, grid.py) for industry-standard reservoir simulation data formats
Data pipeline: Graph construction from reservoir grids (graph_builder.py), partitioning for distributed training, and PyTorch Geometric dataloaders
Training infrastructure: Multi-GPU training with MLflow tracking, gradient scaling, and early stopping
Inference pipeline: Autoregressive rollout with HDF5 and GRDECL output formats for visualization in industry tools (Petrel, ResInsight)
Documentation: Comprehensive README with two example datasets (2D five-spot, Norne Field)

Implementation Quality:
The code demonstrates solid engineering practices with proper licensing headers, extensive documentation, and integration with PhysicsNeMo's distributed training infrastructure. The autoregressive inference implementation correctly handles temporal dependencies and partition boundaries (halo nodes).

Previous Review Coverage:
This PR has already undergone extensive review with 70+ comments addressing style issues (unused imports, logging inconsistencies), potential logical issues (gradient accumulation timing, edge index remapping), and security concerns (weights_only=False in torch.load, bare except clauses). Most issues are non-critical style/best-practice suggestions.

Critical Issues Previously Identified:

Grid edge remapping logic complexity in sim_utils/grid.py (lines 239-247)
Gradient scaler placement inside partition loop in train.py (line 674)
Import path inconsistency (sim_utils absolute vs relative imports)
torch.load with weights_only=False poses security risk if loading untrusted partition files (inference.py:713)

Confidence Score: 3/5

This PR introduces substantial new functionality with some logical concerns that should be addressed, but no critical blocking issues for an example/research codebase
Score of 3 reflects a balance between the PR's strengths and concerns. Positive factors: comprehensive documentation, proper integration with PhysicsNeMo infrastructure, CHANGELOG updated, and thorough examples. Concerns that lower confidence: (1) complex grid remapping logic with potential indexing bugs that could cause incorrect results, (2) gradient accumulation timing issue that may affect training convergence, (3) security risk from weights_only=False in torch.load, (4) numerous style issues suggesting the code hasn't been fully production-hardened. Since this is an example/research code rather than core library functionality, these issues are manageable but should be addressed before wider adoption
examples/reservoir_simulation/sim_utils/grid.py (edge remapping logic), examples/reservoir_simulation/xmgn/src/train.py (gradient scaler placement), and examples/reservoir_simulation/xmgn/src/inference.py (torch.load security)

Important Files Changed

File Analysis

Filename	Score	Overview
examples/reservoir_simulation/xmgn/src/inference.py	4/5	Autoregressive inference implementation with checkpoint loading, HDF5 output, and GRDECL generation - uses `weights_only=False` in torch.load on line 713
examples/reservoir_simulation/sim_utils/ecl_reader.py	4/5	ECLIPSE binary file reader supporting INIT, EGRID, UNRST formats - extensive functionality with some logging inconsistencies
examples/reservoir_simulation/sim_utils/grid.py	3/5	Grid processing with corner-point geometry and edge mapping - contains complex remapping logic with potential indexing issues
examples/reservoir_simulation/xmgn/src/train.py	3/5	Training script with distributed support, MLflow tracking, and early stopping - has gradient accumulation and scaler timing issues
examples/reservoir_simulation/xmgn/src/preprocessor.py	4/5	Data preprocessing pipeline for graph generation, partitioning, and train/val/test splitting
examples/reservoir_simulation/xmgn/conf/config_norne.yaml	4/5	Configuration for Norne Field dataset - contains relative path that may break with Hydra's chdir setting

Sequence Diagram

sequenceDiagram
    participant User
    participant Preprocessor
    participant EclReader
    participant Grid
    participant GraphBuilder
    participant Partitioner
    participant Trainer
    participant Model as MeshGraphNet
    participant Inference
    participant PostProcessor

    Note over User,PostProcessor: Phase 1: Data Preprocessing
    User->>Preprocessor: python preprocessor.py --config-name=config
    Preprocessor->>EclReader: read ECLIPSE binary files (.INIT, .EGRID, .UNRST)
    EclReader-->>Preprocessor: grid data, properties, timesteps
    Preprocessor->>Grid: create grid structure
    Grid-->>Preprocessor: corner-point geometry, connections, active cells
    Preprocessor->>GraphBuilder: build_graphs(cases, timesteps)
    loop For each case and timestep
        GraphBuilder->>GraphBuilder: create nodes (active cells) and edges (connections)
        GraphBuilder->>GraphBuilder: extract features (PERMX, PORO, PRESSURE, SWAT)
        GraphBuilder-->>Preprocessor: graph with node/edge features
    end
    Preprocessor->>Partitioner: partition graphs using METIS
    Partitioner-->>Preprocessor: partitioned graphs with halo nodes
    Preprocessor->>Preprocessor: split into train/val/test sets
    Preprocessor-->>User: processed graphs saved to dataset_dir

    Note over User,PostProcessor: Phase 2: Training
    User->>Trainer: torchrun train.py --config-name=config
    Trainer->>Trainer: initialize distributed training
    Trainer->>Trainer: load partitioned graphs
    Trainer->>Model: initialize MeshGraphNet
    loop For each epoch
        loop For each batch
            loop For each partition
                Trainer->>Model: forward(x, edge_attr, partition)
                Model-->>Trainer: predictions
                Trainer->>Trainer: calculate MSE loss on inner nodes
                Trainer->>Trainer: backward and accumulate gradients
            end
            Trainer->>Trainer: optimizer step
        end
        Trainer->>Trainer: validation on val set
        Trainer->>Trainer: check early stopping
    end
    Trainer-->>User: best checkpoint saved

    Note over User,PostProcessor: Phase 3: Inference
    User->>Inference: python inference.py --config-name=config
    Inference->>Inference: load best checkpoint
    Inference->>Inference: load test dataset
    loop For each case (autoregressive)
        loop For each timestep
            alt timestep < prev_timesteps + 1
                Inference->>Model: forward with true features
            else
                Inference->>Inference: update features with previous predictions
                Inference->>Model: forward with predicted features
            end
            Model-->>Inference: predictions
            Inference->>Inference: denormalize and store results
        end
    end
    Inference->>PostProcessor: save HDF5 files (case_name.hdf5)
    Inference->>PostProcessor: generate GRDECL files
    PostProcessor->>Grid: get coordinates and actnum
    PostProcessor->>PostProcessor: map active cells to full grid
    PostProcessor-->>User: HDF5 and GRDECL files for visualization

_{4 files reviewed, no comments}

_{Edit Code Review Agent Settings | Greptile}

mnabian · 2025-11-05T23:22:36Z

examples/reservoir_simulation/xmgn/src/data/graph_builder.py

+        Save wells_data (list of lists of dicts of Well objects) to a pickle file.
+        """
+        with open(self._output_path_well, "wb") as f:
+            pickle.dump(wells_data, f, protocol=pickle.HIGHEST_PROTOCOL)


Unfortunately, pickle has a security issue and it is recommended not to use it. Would it be possible to switch to json?

mnabian · 2025-11-05T23:35:30Z

examples/reservoir_simulation/xmgn/README.md

+- **HDF5 files**: Contain predictions and targets for each simulation case,
+  organized by timestep and variable
+- **GRDECL files**: Eclipse-compatible ASCII format that can be imported into
+  popular software such as Petrel and [ResInsight](https://resinsight.org/)


So the only way to visualize the results is to use these softwares, right?

Not necessarily. I chose this approach because: 1) It aligns with standard practices in the oil and gas industry. I am sure reservoir engineers will like it. 2) These visualization tools offer interactive features and the ability to display wells, faults, etc. (ResInsight is open source, similar to ParaView).

I’m happy to implement an alternative approach if you think it would make more sense.

tonishi-nv added 2 commits October 26, 2025 22:14

X-MGN for reservoir simulation

cd58455

Signed-off-by: Tsubasa Onishi <[email protected]>

installation bug fix

784342c

Signed-off-by: Tsubasa Onishi <[email protected]>

tonishi-nv requested review from ktangsali and megnvidia as code owners October 26, 2025 22:40

tonishi-nv changed the title ~~Tonishi/xmgn ressim~~ A new X-MeshGraphNet example for reservoir simulation. Oct 26, 2025

greptile-apps bot reviewed Oct 26, 2025

View reviewed changes

tonishi-nv added 2 commits October 27, 2025 11:10

well object docstring fix

b6c172f

Signed-off-by: Tsubasa Onishi <[email protected]>

more well object docstring fix

388ce1c

Signed-off-by: Tsubasa Onishi <[email protected]>