Conversation

@laserkelvin
Collaborator

PhysicsNeMo Pull Request

Description

This PR introduces the initial components that make up the active learning abstraction within physicsnemo:

  • Adds a new namespace, physicsnemo.active_learning, with multiple submodules and a README.md that briefly explains the module layout until proper documentation is added. As an overview:
    • protocols defines interfaces (abstract classes) for the various components used by an active learning workflow
    • driver defines a concrete orchestrator that "drives" the end-to-end active learning process in an automated fashion with the ability to restart from checkpoints
    • config defines dataclasses that control the flow of active learning
  • Adds accompanying unit tests under test/active_learning
  • Adds a simple example that demonstrates active learning within physicsnemo on the well-known "2 moons" dataset: users can read through this example to understand which components need to be defined and how they are composed for the Driver to execute.

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.
  • The CHANGELOG.md is up to date with these changes.
  • An issue is linked to this pull request.

Dependencies

No new dependencies were added.

@laserkelvin laserkelvin requested a review from ktangsali October 23, 2025 22:06
@laserkelvin laserkelvin added the enhancement (New feature or request) label Oct 23, 2025
Contributor

@greptile-apps greptile-apps bot left a comment

Greptile Overview

Greptile Summary

This review covers only the changes made since the last review, not the entire PR. The latest changes add a logger.py module to the physicsnemo.active_learning namespace, introducing a JSONLFileLogger class for structured logging with checkpointing support. The logger integrates with the Driver orchestrator to track metrics, training losses, and active learning progress. A corresponding unit test file test/active_learning/test_logger.py validates logging functionality, metric recording, and JSON serialization. Additionally, the Driver class in driver.py was updated to instantiate and use the logger during execution phases, and __init__.py was modified to expose JSONLFileLogger in the public API.
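
As background on the JSON Lines pattern the summary refers to, a minimal sketch follows; the class name, constructor arguments, and method signature here are illustrative assumptions, not the actual JSONLFileLogger API from this PR.

```python
import json
import time
from pathlib import Path


class MinimalJSONLLogger:
    """Illustrative JSON Lines logger: one JSON object per line, flushed per record."""

    def __init__(self, log_path: str, resume: bool = False) -> None:
        self.path = Path(log_path)
        self.path.parent.mkdir(parents=True, exist_ok=True)
        # mode="a" preserves prior records when resuming from a checkpoint;
        # mode="w" (as flagged in the review) would overwrite them.
        self._file = open(self.path, "a" if resume else "w")

    def log_scalar(self, name: str, value: float, step: int) -> None:
        record = {"timestamp": time.time(), "name": name, "value": value, "step": step}
        self._file.write(json.dumps(record) + "\n")
        self._file.flush()

    def close(self) -> None:
        self._file.close()


# Usage: each call appends a single self-describing line to metrics.jsonl.
logger = MinimalJSONLLogger("outputs/metrics.jsonl", resume=False)
logger.log_scalar("train_loss", 0.42, step=1)
logger.close()
```

Written this way, a resume flag makes the append-versus-overwrite trade-off noted below explicit at the call site.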

Important Files Changed

| Filename | Score | Overview |
| --- | --- | --- |
| physicsnemo/active_learning/logger.py | 4/5 | Adds JSONLFileLogger class with JSON Lines serialization for metrics and checkpoint metadata; potential issue with mode="w" overwriting logs on restart |
| test/active_learning/test_logger.py | 4/5 | Comprehensive unit tests for JSONLFileLogger covering directory creation, logging, context manager behavior, and JSON serialization |
| physicsnemo/active_learning/driver.py | 4/5 | Integrates logger instantiation and calls to log_scalar/log_training_loss methods in training/metrology phases |
| physicsnemo/active_learning/__init__.py | 5/5 | Adds JSONLFileLogger to public API exports in __all__ list |

Confidence score: 4/5

  • This PR introduces focused logging infrastructure with clear integration points in the active learning workflow.
  • Score reflects one unresolved design question: the logger uses mode="w" which will overwrite existing logs when restarting from checkpoints (previously flagged in line 280 of logger.py). This could cause data loss for long-running experiments that resume from checkpoints, though it may be intentional for fresh logging per run. The other changes are well-structured and follow established patterns.
  • Pay close attention to physicsnemo/active_learning/logger.py to verify the mode="w" behavior aligns with checkpoint/restart requirements.

20 files reviewed, 16 comments


Contributor

@greptile-apps greptile-apps bot left a comment

Greptile Overview

Greptile Summary

This review covers only the changes made since the last review, not the entire PR. The most recent change converts the moon_example.py demonstration from a simple training workflow into a comprehensive example that showcases checkpoint loading and resumption capabilities. The modifications include: (1) adding a time import, (2) disabling progress bars during training via use_progress_bars=False, (3) removing the explicit model save step, and (4) introducing a checkpoint reload demonstration that loads from a hardcoded checkpoint at "step_42" and conditionally reruns the driver when RERUN=True. This change aligns with the PR's goal of demonstrating the active learning Driver's checkpoint/resume feature, which is critical for production workflows that need fault tolerance. The modification fits naturally into the existing examples/active_learning/moons/ structure and serves as a reference for users implementing resumable active learning workflows.

Important Files Changed

| Filename | Score | Overview |
| --- | --- | --- |
| examples/active_learning/moons/moon_example.py | 2/5 | Adds checkpoint reload demonstration with hardcoded path and boolean flag |

Confidence score: 2/5

  • This PR change introduces hardcoded assumptions and execution flags that make the example fragile and potentially unsafe to merge without modifications.
  • Score reflects three critical issues: (1) hardcoded RERUN=True flag will automatically overwrite subsequent checkpoints every time the example runs, (2) hardcoded "step_42" checkpoint path is brittle and will fail if configuration changes, (3) time.sleep(5) appears to be leftover debugging code with no clear purpose.
  • Pay close attention to lines 185-201 where the checkpoint reload logic is implemented; the hardcoded assumptions should be replaced with conditional execution (e.g., command-line arguments or environment variables, as in the sketch after this list) to prevent unintended checkpoint overwrites in CI/automated testing environments.
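
One way to express that suggestion, sketched with a stand-in driver object and hypothetical MOONS_RERUN / MOONS_CHECKPOINT environment variables (none of these names come from the PR):

```python
import os
from pathlib import Path


class FakeDriver:
    """Stand-in for the example's Driver; only the gating pattern is the point."""

    def load_checkpoint(self, path: Path) -> None:
        print(f"would restore state from {path}")

    def run(self) -> None:
        print("would resume the active learning loop")


# Opt in via environment variables instead of a hardcoded RERUN=True / "step_42" path.
RERUN = os.environ.get("MOONS_RERUN", "0") == "1"
CHECKPOINT = os.environ.get("MOONS_CHECKPOINT", "")

if RERUN and CHECKPOINT:
    ckpt_path = Path(CHECKPOINT)
    if not ckpt_path.exists():
        raise FileNotFoundError(f"Requested checkpoint {ckpt_path} does not exist")
    driver = FakeDriver()
    driver.load_checkpoint(ckpt_path)
    driver.run()
```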

1 file reviewed, 2 comments


@dallasfoster
Collaborator

/blossom-ci

Contributor

@greptile-apps greptile-apps bot left a comment

Greptile Overview

Greptile Summary

This review covers only changes made since the last review, not the entire PR. The latest changes focus on improving checkpoint/restart functionality and metrology record persistence. Three files were modified: 1) driver.py adds detailed documentation to load_checkpoint() explaining parameters, caveats, and explicitly warns that strategy states are not yet restored (with a TODO comment marking this limitation); 2) protocols.py adds strategy_dir and checkpoint_dir properties to enable standardized filesystem organization for strategy persistence, enhances MetrologyStrategy with optional path parameters for serialize_records() and adds a new load_records() method; 3) moon_strategies.py refactors the example F1Metrology class to write outputs to step-specific subdirectories under strategy_dir rather than overwriting files in driver.log_dir. These changes support the broader active learning abstraction's goal of enabling long-running experiments that can be interrupted and resumed, though some implementation gaps remain.
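
To make the step-scoped persistence idea concrete, here is a small sketch of serializing and reloading records under step-specific subdirectories; the function names, record format, and directory layout are assumptions for illustration, not the protocol's actual signatures.

```python
import json
from pathlib import Path


def serialize_records(records: list[dict], strategy_dir: Path, step: int) -> Path:
    """Write metrology records to a step-specific subdirectory so successive
    active learning steps never overwrite each other's outputs."""
    out_dir = strategy_dir / f"step_{step}"
    out_dir.mkdir(parents=True, exist_ok=True)
    out_path = out_dir / "records.json"
    out_path.write_text(json.dumps(records, indent=2))
    return out_path


def load_records(path: Path) -> list[dict]:
    """Counterpart used when resuming: read previously serialized records back."""
    return json.loads(path.read_text())


# Example: step 3 metrics land in <strategy_dir>/step_3/records.json.
saved = serialize_records([{"f1": 0.91}], Path("outputs/f1_metrology"), step=3)
print(load_records(saved))
```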

PR Description Notes:

  • PR title uses lowercase "abstraction" which is inconsistent with typical GitHub PR title casing conventions

Important Files Changed

| Filename | Score | Overview |
| --- | --- | --- |
| physicsnemo/active_learning/driver.py | 5/5 | Documentation improvements to load_checkpoint method with important admonitions and TODO marker |
| physicsnemo/active_learning/protocols.py | 2/5 | Adds strategy persistence infrastructure but has runtime coupling issues and missing integration |
| examples/active_learning/moons/moon_strategies.py | 1/5 | Refactors output organization but references undefined strategy_dir attribute causing AttributeError |

Confidence score: 2/5

  • This PR introduces critical runtime errors that will prevent the example from executing successfully
  • Score lowered due to: (1) undefined strategy_dir attribute in moon_strategies.py will cause AttributeError at runtime; (2) implicit runtime coupling to self.driver in protocols.py without formal type declarations creates fragile dependencies; (3) missing integration for the new load_records() method means checkpoint restoration remains incomplete; (4) pre-existing precision/recall calculation bugs in the example that compute metrics for inconsistent classes
  • Pay close attention to examples/active_learning/moons/moon_strategies.py which will fail immediately when serialize_records() is called, and physicsnemo/active_learning/protocols.py where the properties assume driver attachment without type-safe guarantees

3 files reviewed, 4 comments


Contributor

@greptile-apps greptile-apps bot left a comment

Greptile Overview

Greptile Summary

This review covers changes made since the last review, not the entire PR. A single line change was made to a test file to correct the expected exception type. The test test_missing_registration() in test/active_learning/test_registry.py previously expected KeyError when calling registry.construct("missing_strategy"), but now expects NameError. This aligns the test with the actual implementation behavior: the construct() method internally delegates to get_class(), which raises NameError for missing classes rather than KeyError. This change maintains consistency with the registry's error handling semantics, where direct dictionary access via registry["missing_strategy"] raises KeyError (as tested on line 54), while construction via construct() raises NameError due to its internal delegation path.
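
A toy registry reproducing the two error paths described above, KeyError from plain dictionary indexing versus NameError from name-based lookup inside construct(); this illustrates the semantics only and is not the _registry implementation.

```python
class ToyRegistry:
    """Toy registry illustrating the two error paths discussed in the review."""

    def __init__(self) -> None:
        self._registry: dict[str, type] = {}

    def register(self, name: str, cls: type) -> None:
        self._registry[name] = cls

    def __getitem__(self, name: str) -> type:
        # Plain dict semantics: a missing key raises KeyError.
        return self._registry[name]

    def get_class(self, name: str) -> type:
        if name not in self._registry:
            # Name-based lookup fails with NameError, mirroring "name is not defined".
            raise NameError(f"No class registered under {name!r}")
        return self._registry[name]

    def construct(self, name: str, *args, **kwargs):
        # construct() delegates to get_class(), so callers see NameError, not KeyError.
        return self.get_class(name)(*args, **kwargs)


registry = ToyRegistry()
try:
    registry.construct("missing_strategy")
except NameError as err:
    print(err)
```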

Important Files Changed

| Filename | Score | Overview |
| --- | --- | --- |
| test/active_learning/test_registry.py | 5/5 | Changed expected exception from KeyError to NameError in test_missing_registration() to match actual construct() behavior |

Confidence score: 5/5

  • This PR is safe to merge with minimal risk
  • The change is a single-line test correction that aligns the test expectations with the actual implementation behavior, ensuring test accuracy without affecting production code
  • No files require special attention

1 file reviewed, no comments


@dallasfoster
Collaborator

/blossom-ci

Contributor

@greptile-apps greptile-apps bot left a comment

Greptile Overview

Greptile Summary

This review covers the entire pull request from start to finish, including all changes made. This PR introduces a comprehensive active learning abstraction framework to PhysicsNeMo under the new physicsnemo.active_learning namespace. The implementation provides protocol-based interfaces for query, labeling, and metrology strategies, a Driver orchestrator that automates the end-to-end workflow with checkpoint/restart capabilities, configuration dataclasses for controlling the AL loop, and a default training loop with static capture optimization. The framework integrates with PhysicsNeMo's existing infrastructure (DistributedManager, DDP) and follows the repository's protocol-driven design philosophy. A pedagogical "2 moons" binary classification example demonstrates how users compose strategies, configure the driver, and execute active learning workflows. The changes include comprehensive unit tests covering registry, configuration, checkpointing, training loop, and driver orchestration.
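
For orientation, a highly simplified sketch of how protocol-typed strategies compose under an orchestrator; every class, method, and signature below is an illustrative stand-in, not the physicsnemo.active_learning API introduced in this PR.

```python
from dataclasses import dataclass, field
from typing import Protocol


class QueryStrategy(Protocol):
    def query(self, pool: list[float]) -> list[int]: ...


class LabelingStrategy(Protocol):
    def label(self, pool: list[float], indices: list[int]) -> list[tuple[float, int]]: ...


@dataclass
class ToyDriver:
    """Toy orchestrator: train -> query -> label, repeated for a fixed number of steps."""

    query_strategy: QueryStrategy
    labeling_strategy: LabelingStrategy
    max_steps: int = 3
    labeled: list[tuple[float, int]] = field(default_factory=list)

    def run(self, pool: list[float]) -> None:
        for step in range(self.max_steps):
            # A real driver would (re)train the model and run metrology here.
            picked = self.query_strategy.query(pool)
            self.labeled += self.labeling_strategy.label(pool, picked)
            print(f"step {step}: labeled set size = {len(self.labeled)}")


class LowestValueQuery:
    """Pick the two smallest pool entries (stand-in for an uncertainty-based query)."""

    def query(self, pool: list[float]) -> list[int]:
        return sorted(range(len(pool)), key=pool.__getitem__)[:2]


class ThresholdOracle:
    """Label each queried point by thresholding its value (stand-in for a real labeler)."""

    def label(self, pool: list[float], indices: list[int]) -> list[tuple[float, int]]:
        return [(pool[i], int(pool[i] > 0.5)) for i in indices]


ToyDriver(LowestValueQuery(), ThresholdOracle()).run(pool=[0.9, 0.1, 0.4, 0.7])
```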

Important Files Changed

| Filename | Score | Overview |
| --- | --- | --- |
| physicsnemo/active_learning/protocols.py | 5/5 | Defines protocol-based interfaces for all AL components (strategies, models, driver) with comprehensive documentation and mermaid diagrams |
| physicsnemo/active_learning/driver.py | 3.5/5 | Implements orchestrator that coordinates training/query/labeling/metrology phases with DDP support and checkpoint management; has optimizer state handling issues |
| physicsnemo/active_learning/loop.py | 2/5 | Provides training loop with static capture and checkpointing; critical issues with LR scheduler stepping per batch, validation caching, and device handling after optimizer init |
| physicsnemo/active_learning/config.py | 4.5/5 | Configuration dataclasses with validation and serialization; minor redundancy in device serialization and max_active_learning_steps validation |
| physicsnemo/active_learning/_registry.py | 4.5/5 | Registration system for AL components with decorator-based registration and dynamic class loading; logic issue with registry check timing |
| physicsnemo/active_learning/logger.py | 4/5 | Structured logging with context injection and dual formatters; file handler overwrites logs on restart and requires Python 3.10+ |
| physicsnemo/active_learning/__init__.py | 4/5 | Module initialization exposing public API; missing protocols module from __all__ which users need for custom implementations |
| examples/active_learning/moons/moon_example.py | 4/5 | Demonstration script for 2-moons classification; model instance reuse during checkpoint loading may cause unexpected behavior |
| examples/active_learning/moons/moon_strategies.py | 2/5 | Example query/labeling/metrology strategies; critical precision/recall calculation errors with inconsistent docstrings |
| examples/active_learning/moons/moon_data.py | 4/5 | Two-moon dataset implementation with DataPool protocol; potential rounding bug in label assignment due to integer truncation |
| test/active_learning/test_driver.py | 3.5/5 | Driver orchestration tests; logical inconsistency in skip_training test and queue manipulation that may mask integration issues |
| test/active_learning/test_loop.py | 4/5 | Training loop tests covering device/dtype casting and execution; weak override verification in final test |
| test/active_learning/test_config.py | 3/5 | Configuration validation tests; parametrized tests reject infinity but default max_active_learning_steps is infinity |
| test/active_learning/test_checkpointing.py | 5/5 | Comprehensive checkpoint save/load/resume tests with phase verification and queue serialization |
| test/active_learning/test_registry.py | 4/5 | Registry tests for registration and construction; misleading test name for get_class_no_module_path |
| test/active_learning/conftest.py | 4/5 | Test fixtures with mock strategies; all fixtures use autouse=True unnecessarily |
| examples/active_learning/moons/moon_data.py | 4/5 | Dataset with active learning protocol support; off-by-one error in label assignment boundary |
| examples/active_learning/moons/README.md | 4/5 | Comprehensive example documentation; missing checkpoint timing details and unverified static capture default claim |
| CHANGELOG.md | 5/5 | Added entry documenting new active learning namespace following Keep a Changelog format |
| physicsnemo/active_learning/README.md | 5/5 | Module overview documentation explaining AL workflow and architecture |

Confidence score: 2.5/5

  • This PR introduces foundational infrastructure but has several critical issues that could cause incorrect behavior in production active learning workflows
  • Score reflects multiple critical bugs: LR scheduler stepping per batch instead of per epoch (loop.py:505; the conventional per-epoch placement is sketched after this list), precision/recall calculations with incorrect formulas (moon_strategies.py:196-223), potential optimizer state corruption from device changes after init (loop.py:468), missing validation function caching causing performance degradation, and test inconsistencies around infinity validation
  • Pay close attention to physicsnemo/active_learning/loop.py (LR scheduler and device handling), examples/active_learning/moons/moon_strategies.py (metric calculations), physicsnemo/active_learning/driver.py (optimizer state handling), and test/active_learning/test_config.py (validation logic consistency)
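
For reference, a minimal sketch of the per-epoch scheduler placement the comment above refers to, using a toy model and an epoch-based StepLR schedule (schedulers designed to step per batch, such as OneCycleLR, are a separate case); this is not code from the PR.

```python
import torch
from torch import nn

model = nn.Linear(2, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
# StepLR is an epoch-based schedule: it should advance once per epoch.
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=1, gamma=0.5)
batches = [(torch.randn(8, 2), torch.randn(8, 1)) for _ in range(5)]

for epoch in range(3):
    for x, y in batches:
        optimizer.zero_grad()
        loss = nn.functional.mse_loss(model(x), y)
        loss.backward()
        optimizer.step()  # parameter update happens per batch
    # LR decay happens per epoch; stepping inside the batch loop above would
    # shrink the learning rate five times faster than intended here.
    scheduler.step()
    print(f"epoch {epoch}: lr = {scheduler.get_last_lr()[0]:.4f}")
```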

21 files reviewed, 18 comments


Contributor

@greptile-apps greptile-apps bot left a comment

Greptile Overview

Greptile Summary

This review covers only the changes made since the last review, not the entire PR. The most recent updates consist primarily of copyright year updates (2024 → 2025) across all modified files in the active learning abstraction. No functional code changes were made in this iteration - all the core implementation (protocols, driver orchestration, configuration dataclasses, training loop, registry, logging, examples, and comprehensive test suite) remains identical to the previous review. The copyright updates ensure legal compliance as the code transitions into 2025 and align with NVIDIA's standard licensing header requirements.

Important Files Changed

| Filename | Score | Overview |
| --- | --- | --- |
| physicsnemo/active_learning/logger.py | 5/5 | Copyright year updated to 2025; no functional changes to logging system |
| physicsnemo/active_learning/__init__.py | 5/5 | Copyright year updated to 2025; public API exports unchanged |
| test/active_learning/conftest.py | 5/5 | Copyright year updated to 2025; test fixtures unchanged |
| test/active_learning/test_checkpointing.py | 5/5 | Copyright year updated to 2025; checkpointing tests unchanged |
| test/active_learning/test_driver.py | 5/5 | Copyright year updated to 2025; driver tests unchanged |
| examples/active_learning/moons/moon_data.py | 5/5 | Copyright year updated to 2025; moon dataset implementation unchanged |
| physicsnemo/active_learning/protocols.py | 5/5 | Copyright year updated to 2025; protocol abstractions unchanged |
| physicsnemo/active_learning/_registry.py | 5/5 | Copyright year updated to 2025; registry implementation unchanged |
| test/active_learning/__init__.py | 5/5 | Copyright year updated to 2025; empty package marker unchanged |
| examples/active_learning/moons/moon_strategies.py | 5/5 | Copyright year updated to 2025; strategy implementations unchanged |
| test/active_learning/test_loop.py | 5/5 | Copyright year updated to 2025; training loop tests unchanged |
| test/active_learning/test_registry.py | 5/5 | Copyright year updated to 2025; registry tests unchanged |
| physicsnemo/active_learning/loop.py | 5/5 | Copyright year updated to 2025; DefaultTrainingLoop implementation unchanged |
| physicsnemo/active_learning/config.py | 5/5 | Copyright year updated to 2025; configuration dataclasses unchanged |
| physicsnemo/active_learning/driver.py | 5/5 | Copyright year updated to 2025; Driver orchestrator unchanged |
| examples/active_learning/moons/moon_example.py | 5/5 | Copyright year updated to 2025; example demonstration unchanged |
| test/active_learning/test_config.py | 5/5 | Copyright year updated to 2025; configuration tests unchanged |

Confidence score: 5/5

  • This iteration contains only administrative copyright header updates with no functional code changes, making it safe to merge with minimal risk
  • Score reflects that copyright year updates are purely cosmetic changes that cannot introduce bugs or break existing functionality
  • No files require special attention as all changes are identical in nature (copyright year 2024 → 2025)

17 files reviewed, 6 comments


@ktangsali
Collaborator

/blossom-ci

Contributor

@greptile-apps greptile-apps bot left a comment

Greptile Overview

Greptile Summary

This review covers only the changes made since the last review, not the entire PR. The most recent changes are documentation improvements to the active learning module, specifically enhancing docstring examples. Three files were updated: logger.py now includes complete import statements in the log_context docstring example (lines 311-313); _registry.py has updated all docstring import paths from registry to _registry to reflect the private module naming convention (lines 65, 165, 249); and protocols.py has both added import statements to decorator examples and converted pseudocode sections to proper reStructuredText code blocks for better Sphinx rendering (lines 914-985, 1103-1135). These changes ensure docstring examples are self-contained, copy-pasteable, and aligned with standard documentation practices for the new active learning abstraction.
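
For readers unfamiliar with the Sphinx directive involved, a generic illustration of what a docstring code block looks like after such a conversion; the function and the usage shown inside it are placeholders, not content from the PR.

```python
def example_strategy_hook() -> None:
    """Placeholder function showing a Sphinx-renderable docstring example.

    Example
    -------
    .. code-block:: python

        # hypothetical usage, present only to demonstrate the directive
        strategy = MyQueryStrategy(top_k=8)
        indices = strategy.query(pool)
    """
```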

PR Description Notes:

  • Line 1 contains <!-- markdownlint-disable MD013--> which should be <!-- markdownlint-disable MD013 --> (space before closing)

Important Files Changed

| Filename | Score | Overview |
| --- | --- | --- |
| physicsnemo/active_learning/logger.py | 5/5 | Added import statements to log_context docstring example for completeness |
| physicsnemo/active_learning/_registry.py | 4/5 | Updated docstring imports to reference private module path _registry; existing O(n) lookup inefficiency at line 314 remains |
| physicsnemo/active_learning/protocols.py | 5/5 | Added imports to decorator examples and converted pseudocode to proper code blocks |

Confidence score: 5/5

  • This PR is safe to merge with minimal risk - these are documentation-only changes with no functional impact
  • Score reflects straightforward documentation improvements that enhance example clarity without modifying runtime behavior or introducing new logic
  • No files require special attention - all changes are cosmetic docstring enhancements that improve developer experience

3 files reviewed, 1 comment


Contributor

@greptile-apps greptile-apps bot left a comment

Greptile Overview

Greptile Summary

This review covers only the changes made since the last review, not the entire PR. The latest changes focus on refining documentation quality in the active learning module. Specifically, the updates improve docstring examples in the registry system by using more distinctive strategy names (my_new_strategy, my_latest_strategy) to make examples clearer and avoid confusion from name reuse. Additionally, protocol documentation was simplified by removing advanced usage examples (static graph capture optimizations) and keeping only minimal viable implementations with proper import torch statements. These changes maintain a clean separation between interface specifications (protocols) and implementation optimizations, ensuring documentation focuses on "what" to implement rather than "how" to optimize, which aligns with good abstraction design in the active learning framework.

Important Files Changed

| Filename | Score | Overview |
| --- | --- | --- |
| physicsnemo/active_learning/_registry.py | 4/5 | Updated docstring examples with more descriptive strategy names; minor performance concern using property instead of direct dict access |
| physicsnemo/active_learning/protocols.py | 5/5 | Simplified protocol documentation by removing advanced static capture examples and adding missing imports to minimal examples |

Confidence score: 5/5

  • This PR is safe to merge with minimal risk
  • Documentation-only changes with no functional modifications; updates improve clarity and maintain proper abstraction boundaries between protocols and implementations
  • No files require special attention

2 files reviewed, 3 comments


@ktangsali
Collaborator

/blossom-ci

Contributor

@greptile-apps greptile-apps bot left a comment

Greptile Overview

Greptile Summary

This review covers only the changes made since the last review, not the entire PR. Recent updates include addressing previous feedback on metric calculations, error handling, and documentation clarity. The main changes involve fixing precision/recall calculation inconsistencies in moon_strategies.py, adjusting validation logic in configuration classes, and refining checkpoint handling in the driver and training loop. The developer has also improved test coverage for edge cases and added clarifying comments to example code.
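
For context on the class-consistency point raised below, a small sketch of binary precision, recall, and F1 computed against a single positive class with toy tensors; this is generic metric code, not the example's F1Metrology implementation.

```python
import torch


def binary_f1(preds: torch.Tensor, targets: torch.Tensor, positive_class: int = 1) -> float:
    """Precision and recall must reference the same positive class, otherwise F1 is meaningless."""
    pred_pos = preds == positive_class
    true_pos = targets == positive_class
    tp = (pred_pos & true_pos).sum().item()
    fp = (pred_pos & ~true_pos).sum().item()
    fn = (~pred_pos & true_pos).sum().item()
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0


preds = torch.tensor([1, 0, 1, 1, 0])
targets = torch.tensor([1, 0, 0, 1, 1])
print(binary_f1(preds, targets))  # precision = 2/3, recall = 2/3, F1 ≈ 0.667
```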

Important Files Changed

| Filename | Score | Overview |
| --- | --- | --- |
| examples/active_learning/moons/moon_strategies.py | 2/5 | Critical: precision and recall calculate metrics for different classes (class 1 vs class 0), producing incorrect F1 scores; class mismatch between docstrings and implementation |
| physicsnemo/active_learning/driver.py | 3/5 | Extensive enum-to-string conversion issues throughout causing incorrect directory paths and attribute lookups; will break checkpoint paths and phase context management at runtime |
| physicsnemo/active_learning/config.py | 4/5 | Redundant validation logic in DriverConfig and inconsistent device serialization that could fail at runtime; otherwise well-structured configuration system |
| physicsnemo/active_learning/loop.py | 4/5 | Device override logic uses self.device instead of local variable, causing incorrect device placement; also uses weights_only=False without safety documentation |
| physicsnemo/active_learning/logger.py | 4/5 | Extra keys filtering includes internal logging attributes that pollute JSON output; file handler uses mode='w' which overwrites logs on restart |
| test/active_learning/conftest.py | 4/5 | All fixtures configured with autouse=True causing them to run unnecessarily for every test; creates resource waste and potential side effects |
| test/active_learning/test_loop.py | 4/5 | test_train_step_fn_override_in_call doesn't validate override actually occurred; device/dtype tests don't verify values were applied to batch data |
| examples/active_learning/moons/moon_example.py | 4/5 | RERUN hard-coded to True causes checkpoint overwrite by default; reuses trained model instance for checkpoint reload which may cause confusion |
| physicsnemo/active_learning/_registry.py | 5/5 | Clean registry implementation with proper validation; enables plugin-style architecture for user-defined strategies |
| physicsnemo/active_learning/__init__.py | 4.5/5 | Well-structured public API; protocols module excluded from __all__ may limit discoverability for users implementing custom components |
| test/active_learning/test_registry.py | 4/5 | Test name/implementation mismatch and direct access to private _registry attribute; otherwise comprehensive coverage |
| test/active_learning/test_driver.py | 4/5 | Comprehensive driver orchestration tests; queues populated only once but loop runs multiple times, which may not validate intended behavior |
| test/active_learning/test_checkpointing.py | 4.5/5 | Thorough checkpoint/resume validation with good edge case coverage; weights_only=False acceptable in test context |
| test/active_learning/test_config.py | 5/5 | Comprehensive configuration validation tests with good parametrization and edge case coverage |
| examples/active_learning/moons/README.md | 5/5 | Well-structured documentation covering usage, implementation, and customization options |
| examples/active_learning/moons/.gitignore | 5/5 | Standard practice to exclude runtime-generated artifacts |
| CHANGELOG.md | 5/5 | Properly documents new active learning feature following Keep a Changelog format |
| physicsnemo/active_learning/README.md | 5/5 | Clear documentation introducing the active learning workflow and module structure |
| test/active_learning/__init__.py | 5/5 | Standard test package initialization with proper licensing |

Confidence score: 3/5

  • This PR contains critical logic issues that will cause runtime failures, particularly enum-to-string conversions in the driver and metric calculation errors in the example strategies
  • Score reflects multiple logic issues across core files: driver enum handling breaks checkpoint paths and phase contexts, strategy metrics produce incorrect F1 scores due to class mismatches, and loop device override logic doesn't use the correct variable
  • Pay close attention to physicsnemo/active_learning/driver.py (enum string conversions at lines 559, 1157, 1257, 1270, 1290, 1366), examples/active_learning/moons/moon_strategies.py (precision/recall class mismatch at lines 176-226), and physicsnemo/active_learning/loop.py (device variable usage at line 465)

21 files reviewed, 11 comments


@laserkelvin laserkelvin merged commit 8d018f1 into NVIDIA:main Oct 27, 2025
1 check passed
@laserkelvin laserkelvin deleted the active-learning-abstraction branch October 27, 2025 22:50