Skip to content

Conversation

@aas008
Copy link
Collaborator

@aas008 aas008 commented Nov 3, 2025

Fixes two critical bugs in baseline trial integration (PR #111):

  1. Baseline trials not counted in completed_trials

    • Baseline trials were creating Optuna trial objects but not incrementing self.completed_trials counter
    • This caused the optimization loop to think no trials had run
    • Result: System would try to run n_trials MORE trials after baseline
    • If trials failed quickly, created thousands of phantom failed trials
    • Fix: Increment self.completed_trials after each baseline completion (success, failure, timeout, or exception)
  2. Baseline trials blocking optimization trials in duplicate check

    • Failed baseline trials (with empty params {}) were being included in duplicate parameter detection
    • Optimization trials with empty params were incorrectly flagged as duplicates of the failed baseline
    • Result: Optimization trials skipped with "duplicate parameters" warning
    • Fix: Exclude baseline trials from duplicate check by checking is_baseline user attribute

Summary by CodeRabbit

  • Bug Fixes
    • Improved accuracy of trial completion tracking for baseline experiments across all outcome states (success, failure, timeout, exception).
    • Enhanced baseline trial handling to ensure proper reporting to the optimization framework.
    • Fixed cache cleanup for baseline trial data to prevent memory accumulation.

Fixes two critical bugs in baseline trial integration (PR openshift-psap#111):

1. **Baseline trials not counted in completed_trials**
   - Baseline trials were creating Optuna trial objects but not incrementing
     self.completed_trials counter
   - This caused the optimization loop to think no trials had run
   - Result: System would try to run n_trials MORE trials after baseline
   - If trials failed quickly, created thousands of phantom failed trials
   - Fix: Increment self.completed_trials after each baseline completion
     (success, failure, timeout, or exception)

2. **Baseline trials blocking optimization trials in duplicate check**
   - Failed baseline trials (with empty params {}) were being included in
     duplicate parameter detection
   - Optimization trials with empty params were incorrectly flagged as
     duplicates of the failed baseline
   - Result: Optimization trials skipped with "duplicate parameters" warning
   - Fix: Exclude baseline trials from duplicate check by checking
     is_baseline user attribute
@coderabbitai
Copy link
Contributor

coderabbitai bot commented Nov 3, 2025

Walkthrough

The StudyController class is modified to handle baseline trials distinctly. Baseline trials are excluded from duplicate detection, and completion counters are properly incremented across all baseline trial outcomes (success, failure, timeout, exception). Trial objects are cleaned from cache after processing.

Changes

Cohort / File(s) Summary
Baseline trial handling refinements
auto_tune_vllm/core/study_controller.py
Modified _is_duplicate_trial_ to exclude baseline trials from duplicate detection; updated _run_baseline_trials_ to increment completed_trials counter for all baseline outcomes (success, failure, timeout, exception); ensured baseline trial objects are cleared from cache after processing

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~15 minutes

  • Verify that baseline trials are correctly excluded from duplicate detection logic without affecting non-baseline trial comparisons
  • Confirm completed_trials increments occur consistently across all baseline trial outcome paths (success, failure, timeout, exception)
  • Validate that trial object cleanup from cache happens properly in all code paths to prevent memory leaks

Possibly related PRs

  • PR #111: Both PRs modify the same StudyController codepaths (_is_duplicate_trial_ and _run_baseline_trials_) to implement special baseline trial handling for caching, result reporting to Optuna, and completion accounting.

Suggested reviewers

  • thameem-abbas
  • ephoris

Poem

🐰 Baselines hop through with special care,
No duplicates to deter their rare
Journey through the trial queue,
Counting each completion true,
Clean cache, clean slate, all fair!

Pre-merge checks and finishing touches

✅ Passed checks (2 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title directly and concisely summarizes the two main bugs being fixed: baseline trial counter issues and duplicate check problems, matching the core objectives.
✨ Finishing touches
  • 📝 Generate docstrings
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment

📜 Recent review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 44c1d70 and 8e507cc.

📒 Files selected for processing (1)
  • auto_tune_vllm/core/study_controller.py (5 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
auto_tune_vllm/core/study_controller.py (1)
auto_tune_vllm/execution/trial_controller.py (1)
  • TrialState (24-28)
🔇 Additional comments (2)
auto_tune_vllm/core/study_controller.py (2)

738-743: Correct fix for baseline trial duplicate check issue.

Excluding baseline trials from duplicate detection ensures that optimization trials with the same parameters (e.g., empty params or default values) can still run even if a baseline trial with those parameters failed. This is the correct behavior since baseline trials use fixed parameters while optimization trials are sampled independently.


1178-1179: Comprehensive fix for baseline trial counter bug.

All four completion paths (success, failure, timeout, exception) now correctly increment completed_trials. This ensures that baseline trials count toward the total trial budget and prevents the optimization loop from attempting to run additional trials beyond the configured n_trials.

The increments are appropriately placed after study.tell() calls, ensuring trials are properly recorded in Optuna before being counted as complete.

Also applies to: 1191-1192, 1215-1216, 1232-1233


Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@aas008 aas008 requested review from ephoris and thameem-abbas and removed request for thameem-abbas November 3, 2025 19:44
Copy link
Collaborator

@ephoris ephoris left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

Copy link
Member

@thameem-abbas thameem-abbas left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@aas008 aas008 merged commit 7b6e8a5 into openshift-psap:main Nov 3, 2025
2 of 3 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants