
Conversation

@racimrl racimrl (Contributor) commented Sep 5, 2025

Description

Implements comprehensive CUDA memory management and NCCL error handling to resolve Out of Memory (OOM) crashes during 70B model evaluation and distributed checkpoint operations.

Related Issue(s)

CUDA OOM during FSDP all-gather operations in evaluate_model calls
NCCL "unhandled cuda error" failures during distributed checkpoint loading
Memory fragmentation causing gradual memory exhaustion across training windows

Type of Change

  • Feature (adding new functionality)
  • Fix (resolving a bug or issue)
  • Docs (documentation updates)
  • Refactor (code changes that don't affect functionality)
  • Maintenance (dependency updates or other maintenance)
  • Tests (adding or improving tests)
  • Breaking change (fix or feature with incompatible API changes)
  • Other: _____

Branch Naming

  • My branch follows the project's naming convention (e.g., feature/add-new-capability)

Commit Messages

  • My commits are small, atomic, and have proper commit messages
  • Commit messages are in imperative mood with a capitalized summary under 50 chars

Code Quality

  • I've performed a self-review of my code
  • I've added appropriate docstrings following the project's conventions
  • I've added proper logging where necessary (without trailing periods)
  • I've applied linting and formatting with Ruff
  • My code generates no new warnings

Testing

  • I've added tests for new functionality or bug fixes
  • All tests pass locally with my changes
  • Test coverage has not decreased

Documentation

  • I've updated documentation to reflect my changes
  • I've updated comments in hard-to-understand areas

If this is a breaking change

Screenshots/Examples

Additional Notes

Summary by CodeRabbit

  • New Features
    • Adds proactive GPU memory health checks and automatic cleanup to keep runs stable.
    • Offers richer, optional GPU memory usage logging for easier troubleshooting.
  • Bug Fixes
    • Reduces CUDA OOM and NCCL-related crashes with safer retry/skip behavior during evaluation and checkpointing.
    • Improves resilience when saving/loading checkpoints; operations proceed when recovery is possible.
    • Minimizes memory spikes by clearing caches around save/restore and gradient steps.
  • Documentation
    • Clarifies runtime logs with more informative memory-related messages.

@coderabbitai coderabbitai bot commented Sep 5, 2025

Walkthrough

Adds environment defaults for CUDA/NCCL, introduces a GPU memory check/cleanup helper, integrates memory hygiene and OOM/NCCL handling across evaluation, gradient application, and checkpoint save/load, and adds detailed memory logging. Minor non-semantic reformatting is included. All changes are confined to neurons/validator.py.

Changes

  • CUDA/NCCL env defaults (neurons/validator.py): Sets PYTORCH_CUDA_ALLOC_CONF, NCCL_DEBUG, NCCL_IB_DISABLE, NCCL_SOCKET_IFNAME, and NCCL_P2P_DISABLE at import time if unset.
  • Memory health helper (neurons/validator.py): Adds Validator._check_memory_and_cleanup(context) to assess GPU usage, log on master, run torch.cuda.empty_cache() and gc, and gate operations based on thresholds.
  • Model save/restore hygiene (neurons/validator.py): Clears the GPU cache and runs GC before computing the state dict and before restoring state.
  • Evaluation loop safeguards (neurons/validator.py): Pre-checks memory before baseline and per-UID evaluation; wraps evaluation in an OOM try/except with one retry; skips windows/UIDs on insufficient memory; adds post-step cleanups and optional memory logs (a minimal sketch of the retry pattern follows this list).
  • Gradient application and step cleanup (neurons/validator.py): Inserts explicit memory cleanups after gradient application and at checkpoint-related boundaries; adds optional debug memory logs.
  • Checkpoint save resilience (FSDP) (neurons/validator.py): Checks memory before saving; wraps save/upload in try/except for CUDA OOM/NCCL errors; cleans up and proceeds without a checkpoint on failure.
  • Checkpoint load robustness (neurons/validator.py): Clears memory and checks before loading; on failure, performs aggressive cleanup and a single retry; if it still fails, continues without loading.
  • Minor reformatting (neurons/validator.py): Non-semantic line reformatting in score updates.
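
As noted in the Evaluation loop safeguards item above, the core of the change is an OOM-tolerant retry around each evaluation call. Below is a minimal, self-contained sketch of that pattern only; the names (run_with_oom_retry, evaluate_uid, _cleanup) are placeholders, not the validator's actual API.

import gc
from typing import Callable, Optional

import torch


def _cleanup() -> None:
    # Free cached CUDA blocks and collect Python garbage between attempts
    gc.collect()
    if torch.cuda.is_available():
        torch.cuda.empty_cache()


def run_with_oom_retry(evaluate_uid: Callable[[int], float], uid: int) -> Optional[float]:
    """Run one evaluation; on CUDA OOM, clean up and retry exactly once, else skip."""
    for attempt in (1, 2):
        try:
            return evaluate_uid(uid)
        except torch.cuda.OutOfMemoryError:
            _cleanup()
            if attempt == 2:
                # Second OOM in a row: skip this UID instead of crashing the window
                return None
    return None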

Sequence Diagram(s)

sequenceDiagram
  autonumber
  participant V as Validator
  participant GPU as CUDA Device
  participant Log as Logger

  rect rgba(230,245,255,0.5)
  note over V,GPU: Baseline Evaluation Gate
  V->>GPU: _check_memory_and_cleanup("baseline")
  alt sufficient memory
    V->>V: run baseline eval
  else insufficient
    V->>Log: warn "skip baseline"
    V-->>V: continue without baseline
  end
  end

  loop For each UID
    rect rgba(240,255,230,0.5)
    V->>GPU: _check_memory_and_cleanup("uid")
    alt sufficient memory
      V->>V: evaluate UID
      alt CUDA OOM during eval
        V->>GPU: empty_cache + GC
        V->>V: retry once
        alt retry success
          V->>Log: info "retry succeeded"
        else retry OOM
          V->>Log: warn "skip UID"
        end
      end
      V->>GPU: post-step cleanup (empty_cache + GC)
    else insufficient
      V->>Log: warn "skip UID"
    end
    end
  end
sequenceDiagram
  autonumber
  participant V as Validator
  participant FS as Filesystem/Checkpoint
  participant GPU as CUDA Device
  participant Log as Logger

  rect rgba(255,245,230,0.5)
  note over V,FS: Save FSDP Checkpoint
  V->>GPU: _check_memory_and_cleanup("save")
  alt sufficient
    V->>FS: save local shard
    alt save/upload OOM/NCCL
      V->>GPU: cleanup (empty_cache + GC)
      V->>Log: warn "skip checkpoint"
    else success
      V->>Log: info "checkpoint saved"
    end
  else insufficient
    V->>Log: warn "skip checkpoint (mem)"
  end
  end

  rect rgba(255,250,230,0.5)
  note over V,FS: Load Checkpoint
  V->>GPU: cleanup + _check_memory_and_cleanup("load")
  alt sufficient
    V->>FS: load checkpoint
    alt load OOM/NCCL
      V->>GPU: aggressive cleanup
      V->>FS: retry load once
      alt retry fails
        V->>Log: warn "continue without checkpoint"
      else retry success
        V->>Log: info "loaded"
      end
    end
  else insufficient
    V->>Log: error "insufficient memory"
  end
  end

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Poem

I tidied the VRAM with a whisker’s sweep,
Shushed the NCCL gremlins back to sleep.
Skip a checkpoint? I won’t despair—
I’ll hop around OOMs with careful care.
Cache now cleared, logs aglow—
Onward I bound, to validate and go! 🐇✨


@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 5

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
neurons/validator.py (1)

1697-1698: Guard unconditional CUDA cache calls.

torch.cuda.empty_cache() on CPU‑only builds raises. Wrap with torch.cuda.is_available() everywhere it’s unguarded.

Examples:

-                torch.cuda.empty_cache()
+                if torch.cuda.is_available():
+                    torch.cuda.empty_cache()

Apply at the listed locations.

Also applies to: 1811-1812, 1958-1959, 2351-2352, 2407-2408, 3158-3159, 3211-3212

🧹 Nitpick comments (1)
neurons/validator.py (1)

161-166: Minor: avoid repeated inline gc imports.

Import gc once at module scope and reuse to reduce import churn.

Also applies to: 177-182
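
A minimal illustration of the suggestion, using a hypothetical helper name rather than the validator's real functions:

import gc  # imported once at module scope, reused below


def post_step_cleanup() -> None:
    # Reuse the module-level gc instead of re-importing it inside each function
    gc.collect()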

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between 64e98a2 and e492129.

📒 Files selected for processing (1)
  • neurons/validator.py (17 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
neurons/validator.py (4)
src/tplr/logging.py (2)
  • log_with_context (207-223)
  • log_with_context (290-309)
src/tplr/sharded_sampler.py (1)
  • set_window_uid (74-79)
neurons/trainer.py (1)
  • evaluate_model (340-392)
src/tplr/dcp_checkpoint.py (3)
  • save_local_async (222-268)
  • upload (271-404)
  • download_and_load (657-680)
🪛 GitHub Actions: CI
neurons/validator.py

[error] 69-69: I001 Import block is un-sorted or un-formatted.

🔇 Additional comments (2)
neurons/validator.py (2)

3394-3405: Checkpoint load robustness LGTM.

Pre‑clear, memory check, and single retry with aggressive cleanup are appropriate.

Also applies to: 3409-3442


2642-2645: Vectorized inactivity penalty LGTM.

Cleaner and efficient in‑place scaling for positive scores only.

Comment on lines +68 to +82
# Set CUDA memory allocator configuration to prevent fragmentation
import os
if not os.environ.get('PYTORCH_CUDA_ALLOC_CONF'):
    os.environ['PYTORCH_CUDA_ALLOC_CONF'] = 'expandable_segments:True'

# Set NCCL environment variables for better memory management and debugging
if not os.environ.get('NCCL_DEBUG'):
    os.environ['NCCL_DEBUG'] = 'WARN'  # Change to INFO for more verbose debugging
if not os.environ.get('NCCL_IB_DISABLE'):
    os.environ['NCCL_IB_DISABLE'] = '1'  # Disable InfiniBand to use Ethernet
if not os.environ.get('NCCL_SOCKET_IFNAME'):
    os.environ['NCCL_SOCKET_IFNAME'] = 'eth0'  # Use ethernet interface
if not os.environ.get('NCCL_P2P_DISABLE'):
    os.environ['NCCL_P2P_DISABLE'] = '1'  # Disable P2P to reduce memory pressure


🛠️ Refactor suggestion

⚠️ Potential issue

Fix env var ordering, avoid hard-coded NCCL defaults, and remove duplicate import (CI I001).

  • PYTORCH_CUDA_ALLOC_CONF and NCCL vars must be set before any CUDA use; here they’re set after torch.cuda.manual_seed_all and cudnn toggles, so they may be ignored and/or ineffective.
  • Forcing NCCL_IB_DISABLE=1, NCCL_P2P_DISABLE=1, and NCCL_SOCKET_IFNAME=eth0 by default can severely degrade performance or break networking on many setups.
  • Duplicate import os here triggers the CI isort error.

Action:

  • Move this env configuration above the first Torch/CUDA usage (ideally before importing torch).
  • Don’t force IB/P2P/interface defaults globally; make them opt-in via config/env.
  • Drop the duplicate import to fix CI.

Apply this minimal diff to fix the CI failure now (removes the duplicate import); follow-up placement change is shown below:

-# Set CUDA memory allocator configuration to prevent fragmentation
-import os
-if not os.environ.get('PYTORCH_CUDA_ALLOC_CONF'):
+# Set CUDA memory allocator configuration to prevent fragmentation
+if not os.environ.get('PYTORCH_CUDA_ALLOC_CONF'):
     os.environ['PYTORCH_CUDA_ALLOC_CONF'] = 'expandable_segments:True'

Recommended relocation (outside this hunk): place the env setup immediately after the stdlib imports and before any torch import/usage:

# --- stdlib imports ---
import os
# ... other stdlib imports ...

# Must run BEFORE importing torch / using CUDA
os.environ.setdefault('PYTORCH_CUDA_ALLOC_CONF', 'expandable_segments:True')

# Make NCCL tweaks opt‑in to avoid breaking fast paths by default
if os.environ.get('TPLR_ENABLE_NCCL_TWEAKS') == '1':
    os.environ.setdefault('NCCL_DEBUG', 'WARN')
    # Only set these if you explicitly opt‑in; otherwise leave system defaults
    if os.environ.get('TPLR_DISABLE_NCCL_IB') == '1':
        os.environ.setdefault('NCCL_IB_DISABLE', '1')
    if os.environ.get('TPLR_DISABLE_NCCL_P2P') == '1':
        os.environ.setdefault('NCCL_P2P_DISABLE', '1')
    # Don’t hard‑code an interface; allow override via TPLR_NCCL_SOCKET_IFNAME
    if os.environ.get('TPLR_NCCL_SOCKET_IFNAME'):
        os.environ.setdefault('NCCL_SOCKET_IFNAME', os.environ['TPLR_NCCL_SOCKET_IFNAME'])

# Now import torch / use CUDA
import torch
🧰 Tools
🪛 GitHub Actions: CI

[error] 69-69: I001 Import block is un-sorted or un-formatted.

🤖 Prompt for AI Agents
In neurons/validator.py around lines 68-82 there is a duplicate import os and
environment variable setup for PYTORCH_CUDA_ALLOC_CONF and NCCL that is placed
after CUDA/torch usage and force-sets NCCL defaults; remove the duplicate import
to resolve the CI isort error, and make the NCCL tweaks opt-in and non-invasive
by using os.environ.setdefault for PYTORCH_CUDA_ALLOC_CONF and gating
NCCL_IB_DISABLE, NCCL_P2P_DISABLE and NCCL_SOCKET_IFNAME behind explicit opt-in
flags (e.g. TPLR_ENABLE_NCCL_TWEAKS and per-feature flags), and plan to relocate
this entire env setup to run before any torch import/usage (immediately after
stdlib imports) in a follow-up commit.

Comment on lines +184 to +240
def _check_memory_and_cleanup(self, context: str = "unknown") -> bool:
    """Check GPU memory usage and perform cleanup if needed.
    Returns True if memory is available, False if critically low.
    """
    if not torch.cuda.is_available():
        return True

    # Get memory info
    memory_allocated = torch.cuda.memory_allocated()
    memory_reserved = torch.cuda.memory_reserved()
    memory_total = torch.cuda.get_device_properties(0).total_memory

    # Calculate usage percentages
    allocated_pct = (memory_allocated / memory_total) * 100
    reserved_pct = (memory_reserved / memory_total) * 100

    # Log memory usage
    if self.is_master:
        tplr.log_with_context(
            level="debug",
            message=f"Memory check ({context}) - Allocated: {allocated_pct:.1f}% ({memory_allocated/1024**3:.2f}GB), "
            f"Reserved: {reserved_pct:.1f}% ({memory_reserved/1024**3:.2f}GB)",
            sync_window=self.sync_window,
            current_window=self.current_window,
        )

    # If allocated memory is over 85%, perform cleanup
    if allocated_pct > 85.0:
        if self.is_master:
            tplr.log_with_context(
                level="warning",
                message=f"High memory usage detected ({allocated_pct:.1f}%), performing cleanup",
                sync_window=self.sync_window,
                current_window=self.current_window,
            )
        torch.cuda.empty_cache()
        import gc
        gc.collect()
        torch.cuda.synchronize()

        # Check again after cleanup
        new_allocated = torch.cuda.memory_allocated()
        new_allocated_pct = (new_allocated / memory_total) * 100

        # If still critically low after cleanup, return False
        if new_allocated_pct > 90.0:
            if self.is_master:
                tplr.log_with_context(
                    level="error",
                    message=f"Critical memory usage even after cleanup ({new_allocated_pct:.1f}%)",
                    sync_window=self.sync_window,
                    current_window=self.current_window,
                )
            return False

    return True


⚠️ Potential issue

Memory percentages computed against the wrong device (device 0).

On multi‑GPU, memory_allocated() uses the current device, but get_device_properties(0) is always GPU 0. This skews the percentage and decisions.

Apply:

-        # Get memory info
-        memory_allocated = torch.cuda.memory_allocated()
-        memory_reserved = torch.cuda.memory_reserved()
-        memory_total = torch.cuda.get_device_properties(0).total_memory
+        # Get memory info on the current device
+        dev = torch.cuda.current_device()
+        memory_allocated = torch.cuda.memory_allocated(device=dev)
+        memory_reserved = torch.cuda.memory_reserved(device=dev)
+        memory_total = torch.cuda.get_device_properties(dev).total_memory

Optional: consider thresholding on reserved_pct as well, and making thresholds tunable via hparams.
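
As a generic sketch of that option (the threshold names below are assumptions, not existing hparams in this repository), the gate could read both percentages against tunable values:

from dataclasses import dataclass


@dataclass
class MemThresholds:
    # Hypothetical hparams-style knobs; the real names/defaults would live in the project's config
    cleanup_pct: float = 85.0
    critical_pct: float = 90.0


def needs_cleanup(allocated_pct: float, reserved_pct: float, t: MemThresholds) -> bool:
    # Trigger cleanup when either allocated or reserved memory crosses the soft threshold
    return max(allocated_pct, reserved_pct) > t.cleanup_pct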

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
def _check_memory_and_cleanup(self, context: str = "unknown") -> bool:
    """Check GPU memory usage and perform cleanup if needed.
    Returns True if memory is available, False if critically low.
    """
    if not torch.cuda.is_available():
        return True

    # Get memory info on the current device
    dev = torch.cuda.current_device()
    memory_allocated = torch.cuda.memory_allocated(device=dev)
    memory_reserved = torch.cuda.memory_reserved(device=dev)
    memory_total = torch.cuda.get_device_properties(dev).total_memory

    # Calculate usage percentages
    allocated_pct = (memory_allocated / memory_total) * 100
    reserved_pct = (memory_reserved / memory_total) * 100

    # Log memory usage
    if self.is_master:
        tplr.log_with_context(
            level="debug",
            message=f"Memory check ({context}) - Allocated: {allocated_pct:.1f}% ({memory_allocated/1024**3:.2f}GB), "
            f"Reserved: {reserved_pct:.1f}% ({memory_reserved/1024**3:.2f}GB)",
            sync_window=self.sync_window,
            current_window=self.current_window,
        )

    # If allocated memory is over 85%, perform cleanup
    if allocated_pct > 85.0:
        if self.is_master:
            tplr.log_with_context(
                level="warning",
                message=f"High memory usage detected ({allocated_pct:.1f}%), performing cleanup",
                sync_window=self.sync_window,
                current_window=self.current_window,
            )
        torch.cuda.empty_cache()
        import gc
        gc.collect()
        torch.cuda.synchronize()

        # Check again after cleanup
        new_allocated = torch.cuda.memory_allocated(device=dev)
        new_allocated_pct = (new_allocated / memory_total) * 100

        # If still critically low after cleanup, return False
        if new_allocated_pct > 90.0:
            if self.is_master:
                tplr.log_with_context(
                    level="error",
                    message=f"Critical memory usage even after cleanup ({new_allocated_pct:.1f}%)",
                    sync_window=self.sync_window,
                    current_window=self.current_window,
                )
            return False

    return True
🤖 Prompt for AI Agents
In neurons/validator.py around lines 184-240, the code calculates GPU memory
percentages against device 0 (get_device_properties(0)) while
torch.cuda.memory_allocated() and memory_reserved() use the current CUDA device,
producing incorrect percentages on multi-GPU. Fix by querying the actual current
device (e.g., torch.cuda.current_device() or torch.cuda.device_index()) and use
that device id for get_device_properties(...); recompute memory_total from that
device so allocated_pct and reserved_pct are correct. Also make the cleanup
thresholds (allocated and optional reserved) configurable via validator
hparams/params and, if desired, add a reserved_pct check mirroring the
allocated_pct logic.

Comment on lines +1318 to +1329
# Check memory before baseline evaluation
if not self._check_memory_and_cleanup("before_baseline_eval"):
    if self.is_master:
        tplr.log_with_context(
            level="critical",
            message="Insufficient memory for baseline evaluation. Skipping evaluation window.",
            sync_window=self.sync_window,
            current_window=self.current_window,
        )
    continue

try:

⚠️ Potential issue

Avoid rank divergence: baseline memory check must be a consensus, not per‑rank early-continue.

A single rank hitting continue here will desynchronize later barriers and can deadlock. Use an all‑rank gate and keep ranks aligned.

-            # Check memory before baseline evaluation
-            if not self._check_memory_and_cleanup("before_baseline_eval"):
-                if self.is_master:
-                    tplr.log_with_context(
-                        level="critical",
-                        message="Insufficient memory for baseline evaluation. Skipping evaluation window.",
-                        sync_window=self.sync_window,
-                        current_window=self.current_window,
-                    )
-                continue
+            # Check memory before baseline evaluation (global consensus)
+            mem_ok_local = self._check_memory_and_cleanup("before_baseline_eval")
+            mem_ok_global = dist_helper.all_ok(mem_ok_local, self.device, tag="mem_ok_baseline")
+            if not mem_ok_global:
+                if self.is_master:
+                    tplr.log_with_context(
+                        level="critical",
+                        message="Insufficient memory for baseline evaluation on at least one rank. Skipping this window.",
+                        sync_window=self.sync_window,
+                        current_window=self.current_window,
+                    )
+                dist_helper.safe_barrier("bail_baseline_eval", self.local_rank)
+                self.global_step += 1
+                continue
📝 Committable suggestion

Suggested change
# Check memory before baseline evaluation (global consensus)
mem_ok_local = self._check_memory_and_cleanup("before_baseline_eval")
mem_ok_global = dist_helper.all_ok(mem_ok_local, self.device, tag="mem_ok_baseline")
if not mem_ok_global:
    if self.is_master:
        tplr.log_with_context(
            level="critical",
            message="Insufficient memory for baseline evaluation on at least one rank. Skipping this window.",
            sync_window=self.sync_window,
            current_window=self.current_window,
        )
    dist_helper.safe_barrier("bail_baseline_eval", self.local_rank)
    self.global_step += 1
    continue

try:
🤖 Prompt for AI Agents
In neurons/validator.py around lines 1318 to 1329, the memory-check currently
does per-rank early-continue which can desynchronize ranks and deadlock; change
it to perform a collective consensus (e.g., all-reduce/any-reduce or broadcast a
boolean) so every rank knows if any rank failed the memory check, then have all
ranks skip the baseline evaluation in lockstep; only the master should log the
message once when the consensus indicates insufficient memory; ensure existing
barrier/synchronization points remain intact and do not exit early on a
single-rank result.
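
The suggested diff leans on a project helper (dist_helper.all_ok). As a generic illustration, the same gate can be built directly on torch.distributed; this is a sketch assuming the default process group is already initialized, not the repository's actual helper:

import torch
import torch.distributed as dist


def all_ranks_ok(local_ok: bool, device: torch.device) -> bool:
    """Return True only if every rank reports local_ok=True (logical AND via MIN all-reduce)."""
    if not (dist.is_available() and dist.is_initialized()):
        return local_ok  # single-process fallback
    flag = torch.tensor([1 if local_ok else 0], device=device)
    dist.all_reduce(flag, op=dist.ReduceOp.MIN)
    return bool(flag.item())

Each rank computes its local memory check, then all ranks call the reduction before deciding to skip, so they proceed or bail in lockstep and still reach the shared barrier.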

Comment on lines +1421 to +1432
# Check memory before each UID evaluation
if not self._check_memory_and_cleanup(f"before_uid_{eval_uid}"):
    if self.is_master:
        tplr.log_with_context(
            level="warning",
            message=f"Insufficient memory for UID {eval_uid} evaluation. Skipping this UID.",
            sync_window=self.sync_window,
            current_window=self.current_window,
            eval_uid=eval_uid,
        )
    continue


⚠️ Potential issue

Per‑UID memory precheck must also be global and still hit the end‑of‑iter barrier.

Early continue on a subset of ranks will skip the per‑iteration "end_eval_iter" barrier and can hang others.

-                # Check memory before each UID evaluation
-                if not self._check_memory_and_cleanup(f"before_uid_{eval_uid}"):
-                    if self.is_master:
-                        tplr.log_with_context(
-                            level="warning",
-                            message=f"Insufficient memory for UID {eval_uid} evaluation. Skipping this UID.",
-                            sync_window=self.sync_window,
-                            current_window=self.current_window,
-                            eval_uid=eval_uid,
-                        )
-                    continue
+                # Check memory before each UID evaluation (global consensus)
+                mem_ok_local = self._check_memory_and_cleanup(f"before_uid_{eval_uid}")
+                mem_ok_global = dist_helper.all_ok(mem_ok_local, self.device, tag=f"mem_ok_uid_{eval_uid}")
+                if not mem_ok_global:
+                    if self.is_master:
+                        tplr.log_with_context(
+                            level="warning",
+                            message=f"Insufficient memory on at least one rank for UID {eval_uid}. Skipping UID.",
+                            sync_window=self.sync_window,
+                            current_window=self.current_window,
+                            eval_uid=eval_uid,
+                        )
+                    # Mark attempted so counters remain consistent
+                    uids_attempted_this_window.add(eval_uid)
+                    self.peers_last_eval_window[eval_uid] = self.sync_window
+                    # Align with the per‑iteration barrier then skip
+                    dist_helper.safe_barrier("end_eval_iter", self.local_rank)
+                    continue
📝 Committable suggestion

Suggested change
# Check memory before each UID evaluation (global consensus)
mem_ok_local = self._check_memory_and_cleanup(f"before_uid_{eval_uid}")
mem_ok_global = dist_helper.all_ok(mem_ok_local, self.device, tag=f"mem_ok_uid_{eval_uid}")
if not mem_ok_global:
    if self.is_master:
        tplr.log_with_context(
            level="warning",
            message=f"Insufficient memory on at least one rank for UID {eval_uid}. Skipping UID.",
            sync_window=self.sync_window,
            current_window=self.current_window,
            eval_uid=eval_uid,
        )
    # Mark attempted so counters remain consistent
    uids_attempted_this_window.add(eval_uid)
    self.peers_last_eval_window[eval_uid] = self.sync_window
    # Align with the per-iteration barrier then skip
    dist_helper.safe_barrier("end_eval_iter", self.local_rank)
    continue
🤖 Prompt for AI Agents
In neurons/validator.py around lines 1421-1432, the per-UID memory check
currently does a local early continue which can cause some ranks to skip the
per-iteration "end_eval_iter" barrier and hang others; instead, compute a local
boolean (local_ok = self._check_memory_and_cleanup(...)), perform a global
reduction (e.g., all-reduce AND across ranks) to produce global_ok, log the
warning only on master when global_ok is False, then ensure all ranks hit the
end-of-iteration barrier before skipping that UID (i.e., if not global_ok then
call the barrier and continue), so no rank leaves the barrier early.

Comment on lines +2368 to 2400
try:
    # Check memory before FSDP checkpoint save
    self._check_memory_and_cleanup(operation="FSDP_checkpoint_save")

    handle = await self.ckpt.save_local_async(
        model=self.model,
        window=self.sync_window,
        sync_window=self.sync_window,
        topology="FSDP",
    )

    # Schedule an upload that will wait for the save to finish, then upload in background
    await self.ckpt.upload(
        window=self.sync_window,
        background=True,
        delete_local_on_success=True,
        wait_for=handle,
    )
    # Schedule an upload that will wait for the save to finish, then upload in background
    await self.ckpt.upload(
        window=self.sync_window,
        background=True,
        delete_local_on_success=True,
        wait_for=handle,
    )

    # Clean up after checkpoint save
    torch.cuda.empty_cache()

except (torch.cuda.OutOfMemoryError, RuntimeError) as e:
    if "CUDA out of memory" in str(e) or "NCCL" in str(e):
        tplr.logger.warning(f"CUDA/NCCL error during FSDP checkpoint save at step {self.global_step}: {e}")
        # Force memory cleanup and continue
        torch.cuda.empty_cache()
        if torch.cuda.is_available():
            torch.cuda.synchronize()
        tplr.logger.info("Continuing without checkpoint save due to memory/NCCL constraints")
    else:
        raise


⚠️ Potential issue

Wrong parameter name in _check_memory_and_cleanup and skip logic not honored.

  • _check_memory_and_cleanup takes context, not operation → this will raise TypeError.
  • If memory is critically low, we should skip the save path to avoid failures.

Apply:

-                    # Check memory before FSDP checkpoint save
-                    self._check_memory_and_cleanup(operation="FSDP_checkpoint_save")
-                    
-                    handle = await self.ckpt.save_local_async(
-                        model=self.model,
-                        window=self.sync_window,
-                        sync_window=self.sync_window,
-                        topology="FSDP",
-                    )
+                    # Check memory before FSDP checkpoint save
+                    mem_ok = self._check_memory_and_cleanup(context="FSDP_checkpoint_save")
+                    if not mem_ok:
+                        tplr.logger.warning("Skipping FSDP checkpoint save due to low GPU memory.")
+                    else:
+                        handle = await self.ckpt.save_local_async(
+                            model=self.model,
+                            window=self.sync_window,
+                            sync_window=self.sync_window,
+                            topology="FSDP",
+                        )
 
-                    # Schedule an upload that will wait for the save to finish, then upload in background
-                    await self.ckpt.upload(
-                        window=self.sync_window,
-                        background=True,
-                        delete_local_on_success=True,
-                        wait_for=handle,
-                    )
+                        # Schedule an upload that will wait for the save to finish, then upload in background
+                        await self.ckpt.upload(
+                            window=self.sync_window,
+                            background=True,
+                            delete_local_on_success=True,
+                            wait_for=handle,
+                        )
                     
-                    # Clean up after checkpoint save
-                    torch.cuda.empty_cache()
+                    # Clean up after checkpoint save
+                    if torch.cuda.is_available():
+                        torch.cuda.empty_cache()
📝 Committable suggestion

Suggested change
try:
    # Check memory before FSDP checkpoint save
    mem_ok = self._check_memory_and_cleanup(context="FSDP_checkpoint_save")
    if not mem_ok:
        tplr.logger.warning("Skipping FSDP checkpoint save due to low GPU memory.")
    else:
        handle = await self.ckpt.save_local_async(
            model=self.model,
            window=self.sync_window,
            sync_window=self.sync_window,
            topology="FSDP",
        )

        # Schedule an upload that will wait for the save to finish, then upload in background
        await self.ckpt.upload(
            window=self.sync_window,
            background=True,
            delete_local_on_success=True,
            wait_for=handle,
        )

    # Clean up after checkpoint save
    if torch.cuda.is_available():
        torch.cuda.empty_cache()

except (torch.cuda.OutOfMemoryError, RuntimeError) as e:
    if "CUDA out of memory" in str(e) or "NCCL" in str(e):
        tplr.logger.warning(f"CUDA/NCCL error during FSDP checkpoint save at step {self.global_step}: {e}")
        # Force memory cleanup and continue
        torch.cuda.empty_cache()
        if torch.cuda.is_available():
            torch.cuda.synchronize()
        tplr.logger.info("Continuing without checkpoint save due to memory/NCCL constraints")
    else:
        raise
