Conversation

@codebasecomprehension

Summary

This PR adds a new memory_utils module with utilities for GPU memory monitoring and optimization in SGLang.

Changes

  • python/sglang/utils/memory_utils.py: New module with:

    • GPUMemoryTracker: Class for tracking memory across multiple GPUs
    • track_memory(): Context manager for profiling memory usage
    • get_available_gpu_memory(): Query free GPU memory
    • get_gpu_memory_utilization(): Get memory usage percentage
    • clear_gpu_memory_cache(): Release cached memory
    • estimate_model_memory_requirements(): Estimate memory needs
  • tests/test_memory_utils.py: Unit tests for all functions

Motivation

GPU memory monitoring is essential for:

  1. Optimizing batch sizes during inference
  2. Debugging out-of-memory errors
  3. Planning resource allocation for deployments
  4. Profiling performance of different model configurations

Example Usage

from sglang.utils import track_memory, get_gpu_memory_utilization

# Simple usage
with track_memory("batch_inference"):
    results = model.generate(requests)

# Check current utilization
utilization = get_gpu_memory_utilization()
print(f"GPU usage: {utilization:.1f}%")

Add memory_utils module with GPUMemoryTracker class,
track_memory context manager, and helper functions for
monitoring GPU memory during LLM inference.
Add comprehensive unit tests for GPUMemoryTracker,
track_memory context manager, and helper functions.
@gemini-code-assist
Contributor

Summary of Changes

Hello @codebasecomprehension, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly enhances SGLang's capabilities by introducing a robust set of GPU memory management tools. The new utilities enable detailed monitoring, profiling, and estimation of GPU memory usage, which is crucial for optimizing large language model inference, preventing out-of-memory errors, and efficiently planning resource allocation in deployment environments.

Highlights

  • New GPU Memory Utilities Module: Introduced python/sglang/utils/memory_utils.py, a new module dedicated to GPU memory monitoring and optimization within SGLang.
  • GPUMemoryTracker Class: Added GPUMemoryTracker, a class designed to track allocated, reserved, and peak memory usage across multiple CUDA devices, providing detailed statistics and summaries.
  • Memory Tracking Context Manager: Implemented track_memory(), a context manager that simplifies profiling GPU memory usage around specific code blocks, automatically reporting delta and peak memory on exit.
  • GPU Memory Helper Functions: Provided utility functions including get_available_gpu_memory() to query free memory, get_gpu_memory_utilization() for percentage usage, and clear_gpu_memory_cache() to release cached GPU memory.
  • Model Memory Estimation: Included estimate_model_memory_requirements(), a function to estimate memory needs for model weights, activations, and KV cache based on parameters and precision.
  • Comprehensive Unit Tests: Added test/test_memory_utils.py with thorough unit tests covering all new classes and functions to ensure correctness and reliability.


gemini-code-assist bot left a comment

Code Review

This pull request introduces a valuable set of GPU memory utilities. The implementation is a good starting point, but I've identified several critical and high-severity issues that need to be addressed. These include incorrect peak memory reporting in GPUMemoryTracker, a buggy implementation of clear_gpu_memory_cache, and misleading calculations in estimate_model_memory_requirements. I've provided detailed comments and suggestions for each of these points.

Additionally, the accompanying tests are quite basic and primarily check API contracts rather than functional correctness. I strongly recommend enhancing the tests to actually allocate memory and verify that the utilities report and manage it as expected. This will be crucial for ensuring the reliability of these tools.
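
As a concrete illustration of that recommendation, a functional test could allocate real GPU memory and assert that the tracker sees it. This is only a sketch: it assumes a start_tracking()/stop_tracking() API on GPUMemoryTracker and skips when CUDA is unavailable.

import pytest
import torch

from sglang.utils.memory_utils import GPUMemoryTracker


@pytest.mark.skipif(not torch.cuda.is_available(), reason="requires a CUDA device")
def test_tracker_reports_real_allocation():
    tracker = GPUMemoryTracker()
    tracker.start_tracking()  # assumed API, see the review discussion below

    # Allocate ~512 MiB on the current device (128M float32 elements * 4 bytes).
    buf = torch.empty(128 * 1024 * 1024, dtype=torch.float32, device="cuda")

    stats = tracker.stop_tracking()
    delta_gb = stats[torch.cuda.current_device()]["allocated_delta_gb"]

    # The reported delta should reflect the allocation, not just satisfy the API contract.
    assert delta_gb >= 0.4

    del buf
    torch.cuda.empty_cache()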

Comment on lines +81 to +90
current_allocated = torch.cuda.memory_allocated(i)
current_reserved = torch.cuda.memory_reserved(i)

stats[i] = {
    'allocated_gb': current_allocated / (1024**3),
    'reserved_gb': current_reserved / (1024**3),
    'peak_allocated_gb': self._peak_allocated[i] / (1024**3),
    'max_reserved_gb': self._max_reserved[i] / (1024**3),
    'allocated_delta_gb': (current_allocated - self._initial_allocated[i]) / (1024**3),
}

critical

The stop_tracking method does not update the peak memory usage before returning the statistics. It relies on get_stats being called to update the peak values. If get_stats is not called during the tracking period, the reported peak memory will be incorrect (it will be the same as the initial memory). To fix this, the peak memory should be updated within stop_tracking itself, just before building the stats dictionary.

Suggested change

-current_allocated = torch.cuda.memory_allocated(i)
-current_reserved = torch.cuda.memory_reserved(i)
-stats[i] = {
-    'allocated_gb': current_allocated / (1024**3),
-    'reserved_gb': current_reserved / (1024**3),
-    'peak_allocated_gb': self._peak_allocated[i] / (1024**3),
-    'max_reserved_gb': self._max_reserved[i] / (1024**3),
-    'allocated_delta_gb': (current_allocated - self._initial_allocated[i]) / (1024**3),
-}
+current_allocated = torch.cuda.memory_allocated(i)
+current_reserved = torch.cuda.memory_reserved(i)
+# Update peak tracking before reporting
+self._peak_allocated[i] = max(self._peak_allocated[i], current_allocated)
+self._max_reserved[i] = max(self._max_reserved[i], current_reserved)
+stats[i] = {
+    'allocated_gb': current_allocated / (1024**3),
+    'reserved_gb': current_reserved / (1024**3),
+    'peak_allocated_gb': self._peak_allocated[i] / (1024**3),
+    'max_reserved_gb': self._max_reserved[i] / (1024**3),
+    'allocated_delta_gb': (current_allocated - self._initial_allocated[i]) / (1024**3),
+}
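
To see why this is marked critical: if memory is still held when tracking stops but get_stats() was never called in between, the unpatched code reports a peak equal to the initial value. A minimal reproduction, again assuming a start_tracking()/stop_tracking() API:

import torch
from sglang.utils.memory_utils import GPUMemoryTracker

tracker = GPUMemoryTracker()
tracker.start_tracking()  # assumed API

buf = torch.empty(256 * 1024 * 1024, dtype=torch.float32, device="cuda")  # ~1 GiB, still live

# No get_stats() call happens here, so the polled peak is never refreshed.
stats = tracker.stop_tracking()

# Without the fix, peak_allocated_gb still shows the pre-allocation value even though
# ~1 GiB is currently allocated; with the fix it reflects the live allocation.
print(stats[torch.cuda.current_device()]["peak_allocated_gb"])

Note that even with the fix, a polling-based tracker cannot observe spikes that are freed before stop_tracking(); torch.cuda.max_memory_allocated() and torch.cuda.reset_peak_memory_stats() would be needed for that.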

Comment on lines +264 to +269
if device is not None:
    if device >= 0 and device < torch.cuda.device_count():
        torch.cuda.empty_cache()
else:
    for _ in range(torch.cuda.device_count()):
        torch.cuda.empty_cache()

critical

The clear_gpu_memory_cache function is implemented incorrectly. torch.cuda.empty_cache() operates on the current device. The code does not switch to the specified device when it's provided, and when device is None, it repeatedly calls empty_cache() on the current device instead of iterating over all devices. This can lead to memory not being cleared on the intended GPUs.

Suggested change

-if device is not None:
-    if device >= 0 and device < torch.cuda.device_count():
-        torch.cuda.empty_cache()
-else:
-    for _ in range(torch.cuda.device_count()):
-        torch.cuda.empty_cache()
+if device is not None:
+    if 0 <= device < torch.cuda.device_count():
+        with torch.cuda.device(device):
+            torch.cuda.empty_cache()
+else:
+    for i in range(torch.cuda.device_count()):
+        with torch.cuda.device(i):
+            torch.cuda.empty_cache()
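
The per-device pattern in the suggestion can be checked in isolation with plain PyTorch, independent of the new module:

import torch

# Reserved (cached) memory is tracked per device, and empty_cache() only acts on the
# current device, so each device must be made current before clearing its cache.
for i in range(torch.cuda.device_count()):
    before = torch.cuda.memory_reserved(i)
    with torch.cuda.device(i):
        torch.cuda.empty_cache()
    after = torch.cuda.memory_reserved(i)
    print(f"cuda:{i}: reserved {before / 1024**2:.1f} MiB -> {after / 1024**2:.1f} MiB")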

Comment on lines +324 to +343
# Activation memory estimate (rough approximation: ~2x params per layer for bf16)
if num_layers is not None:
    activation_memory = (num_parameters * bytes_per_param * 2) / (1024**3)
    estimates["activations_gb"] = activation_memory

# KV cache per token estimate
if num_heads is not None and head_dim is not None:
    # 2 * num_layers * num_heads * head_dim * bytes_per_param
    # Assuming default 32 layers if not specified
    layers_for_kv = num_layers or 32
    kv_per_token = (2 * layers_for_kv * num_heads * head_dim * bytes_per_param) / (1024**3)
    estimates["kv_cache_per_token_gb"] = kv_per_token

# Total estimate
total = weight_memory
if "activations_gb" in estimates:
    total += estimates["activations_gb"]
estimates["total_estimate_gb"] = total

return estimates

high

There are a couple of issues with the memory estimation logic:

  1. The activation memory estimation (activation_memory) seems incorrect for inference. The formula (num_parameters * bytes_per_param * 2) is a very rough approximation typically used for training, not inference. For inference, activation memory is proportional to batch_size * sequence_length * hidden_size * num_layers, not the number of parameters. This leads to a significant overestimation.
  2. The total_estimate_gb does not include the KV cache memory, which is a critical component of memory usage during inference and can often be larger than the weights themselves.

I recommend removing the activations_gb calculation as it's misleading for inference. The total_estimate_gb should also be removed or its calculation clarified in the docstring to note that it excludes dynamic components like KV cache and activations, which depend on runtime factors like batch size and sequence length.

Suggested change

-# Activation memory estimate (rough approximation: ~2x params per layer for bf16)
-if num_layers is not None:
-    activation_memory = (num_parameters * bytes_per_param * 2) / (1024**3)
-    estimates["activations_gb"] = activation_memory
-# KV cache per token estimate
-if num_heads is not None and head_dim is not None:
-    # 2 * num_layers * num_heads * head_dim * bytes_per_param
-    # Assuming default 32 layers if not specified
-    layers_for_kv = num_layers or 32
-    kv_per_token = (2 * layers_for_kv * num_heads * head_dim * bytes_per_param) / (1024**3)
-    estimates["kv_cache_per_token_gb"] = kv_per_token
-# Total estimate
-total = weight_memory
-if "activations_gb" in estimates:
-    total += estimates["activations_gb"]
-estimates["total_estimate_gb"] = total
-return estimates
+# Activation memory for inference is dynamic (proportional to batch_size * seq_len)
+# and typically much smaller than weight memory. A static estimation based on
+# num_parameters is often misleading, so it's omitted here.
+# KV cache per token estimate
+if num_heads is not None and head_dim is not None:
+    # Formula: 2 (K/V) * num_layers * num_heads * head_dim * bytes_per_param
+    # A default of 32 layers is assumed if not provided for the KV cache calculation.
+    layers_for_kv = num_layers or 32
+    kv_per_token = (2 * layers_for_kv * num_heads * head_dim * bytes_per_param) / (1024**3)
+    estimates["kv_cache_per_token_gb"] = kv_per_token
+# The total estimate here only includes model weights.
+# Dynamic components like KV cache and activations must be added separately
+# based on runtime parameters (batch size, sequence length).
+estimates["total_estimate_gb"] = weight_memory
+return estimates
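
To make the kv_cache_per_token_gb formula concrete, here is the arithmetic for a hypothetical 7B-class configuration (32 layers, 32 KV heads, head_dim 128, bf16); the numbers are illustrative and not taken from this PR:

# KV cache per token = 2 (K and V) * num_layers * num_heads * head_dim * bytes_per_param
num_layers, num_heads, head_dim, bytes_per_param = 32, 32, 128, 2  # hypothetical bf16 config

kv_bytes_per_token = 2 * num_layers * num_heads * head_dim * bytes_per_param
print(kv_bytes_per_token)                   # 524288 bytes, i.e. 0.5 MiB per token

# A 4096-token sequence therefore needs about 2 GiB of KV cache on top of the weights.
print(kv_bytes_per_token * 4096 / 1024**3)  # 2.0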

Comment on lines +217 to +223
torch.cuda.synchronize(gpu_id)
props = torch.cuda.get_device_properties(gpu_id)
reserved = torch.cuda.memory_reserved(gpu_id)
allocated = torch.cuda.memory_allocated(gpu_id)
free = props.total_memory - reserved

return free / (1024**3)

medium

The current implementation to get available GPU memory is a bit indirect and has an unused variable allocated. A simpler and more standard way to get the free GPU memory is to use torch.cuda.mem_get_info(gpu_id). This returns a tuple of (free, total) memory in bytes, where free is what's generally considered available.

Suggested change

-torch.cuda.synchronize(gpu_id)
-props = torch.cuda.get_device_properties(gpu_id)
-reserved = torch.cuda.memory_reserved(gpu_id)
-allocated = torch.cuda.memory_allocated(gpu_id)
-free = props.total_memory - reserved
-return free / (1024**3)
+torch.cuda.synchronize(gpu_id)
+free_mem, _ = torch.cuda.mem_get_info(gpu_id)
+return free_mem / (1024**3)
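
For reference, torch.cuda.mem_get_info() queries the driver and returns (free, total) in bytes, so it also reflects memory held by other processes on the GPU; the utilization calculation follows directly:

import torch

free_bytes, total_bytes = torch.cuda.mem_get_info(0)  # device index 0
print(f"free: {free_bytes / 1024**3:.2f} GiB of {total_bytes / 1024**3:.2f} GiB")
print(f"utilization: {100 * (1 - free_bytes / total_bytes):.1f}%")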
