Conversation

@codebasecomprehension

Summary

This PR adds a new memory_utils module with utilities for GPU memory monitoring and optimization in SGLang.

Changes

  • python/sglang/utils/memory_utils.py: New module with:

    • GPUMemoryTracker: Class for tracking memory across multiple GPUs
    • track_memory(): Context manager for profiling memory usage
    • get_available_gpu_memory(): Query free GPU memory
    • get_gpu_memory_utilization(): Get memory usage percentage
    • clear_gpu_memory_cache(): Release cached memory
    • estimate_model_memory_requirements(): Estimate memory needs
  • tests/test_memory_utils.py: Unit tests for all functions

Motivation

GPU memory monitoring is essential for:

  1. Optimizing batch sizes during inference
  2. Debugging out-of-memory errors
  3. Planning resource allocation for deployments
  4. Profiling performance of different model configurations

Example Usage

from sglang.utils import track_memory, get_gpu_memory_utilization

# Simple usage
with track_memory("batch_inference"):
    results = model.generate(requests)

# Check current utilization
utilization = get_gpu_memory_utilization()
print(f"GPU usage: {utilization:.1f}%")

Add memory_utils module with GPUMemoryTracker class,
track_memory context manager, and helper functions for
monitoring GPU memory during LLM inference.
Add comprehensive unit tests for GPUMemoryTracker,
track_memory context manager, and helper functions.
@gemini-code-assist
Contributor

Summary of Changes

Hello @codebasecomprehension, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly enhances SGLang's capabilities by introducing a robust set of GPU memory management tools. The new utilities enable detailed monitoring, profiling, and estimation of GPU memory usage, which is crucial for optimizing large language model inference, preventing out-of-memory errors, and efficiently planning resource allocation in deployment environments.

Highlights

  • New GPU Memory Utilities Module: Introduced python/sglang/utils/memory_utils.py, a new module dedicated to GPU memory monitoring and optimization within SGLang.
  • GPUMemoryTracker Class: Added GPUMemoryTracker, a class designed to track allocated, reserved, and peak memory usage across multiple CUDA devices, providing detailed statistics and summaries.
  • Memory Tracking Context Manager: Implemented track_memory(), a context manager that simplifies profiling GPU memory usage around specific code blocks, automatically reporting delta and peak memory on exit.
  • GPU Memory Helper Functions: Provided utility functions including get_available_gpu_memory() to query free memory, get_gpu_memory_utilization() for percentage usage, and clear_gpu_memory_cache() to release cached GPU memory.
  • Model Memory Estimation: Included estimate_model_memory_requirements(), a function to estimate memory needs for model weights, activations, and KV cache based on parameters and precision.
  • Comprehensive Unit Tests: Added test/test_memory_utils.py with thorough unit tests covering all new classes and functions to ensure correctness and reliability.


gemini-code-assist bot left a comment

Code Review

This pull request introduces a valuable set of GPU memory utilities. The implementation is a good starting point, but I've identified several critical and high-severity issues that need to be addressed. These include incorrect peak memory reporting in GPUMemoryTracker, a buggy implementation of clear_gpu_memory_cache, and misleading calculations in estimate_model_memory_requirements. I've provided detailed comments and suggestions for each of these points.

Additionally, the accompanying tests are quite basic and primarily check API contracts rather than functional correctness. I strongly recommend enhancing the tests to actually allocate memory and verify that the utilities report and manage it as expected. This will be crucial for ensuring the reliability of these tools.
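
As a concrete illustration of that recommendation, a functional test could allocate real GPU memory and assert that the tracker sees it. This is only a sketch: it assumes a start_tracking()/stop_tracking() API on GPUMemoryTracker and skips when CUDA is unavailable.

import pytest
import torch

from sglang.utils.memory_utils import GPUMemoryTracker


@pytest.mark.skipif(not torch.cuda.is_available(), reason="requires a CUDA device")
def test_tracker_reports_real_allocation():
    tracker = GPUMemoryTracker()
    tracker.start_tracking()  # assumed API, see the review discussion below

    # Allocate ~512 MiB on the current device (128M float32 elements * 4 bytes).
    buf = torch.empty(128 * 1024 * 1024, dtype=torch.float32, device="cuda")

    stats = tracker.stop_tracking()
    delta_gb = stats[torch.cuda.current_device()]["allocated_delta_gb"]

    # The reported delta should reflect the allocation, not just satisfy the API contract.
    assert delta_gb >= 0.4

    del buf
    torch.cuda.empty_cache()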

Comment on lines +81 to +90
current_allocated = torch.cuda.memory_allocated(i)
current_reserved = torch.cuda.memory_reserved(i)

stats[i] = {
    'allocated_gb': current_allocated / (1024**3),
    'reserved_gb': current_reserved / (1024**3),
    'peak_allocated_gb': self._peak_allocated[i] / (1024**3),
    'max_reserved_gb': self._max_reserved[i] / (1024**3),
    'allocated_delta_gb': (current_allocated - self._initial_allocated[i]) / (1024**3),
}

critical

The stop_tracking method does not update the peak memory usage before returning the statistics. It relies on get_stats being called to update the peak values. If get_stats is not called during the tracking period, the reported peak memory will be incorrect (it will be the same as the initial memory). To fix this, the peak memory should be updated within stop_tracking itself, just before building the stats dictionary.

Suggested change

-current_allocated = torch.cuda.memory_allocated(i)
-current_reserved = torch.cuda.memory_reserved(i)
-stats[i] = {
-    'allocated_gb': current_allocated / (1024**3),
-    'reserved_gb': current_reserved / (1024**3),
-    'peak_allocated_gb': self._peak_allocated[i] / (1024**3),
-    'max_reserved_gb': self._max_reserved[i] / (1024**3),
-    'allocated_delta_gb': (current_allocated - self._initial_allocated[i]) / (1024**3),
-}
+current_allocated = torch.cuda.memory_allocated(i)
+current_reserved = torch.cuda.memory_reserved(i)
+# Update peak tracking before reporting
+self._peak_allocated[i] = max(self._peak_allocated[i], current_allocated)
+self._max_reserved[i] = max(self._max_reserved[i], current_reserved)
+stats[i] = {
+    'allocated_gb': current_allocated / (1024**3),
+    'reserved_gb': current_reserved / (1024**3),
+    'peak_allocated_gb': self._peak_allocated[i] / (1024**3),
+    'max_reserved_gb': self._max_reserved[i] / (1024**3),
+    'allocated_delta_gb': (current_allocated - self._initial_allocated[i]) / (1024**3),
+}
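
To see why this is marked critical: if memory is still held when tracking stops but get_stats() was never called in between, the unpatched code reports a peak equal to the initial value. A minimal reproduction, again assuming a start_tracking()/stop_tracking() API:

import torch
from sglang.utils.memory_utils import GPUMemoryTracker

tracker = GPUMemoryTracker()
tracker.start_tracking()  # assumed API

buf = torch.empty(256 * 1024 * 1024, dtype=torch.float32, device="cuda")  # ~1 GiB, still live

# No get_stats() call happens here, so the polled peak is never refreshed.
stats = tracker.stop_tracking()

# Without the fix, peak_allocated_gb still shows the pre-allocation value even though
# ~1 GiB is currently allocated; with the fix it reflects the live allocation.
print(stats[torch.cuda.current_device()]["peak_allocated_gb"])

Note that even with the fix, a polling-based tracker cannot observe spikes that are freed before stop_tracking(); torch.cuda.max_memory_allocated() and torch.cuda.reset_peak_memory_stats() would be needed for that.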

Comment on lines +264 to +269
if device is not None:
    if device >= 0 and device < torch.cuda.device_count():
        torch.cuda.empty_cache()
else:
    for _ in range(torch.cuda.device_count()):
        torch.cuda.empty_cache()

critical

The clear_gpu_memory_cache function is implemented incorrectly. torch.cuda.empty_cache() operates on the current device. The code does not switch to the specified device when it's provided, and when device is None, it repeatedly calls empty_cache() on the current device instead of iterating over all devices. This can lead to memory not being cleared on the intended GPUs.

Suggested change

-if device is not None:
-    if device >= 0 and device < torch.cuda.device_count():
-        torch.cuda.empty_cache()
-else:
-    for _ in range(torch.cuda.device_count()):
-        torch.cuda.empty_cache()
+if device is not None:
+    if 0 <= device < torch.cuda.device_count():
+        with torch.cuda.device(device):
+            torch.cuda.empty_cache()
+else:
+    for i in range(torch.cuda.device_count()):
+        with torch.cuda.device(i):
+            torch.cuda.empty_cache()
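
The per-device pattern in the suggestion can be checked in isolation with plain PyTorch, independent of the new module:

import torch

# Reserved (cached) memory is tracked per device, and empty_cache() only acts on the
# current device, so each device must be made current before clearing its cache.
for i in range(torch.cuda.device_count()):
    before = torch.cuda.memory_reserved(i)
    with torch.cuda.device(i):
        torch.cuda.empty_cache()
    after = torch.cuda.memory_reserved(i)
    print(f"cuda:{i}: reserved {before / 1024**2:.1f} MiB -> {after / 1024**2:.1f} MiB")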

Comment on lines +324 to +343
# Activation memory estimate (rough approximation: ~2x params per layer for bf16)
if num_layers is not None:
    activation_memory = (num_parameters * bytes_per_param * 2) / (1024**3)
    estimates["activations_gb"] = activation_memory

# KV cache per token estimate
if num_heads is not None and head_dim is not None:
    # 2 * num_layers * num_heads * head_dim * bytes_per_param
    # Assuming default 32 layers if not specified
    layers_for_kv = num_layers or 32
    kv_per_token = (2 * layers_for_kv * num_heads * head_dim * bytes_per_param) / (1024**3)
    estimates["kv_cache_per_token_gb"] = kv_per_token

# Total estimate
total = weight_memory
if "activations_gb" in estimates:
    total += estimates["activations_gb"]
estimates["total_estimate_gb"] = total

return estimates

high

There are a couple of issues with the memory estimation logic:

  1. The activation memory estimation (activation_memory) seems incorrect for inference. The formula (num_parameters * bytes_per_param * 2) is a very rough approximation typically used for training, not inference. For inference, activation memory is proportional to batch_size * sequence_length * hidden_size * num_layers, not the number of parameters. This leads to a significant overestimation.
  2. The total_estimate_gb does not include the KV cache memory, which is a critical component of memory usage during inference and can often be larger than the weights themselves.

I recommend removing the activations_gb calculation as it's misleading for inference. The total_estimate_gb should also be removed or its calculation clarified in the docstring to note that it excludes dynamic components like KV cache and activations, which depend on runtime factors like batch size and sequence length.

Suggested change

-# Activation memory estimate (rough approximation: ~2x params per layer for bf16)
-if num_layers is not None:
-    activation_memory = (num_parameters * bytes_per_param * 2) / (1024**3)
-    estimates["activations_gb"] = activation_memory
-# KV cache per token estimate
-if num_heads is not None and head_dim is not None:
-    # 2 * num_layers * num_heads * head_dim * bytes_per_param
-    # Assuming default 32 layers if not specified
-    layers_for_kv = num_layers or 32
-    kv_per_token = (2 * layers_for_kv * num_heads * head_dim * bytes_per_param) / (1024**3)
-    estimates["kv_cache_per_token_gb"] = kv_per_token
-# Total estimate
-total = weight_memory
-if "activations_gb" in estimates:
-    total += estimates["activations_gb"]
-estimates["total_estimate_gb"] = total
-return estimates
+# Activation memory for inference is dynamic (proportional to batch_size * seq_len)
+# and typically much smaller than weight memory. A static estimation based on
+# num_parameters is often misleading, so it's omitted here.
+# KV cache per token estimate
+if num_heads is not None and head_dim is not None:
+    # Formula: 2 (K/V) * num_layers * num_heads * head_dim * bytes_per_param
+    # A default of 32 layers is assumed if not provided for the KV cache calculation.
+    layers_for_kv = num_layers or 32
+    kv_per_token = (2 * layers_for_kv * num_heads * head_dim * bytes_per_param) / (1024**3)
+    estimates["kv_cache_per_token_gb"] = kv_per_token
+# The total estimate here only includes model weights.
+# Dynamic components like KV cache and activations must be added separately
+# based on runtime parameters (batch size, sequence length).
+estimates["total_estimate_gb"] = weight_memory
+return estimates
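
To make the kv_cache_per_token_gb formula concrete, here is the arithmetic for a hypothetical 7B-class configuration (32 layers, 32 KV heads, head_dim 128, bf16); the numbers are illustrative and not taken from this PR:

# KV cache per token = 2 (K and V) * num_layers * num_heads * head_dim * bytes_per_param
num_layers, num_heads, head_dim, bytes_per_param = 32, 32, 128, 2  # hypothetical bf16 config

kv_bytes_per_token = 2 * num_layers * num_heads * head_dim * bytes_per_param
print(kv_bytes_per_token)                   # 524288 bytes, i.e. 0.5 MiB per token

# A 4096-token sequence therefore needs about 2 GiB of KV cache on top of the weights.
print(kv_bytes_per_token * 4096 / 1024**3)  # 2.0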

Comment on lines +217 to +223
torch.cuda.synchronize(gpu_id)
props = torch.cuda.get_device_properties(gpu_id)
reserved = torch.cuda.memory_reserved(gpu_id)
allocated = torch.cuda.memory_allocated(gpu_id)
free = props.total_memory - reserved

return free / (1024**3)

medium

The current implementation to get available GPU memory is a bit indirect and has an unused variable allocated. A simpler and more standard way to get the free GPU memory is to use torch.cuda.mem_get_info(gpu_id). This returns a tuple of (free, total) memory in bytes, where free is what's generally considered available.

Suggested change

-torch.cuda.synchronize(gpu_id)
-props = torch.cuda.get_device_properties(gpu_id)
-reserved = torch.cuda.memory_reserved(gpu_id)
-allocated = torch.cuda.memory_allocated(gpu_id)
-free = props.total_memory - reserved
-return free / (1024**3)
+torch.cuda.synchronize(gpu_id)
+free_mem, _ = torch.cuda.mem_get_info(gpu_id)
+return free_mem / (1024**3)
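
For reference, torch.cuda.mem_get_info() queries the driver and returns (free, total) in bytes, so it also reflects memory held by other processes on the GPU; the utilization calculation follows directly:

import torch

free_bytes, total_bytes = torch.cuda.mem_get_info(0)  # device index 0
print(f"free: {free_bytes / 1024**3:.2f} GiB of {total_bytes / 1024**3:.2f} GiB")
print(f"utilization: {100 * (1 - free_bytes / total_bytes):.1f}%")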
