feat: add GPU memory tracking utilities #16224
Conversation
Add memory_utils module with GPUMemoryTracker class, track_memory context manager, and helper functions for monitoring GPU memory during LLM inference.
Add comprehensive unit tests for GPUMemoryTracker, track_memory context manager, and helper functions.
Summary of Changes

Hello @codebasecomprehension, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request significantly enhances SGLang's capabilities by introducing a robust set of GPU memory management tools. The new utilities enable detailed monitoring, profiling, and estimation of GPU memory usage, which is crucial for optimizing large language model inference, preventing out-of-memory errors, and efficiently planning resource allocation in deployment environments.
Code Review
This pull request introduces a valuable set of GPU memory utilities. The implementation is a good starting point, but I've identified several critical and high-severity issues that need to be addressed. These include incorrect peak memory reporting in GPUMemoryTracker, a buggy implementation of clear_gpu_memory_cache, and misleading calculations in estimate_model_memory_requirements. I've provided detailed comments and suggestions for each of these points.
Additionally, the accompanying tests are quite basic and primarily check API contracts rather than functional correctness. I strongly recommend enhancing the tests to actually allocate memory and verify that the utilities report and manage it as expected. This will be crucial for ensuring the reliability of these tools.
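For example, a functional test could allocate a known amount of memory and verify that the tracker actually reports it. A minimal sketch, assuming the GPUMemoryTracker API visible in this diff (the constructor and start_tracking name are assumptions; adjust to the actual interface):

```python
import unittest

import torch

from sglang.utils.memory_utils import GPUMemoryTracker  # module added in this PR


class TestGPUMemoryTrackerFunctional(unittest.TestCase):
    @unittest.skipUnless(torch.cuda.is_available(), "requires a CUDA device")
    def test_tracker_reports_allocation_delta(self):
        tracker = GPUMemoryTracker()      # constructor signature assumed
        tracker.start_tracking()          # method name assumed

        # Allocate ~256 MiB on GPU 0 so the delta is clearly measurable.
        buf = torch.empty(64 * 1024 * 1024, dtype=torch.float32, device="cuda:0")
        torch.cuda.synchronize()

        stats = tracker.stop_tracking()   # per-device dicts, as shown in the diff
        self.assertGreaterEqual(stats[0]["allocated_delta_gb"], 0.2)
        self.assertGreaterEqual(stats[0]["peak_allocated_gb"],
                                stats[0]["allocated_delta_gb"])
        del buf
```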
```python
current_allocated = torch.cuda.memory_allocated(i)
current_reserved = torch.cuda.memory_reserved(i)

stats[i] = {
    'allocated_gb': current_allocated / (1024**3),
    'reserved_gb': current_reserved / (1024**3),
    'peak_allocated_gb': self._peak_allocated[i] / (1024**3),
    'max_reserved_gb': self._max_reserved[i] / (1024**3),
    'allocated_delta_gb': (current_allocated - self._initial_allocated[i]) / (1024**3),
}
```
The stop_tracking method does not update the peak memory usage before returning the statistics. It relies on get_stats being called to update the peak values. If get_stats is not called during the tracking period, the reported peak memory will be incorrect (it will be the same as the initial memory). To fix this, the peak memory should be updated within stop_tracking itself, just before building the stats dictionary.
Suggested change:

```diff
 current_allocated = torch.cuda.memory_allocated(i)
 current_reserved = torch.cuda.memory_reserved(i)
+# Update peak tracking before reporting
+self._peak_allocated[i] = max(self._peak_allocated[i], current_allocated)
+self._max_reserved[i] = max(self._max_reserved[i], current_reserved)
 stats[i] = {
     'allocated_gb': current_allocated / (1024**3),
     'reserved_gb': current_reserved / (1024**3),
     'peak_allocated_gb': self._peak_allocated[i] / (1024**3),
     'max_reserved_gb': self._max_reserved[i] / (1024**3),
     'allocated_delta_gb': (current_allocated - self._initial_allocated[i]) / (1024**3),
 }
```
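To make the failure mode concrete, a usage pattern like the following (constructor and start_tracking are assumed names) would report a stale peak without this change, because get_stats is never called between start and stop:

```python
import torch

from sglang.utils.memory_utils import GPUMemoryTracker

tracker = GPUMemoryTracker()   # constructor signature assumed
tracker.start_tracking()       # method name assumed

x = torch.randn(4096, 4096, device="cuda")  # allocation grows peak usage
y = x @ x

stats = tracker.stop_tracking()
# Without the suggested update, stats[0]["peak_allocated_gb"] still reflects the
# value captured at start time rather than the true peak reached above.
```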
```python
if device is not None:
    if device >= 0 and device < torch.cuda.device_count():
        torch.cuda.empty_cache()
else:
    for _ in range(torch.cuda.device_count()):
        torch.cuda.empty_cache()
```
The clear_gpu_memory_cache function is implemented incorrectly. torch.cuda.empty_cache() operates on the current device. The code does not switch to the specified device when it's provided, and when device is None, it repeatedly calls empty_cache() on the current device instead of iterating over all devices. This can lead to memory not being cleared on the intended GPUs.
Suggested change:

```diff
 if device is not None:
-    if device >= 0 and device < torch.cuda.device_count():
-        torch.cuda.empty_cache()
+    if 0 <= device < torch.cuda.device_count():
+        with torch.cuda.device(device):
+            torch.cuda.empty_cache()
 else:
-    for _ in range(torch.cuda.device_count()):
-        torch.cuda.empty_cache()
+    for i in range(torch.cuda.device_count()):
+        with torch.cuda.device(i):
+            torch.cuda.empty_cache()
```
```python
# Activation memory estimate (rough approximation: ~2x params per layer for bf16)
if num_layers is not None:
    activation_memory = (num_parameters * bytes_per_param * 2) / (1024**3)
    estimates["activations_gb"] = activation_memory

# KV cache per token estimate
if num_heads is not None and head_dim is not None:
    # 2 * num_layers * num_heads * head_dim * bytes_per_param
    # Assuming default 32 layers if not specified
    layers_for_kv = num_layers or 32
    kv_per_token = (2 * layers_for_kv * num_heads * head_dim * bytes_per_param) / (1024**3)
    estimates["kv_cache_per_token_gb"] = kv_per_token

# Total estimate
total = weight_memory
if "activations_gb" in estimates:
    total += estimates["activations_gb"]
estimates["total_estimate_gb"] = total

return estimates
```
There are a couple of issues with the memory estimation logic:

- The activation memory estimate (`activation_memory`) seems incorrect for inference. The formula `num_parameters * bytes_per_param * 2` is a very rough approximation typically used for training, not inference. For inference, activation memory is proportional to `batch_size * sequence_length * hidden_size * num_layers`, not the number of parameters, so this leads to a significant overestimation.
- The `total_estimate_gb` does not include the KV cache memory, which is a critical component of memory usage during inference and can often be larger than the weights themselves.

I recommend removing the `activations_gb` calculation, as it's misleading for inference. The `total_estimate_gb` should also be removed, or its calculation clarified in the docstring to note that it excludes dynamic components like KV cache and activations, which depend on runtime factors such as batch size and sequence length.
Suggested change:

```diff
-# Activation memory estimate (rough approximation: ~2x params per layer for bf16)
-if num_layers is not None:
-    activation_memory = (num_parameters * bytes_per_param * 2) / (1024**3)
-    estimates["activations_gb"] = activation_memory
+# Activation memory for inference is dynamic (proportional to batch_size * seq_len)
+# and typically much smaller than weight memory. A static estimation based on
+# num_parameters is often misleading, so it's omitted here.

 # KV cache per token estimate
 if num_heads is not None and head_dim is not None:
-    # 2 * num_layers * num_heads * head_dim * bytes_per_param
-    # Assuming default 32 layers if not specified
+    # Formula: 2 (K/V) * num_layers * num_heads * head_dim * bytes_per_param
+    # A default of 32 layers is assumed if not provided for the KV cache calculation.
     layers_for_kv = num_layers or 32
     kv_per_token = (2 * layers_for_kv * num_heads * head_dim * bytes_per_param) / (1024**3)
     estimates["kv_cache_per_token_gb"] = kv_per_token

-# Total estimate
-total = weight_memory
-if "activations_gb" in estimates:
-    total += estimates["activations_gb"]
-estimates["total_estimate_gb"] = total
+# The total estimate here only includes model weights.
+# Dynamic components like KV cache and activations must be added separately
+# based on runtime parameters (batch size, sequence length).
+estimates["total_estimate_gb"] = weight_memory

 return estimates
```
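For a sense of scale, plugging illustrative numbers (not from this PR) into the kv_cache_per_token_gb formula shows why omitting the KV cache from the total is misleading:

```python
# Back-of-the-envelope KV-cache sizing using the same formula as above.
# Illustrative figures: 32 layers, 32 KV heads, head_dim 128, bf16 (2 bytes).
bytes_per_param = 2
kv_per_token = 2 * 32 * 32 * 128 * bytes_per_param       # 524288 bytes ≈ 0.5 MiB
batch_size, seq_len = 8, 4096
kv_total_gib = kv_per_token * batch_size * seq_len / (1024**3)
print(f"{kv_total_gib:.1f} GiB")                          # 16.0 GiB, more than a 7B model's bf16 weights (~13 GiB)
```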
```python
torch.cuda.synchronize(gpu_id)
props = torch.cuda.get_device_properties(gpu_id)
reserved = torch.cuda.memory_reserved(gpu_id)
allocated = torch.cuda.memory_allocated(gpu_id)
free = props.total_memory - reserved

return free / (1024**3)
```
The current implementation to get available GPU memory is a bit indirect and has an unused variable allocated. A simpler and more standard way to get the free GPU memory is to use torch.cuda.mem_get_info(gpu_id). This returns a tuple of (free, total) memory in bytes, where free is what's generally considered available.
Suggested change:

```diff
 torch.cuda.synchronize(gpu_id)
-props = torch.cuda.get_device_properties(gpu_id)
-reserved = torch.cuda.memory_reserved(gpu_id)
-allocated = torch.cuda.memory_allocated(gpu_id)
-free = props.total_memory - reserved
-
-return free / (1024**3)
+free_mem, _ = torch.cuda.mem_get_info(gpu_id)
+return free_mem / (1024**3)
```
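One behavioural note on this suggestion: torch.cuda.mem_get_info queries the CUDA driver, so its free value also reflects memory held by other processes on the GPU, whereas total_memory - memory_reserved only accounts for PyTorch's own caching allocator. A quick way to compare the two views:

```python
import torch

gpu_id = 0
free_bytes, total_bytes = torch.cuda.mem_get_info(gpu_id)   # driver-level view
reserved = torch.cuda.memory_reserved(gpu_id)               # PyTorch caching allocator only
total = torch.cuda.get_device_properties(gpu_id).total_memory

print(f"driver free:      {free_bytes / 1024**3:.2f} GiB")
print(f"total - reserved: {(total - reserved) / 1024**3:.2f} GiB")
```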
Summary

This PR adds a new memory_utils module with utilities for GPU memory monitoring and optimization in SGLang.

Changes

- python/sglang/utils/memory_utils.py: New module with:
  - GPUMemoryTracker: Class for tracking memory across multiple GPUs
  - track_memory(): Context manager for profiling memory usage
  - get_available_gpu_memory(): Query free GPU memory
  - get_gpu_memory_utilization(): Get memory usage percentage
  - clear_gpu_memory_cache(): Release cached memory
  - estimate_model_memory_requirements(): Estimate memory needs
- tests/test_memory_utils.py: Unit tests for all functions

Motivation

GPU memory monitoring is essential for:

- Optimizing large language model inference
- Preventing out-of-memory errors
- Planning resource allocation in deployment environments

Example Usage
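A minimal sketch of how these helpers could be used together, assuming the names listed above (exact signatures and return types may differ):

```python
import torch

from sglang.utils.memory_utils import (
    GPUMemoryTracker,
    clear_gpu_memory_cache,
    estimate_model_memory_requirements,
    get_available_gpu_memory,
    get_gpu_memory_utilization,
    track_memory,
)

# Check headroom before loading a model.
print(get_available_gpu_memory(0))        # free memory on GPU 0, in GB
print(get_gpu_memory_utilization(0))      # usage as a percentage

# Rough planning numbers for a hypothetical 7B-parameter model (argument names assumed).
print(estimate_model_memory_requirements(
    num_parameters=7_000_000_000, num_layers=32, num_heads=32, head_dim=128))

# Profile a block of work with the context manager.
with track_memory():
    x = torch.randn(4096, 4096, device="cuda")
    y = x @ x

# Or track explicitly across GPUs.
tracker = GPUMemoryTracker()
tracker.start_tracking()                  # method name assumed
z = torch.randn(8192, 8192, device="cuda")
stats = tracker.stop_tracking()           # e.g. stats[0]["peak_allocated_gb"]

# Release cached blocks when finished.
clear_gpu_memory_cache()
```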