
Fix memory calculation causing --compile to error#756

Merged
ajtejankar merged 3 commits into main from compile-mem-calc
Feb 6, 2025


@ajtejankar
Contributor

The exact cause of the error is not clear to me, but there is a simpler way to calculate the free memory, which this PR implements. Here's what I've understood so far.

  1. When compile is enabled, the code path that estimates CUDA graph memory causes PyTorch to release temporary memory, which increases the amount of free memory available.
  2. This doesn't happen when compile is disabled, which results in less free memory.
  3. My guess is that free memory is underestimated without compile but slightly overestimated with compile.
  4. We need to do two things: align the memory estimation between the two cases and estimate the memory correctly in both.
  5. We can align the memory estimation in both cases by adding a call to empty_cache.
  6. We can make the memory estimation accurate by getting rid of the batch KV cache (setting self.kv_cache = [] before the empty_cache call) and by not adding batch_num_blocks to num_blocks.
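
For readers following along, the steps above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: `estimate_num_blocks`, `free_memory_after_release`, `graph_overhead_bytes`, and `block_bytes` are hypothetical names, and `model` is assumed to be any object with a `kv_cache` list attribute. Only `torch.cuda.empty_cache()` and `torch.cuda.mem_get_info()` are real PyTorch calls.

```python
import gc


def estimate_num_blocks(free_bytes: int, graph_overhead_bytes: int, block_bytes: int) -> int:
    """Count KV cache blocks that fit in free memory after reserving graph overhead.

    Hypothetical helper: the block count comes from free memory alone,
    without adding back the blocks of the warmup batch.
    """
    usable = max(free_bytes - graph_overhead_bytes, 0)
    return usable // block_bytes


def free_memory_after_release(model) -> int:
    """Drop the warmup KV cache, flush the allocator cache, then read free memory.

    `model` is a hypothetical object exposing a `kv_cache` list attribute.
    """
    import torch  # imported lazily so estimate_num_blocks works without a GPU

    model.kv_cache = []        # drop references to the batch KV cache tensors
    gc.collect()               # collect any tensors that are now unreachable
    torch.cuda.empty_cache()   # return cached, unused blocks to the driver
    free_bytes, _total_bytes = torch.cuda.mem_get_info()
    return free_bytes
```

Because the cache is emptied before the measurement, the free-memory reading should not depend on whether an earlier code path happened to release temporary memory, which is the alignment described in step 5.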

@ajtejankar
Contributor Author

There's one more problem that the previous commit 03726e7 didn't fix. The model_graph_wrapper object holds a reference to the kv_cache memory, which prevents torch from freeing it when empty_cache is called. Hence, we use model_graph_wrapper to estimate the CUDA graph memory overhead, delete it, and then re-instantiate it one final time to actually warm up the graphs.
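
The reference problem can be illustrated without a GPU. The sketch below uses plain Python objects as stand-ins: `Buffer` and `GraphWrapper` are hypothetical classes, not the real kv_cache tensors or model_graph_wrapper. It shows that a buffer stays alive as long as a wrapper object still refers to it, and becomes collectable only once the wrapper is deleted.

```python
import gc
import weakref


class Buffer:
    """Stand-in for a KV cache allocation (hypothetical)."""


class GraphWrapper:
    """Stand-in for an object that keeps a reference to the cache (hypothetical)."""

    def __init__(self, kv_cache):
        self.kv_cache = kv_cache


def buffer_alive_after_drop(delete_wrapper: bool) -> bool:
    """Return True if the buffer is still alive after its own name is dropped."""
    buf = Buffer()
    probe = weakref.ref(buf)       # observe the buffer without keeping it alive
    wrapper = GraphWrapper([buf])  # the wrapper now also references the buffer
    del buf                        # analogous to resetting self.kv_cache = []
    if delete_wrapper:
        del wrapper                # release the last remaining reference
    gc.collect()
    return probe() is not None
```

The same reasoning applies to tensors: as long as any live object references them, the allocator cannot treat their memory as unused, so an empty_cache call has nothing to return.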

Contributor

@tgaddair tgaddair left a comment


Nice, added a couple comments.

Contributor

@tgaddair tgaddair left a comment


LGTM!

@ajtejankar ajtejankar merged commit acaa217 into main Feb 6, 2025
1 check passed
@ajtejankar ajtejankar deleted the compile-mem-calc branch February 6, 2025 19:39