perf: make MSE observer compatible with torch.compile #2384
Bias92 wants to merge 1 commit into vllm-project:main
Conversation
Summary of Changes (Gemini Code Assist): This pull request significantly enhances the performance of the Mean Squared Error (MSE) observer by making it fully compatible with torch.compile.
Activity
Code Review
This pull request significantly improves the performance of the MSE observer by making it compatible with torch.compile. The changes, which include removing the patch_attr context manager and replacing data-dependent control flow with torch.where, are well-executed and result in a remarkable 39x speedup. My review focuses on ensuring the documentation is updated to reflect these changes. Specifically, the docstrings for MemorylessMSEObserver and MovingAverageMSEObserver should be updated to indicate that the patience parameter is no longer used, as early stopping has been removed.
Force-pushed: 7099621 → 3c2ac54
👋 Hi! Thank you for contributing to llm-compressor. Please add the ready label when the PR is ready for review. Note: this label is required to run the full testing suite; please add it only once the PR is code complete and local testing has been performed.
Thanks for your contribution!

comments

High level, I'm not sure we can land with the strategy of removing early stopping. In my tests I remember early stopping preventing us from searching ~90% of the search space, so in practice I think this will result in a ~10x slowdown for the non-compiled performance. Given that, your numbers here are a bit off because you're comparing PR to PR, whereas the number that would actually be useful is main to PR. If my experience is correct, you've effectively slowed your baseline by 10x, so the 35x speedup is really more like 3.5x over what it was before this PR. That's still great, but it also ignores the compile overhead, so we'll need to make sure this is actually a positive change with some e2e testing.

I'm also a bit unsure about the end game for this improvement, since it just makes the observer compilable but doesn't actually compile it. We'd want this to be something a user could disable if they run into trouble as well. A good comparison is the compile optimizations for GPTQ PR, as far as how they implement the compilation and the flag.

Also, what benchmark are you running? Is this an actual example with real data, or a toy example? I'm not sure what "Benchmark (CPU, inductor, shape=(1,1,4096))" means in practice or how I could run it myself.

next steps

I think we most likely want both the compiled and non-compiled paths to be reasonable, and by removing early stopping we're basically destroying the non-compiled path. Given that, I think there are 4 strategies we should look at for how compilation could work here:

A - bring back early stopping, compile the inner loop. Walk back the removal of early stopping. Instead, extract a compilable inner loop and exclude the data-dependent part from compilation.

B - bring back early stopping, hide it behind a flag so the whole thing can be compiled. Do early stopping if it's not compiled, but skip it if it is. I think you can leave the data-dependent control flow but hide it behind a flag; if I remember correctly, compile will not see the data-dependent flow unless the flag changes. This may require some finagling to get compile to not trace into the data-dependent control flow.

C - chunked early stopping. Partially remove early stopping: chunk the grid search into a few chunks, and only do a data-dependent exit after calculating the MSE for a whole chunk. This has the potential to actually be the fastest in terms of compile performance, since it gets you early stopping while still compiling a larger chunk in one go.

D - leave as is with no early stopping, but do a ton of testing to make sure it's bulletproof, so we don't end up disabling compilation by default after getting a bunch of issues (e.g. GPTQ not working) and really terrible performance.

I would probably do A by default, but could see doing B or C depending on benchmark comparisons between them. These steps will be required regardless.
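Strategy C could look roughly like the following sketch. All names here are hypothetical; the chunk-error function would be the torch.compile(dynamic=True) target, while the data-dependent exit stays in the eager outer loop:

```python
import torch

def _chunk_errors(x: torch.Tensor, scales: torch.Tensor) -> torch.Tensor:
    # Intended compile target: shape-stable int8 fake-quantization error
    # for each candidate scale in the chunk. No data-dependent control flow.
    q = torch.clamp(torch.round(x / scales[:, None]), -128, 127) * scales[:, None]
    return ((q - x) ** 2).mean(dim=1)

def grid_search_chunked(x, candidate_scales, chunk_size=8, patience=2):
    best_err, best_scale, bad_chunks = float("inf"), None, 0
    for start in range(0, len(candidate_scales), chunk_size):
        errs = _chunk_errors(x, candidate_scales[start:start + chunk_size])
        chunk_min, idx = errs.min(dim=0)
        if chunk_min.item() < best_err:      # data-dependent logic stays eager,
            best_err = chunk_min.item()      # outside the compiled region
            best_scale = candidate_scales[start + idx].item()
            bad_chunks = 0
        else:
            bad_chunks += 1
            if bad_chunks >= patience:       # chunk-level early stop
                break
    return best_scale, best_err
```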
The quality checks have failed. Please run the quality checks locally and push the fixes.
Thanks for the detailed feedback! I agree that preserving the non-compiled path performance is important. I'll look into the GPTQ compile PR you referenced and implement strategy A (bring back early stopping, compile the inner loop) as a starting point. I'll also add the enable_observer_compilation flag and run e2e benchmarks with real data. Will update the PR soon.
Hi @Bias92, could you please post your original baseline performance metrics and the script you're using to benchmark? I don't fully understand where your speedups are coming from, as you do not actually add a torch.compile decorator anywhere?
Hi @kylesayrs, good point. I'll share the benchmark script and original baseline metrics. The speedups come from externally wrapping the function with torch.compile(fullgraph=True), which wasn't included in the PR itself. As @HDCharles suggested, I'll add proper compilation functionality with an enable_observer_compilation flag in the next update.
Updated the implementation following the GPTQ PR #2320 pattern:
Non-compiled path performance is now identical to the original baseline. Will work on e2e benchmarks and lint fixes next.
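The flag pattern borrowed from the GPTQ compile PR amounts to gating the compiled path behind an opt-in flag, so the eager baseline is untouched when the flag is off. A hypothetical sketch (names and structure are assumptions, not the PR's actual code):

```python
import torch

def _observer_kernel(x: torch.Tensor) -> torch.Tensor:
    # Stand-in for the observer's shape-stable inner computation.
    return (x * x).mean()

class CompileFlagSketch:
    """Opt-in compilation: compile once at construction, only if requested."""

    def __init__(self, enable_observer_compile: bool = False):
        if enable_observer_compile:
            # dynamic=True avoids recompiles across varying tensor shapes
            self._kernel = torch.compile(_observer_kernel, dynamic=True)
        else:
            self._kernel = _observer_kernel  # original eager baseline

    def __call__(self, x: torch.Tensor) -> torch.Tensor:
        return self._kernel(x)
```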
E2E Benchmark Results (RTX 4060 Ti, CUDA): TinyLlama-1.1B, FP8 W8A8, MSE observer, 64 calibration samples:
Calibration-only speedup: ~1.15x (5.0s → 4.35s). E2E speedup is 1.09x because model loading + tokenization (~21s) dominates total time. Larger models where calibration is a bigger fraction of total time should see greater benefit. Key findings:
Benchmark script:
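The benchmark script itself wasn't posted; a minimal harness for this kind of calibration micro-benchmark might look like the following (illustrative only, not the author's actual script):

```python
import time
import torch

def bench(fn, x, iters=100, warmup=10):
    """Average per-call wall time; syncs around the timed region on CUDA."""
    for _ in range(warmup):
        fn(x)
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    t0 = time.perf_counter()
    for _ in range(iters):
        fn(x)
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    return (time.perf_counter() - t0) / iters
```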
Force-pushed: f7448d2 → 6a415e6
Addressed all review feedback:
All 53 tests passing locally (0 failures, 4 skipped).
The quality checks have failed. Please run the quality checks locally and push the fixes.
Can you fix the quality issues?
Done; ran the quality checks locally.
Hi @HDCharles, friendly ping. I've addressed all the feedback from your review: restored the original _grid_search_mse with early stopping (non-compiled baseline unchanged). Would you be able to re-review when you have a moment? Also, CI is pending workflow approval; it would be great if you could approve that as well. Thanks!
Hi @HDCharles, @dsikka, @kylesayrs, friendly ping! Restored the original _grid_search_mse with early stopping (non-compiled path unchanged). The only remaining blocker is workflow approval for CI; once that's approved the checks can actually run. Would appreciate a re-review when you get a chance!
Hey, sorry about the wait. Reach out to me on vLLM Slack if I don't respond in ~24 hours.
HDCharles left a comment
Why is findstr in the PR?
def compare_scales(
I don't think this is needed; it's unit tested in test_mse.
This isn't torch.compiling the function?
Can you reach out to me on vLLM Slack? There are some additional things to discuss.
Force-pushed: 0d1c168 → 243c720
Force-pushed: a6bc3fa → 36c30ee
This pull request has merge conflicts that must be resolved before it can be merged.
compile inner _compute_candidate_error via torch.compile(dynamic=True). Early stopping preserved in outer loop. Compile flag added as oneshot arg.

Requires: vllm-project/compressed-tensors#627
Related: pytorch/pytorch#177131

Signed-off-by: Jaewoo Kim <pewpewplay315@gmail.com>
Force-pushed: 9d33973 → bf63a4c
Make the MSE observer inner loop compatible with torch.compile by extracting _compute_candidate_error as a standalone function compiled with torch.compile(dynamic=True). Early stopping is preserved in the outer loop.
The compile flag is exposed as a oneshot argument (enable_observer_compile).
E2E benchmark (TinyLlama-1.1B, INT8 W8A8, MSE observer, 64 cal samples, RTX 4060 Ti):
Requires: vllm-project/compressed-tensors#627
Related: pytorch/pytorch#177131
Partial fix for #1485
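Based on the description above, the landed design presumably resembles this sketch: a shape-stable _compute_candidate_error that can be handed to torch.compile(dynamic=True), with the data-dependent early stop kept in the eager outer loop. The code is illustrative only, not the PR's actual implementation:

```python
import torch

def _compute_candidate_error(x, scale):
    # Shape-stable inner computation: int8 fake-quantize at this scale and
    # measure the MSE. In the compiled path, this function would be wrapped
    # with torch.compile(dynamic=True).
    q = torch.clamp(torch.round(x / scale), -128, 127) * scale
    return ((q - x) ** 2).mean()

def grid_search_mse(x, scales, patience=5, error_fn=None):
    # error_fn may be torch.compile(_compute_candidate_error, dynamic=True)
    error_fn = error_fn or _compute_candidate_error
    best_err, best_scale, misses = float("inf"), None, 0
    for scale in scales:
        err = error_fn(x, scale).item()
        if err < best_err:
            best_err, best_scale, misses = err, scale, 0
        else:
            misses += 1
            if misses >= patience:  # early stopping preserved in outer loop
                break
    return best_scale, best_err
```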