
Feature/intermediates cache prefetch#2392

Open
GOavi101 wants to merge 4 commits into vllm-project:main from GOavi101:feature/intermediates-cache-prefetch

Conversation

@GOavi101
Contributor

@GOavi101 GOavi101 commented Feb 22, 2026

Adds optional prefetch to the intermediates cache and wires it into AWQ when offloading.
IntermediatesCache
The new method iter_prefetch() iterates over batches like iter(), but prefetches the next batch in a background thread so onload from the offload device overlaps with consumption of the current batch, reducing wall-clock time when offloading to CPU.
AWQ
When offload_device is set, _run_samples() uses cache.iter_prefetch() instead of the plain cache iterator, so the CPU→device onload overlaps with the forward pass over cached parent args during smoothing.
Tests
Two tests were added: one verifying that iter_prefetch() yields the same batches as iter(), and one verifying that prefetch on an empty cache yields nothing. No new public API; prefetch is used automatically when AWQ offloads.

Fix: #2374
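The pattern behind iter_prefetch() can be sketched generically (a minimal illustration; prefetch_iter and onload are hypothetical names standing in for the cache iterator and the per-batch device transfer, not the actual llm-compressor API):

```python
from concurrent.futures import ThreadPoolExecutor


def prefetch_iter(batches, onload):
    """Yield onload(batch) for each batch, running onload for the *next*
    batch in a background thread while the caller consumes the current one."""
    it = iter(batches)
    with ThreadPoolExecutor(max_workers=1) as pool:
        try:
            future = pool.submit(onload, next(it))
        except StopIteration:
            return  # empty cache: yield nothing
        for nxt in it:
            current = future.result()          # wait for the in-flight onload
            future = pool.submit(onload, nxt)  # start loading the next batch
            yield current                      # caller overlaps with background load
        yield future.result()
```

With onload performing e.g. a per-batch tensor.to(device) call, the transfer for batch i+1 runs in the background thread while the caller's forward pass consumes batch i.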

…loading

- IntermediatesCache.iter_prefetch() overlaps onload of next batch with
  consumption of current batch via a background thread
- AWQ _run_samples uses iter_prefetch when offload_device is set to
  overlap CPU->device transfer with module forward passes
- Add test_iter_prefetch_matches_iter to verify prefetch yields same results as iter

Signed-off-by: Avishek Goswami <avishek.goswami@ibm.com>
@github-actions

👋 Hi! Thank you for contributing to llm-compressor. Please add the ready label when the PR is ready for review.

Note: This is required to complete the testing suite, please only add the label once the PR is code complete and local testing has been performed.

@gemini-code-assist
Contributor

Summary of Changes

Hello @GOavi101, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request enhances the performance of AWQ (Activation-aware Weight Quantization) offloading by implementing a prefetching mechanism for the intermediates cache. By loading the next batch of data in a background thread while the current batch is being processed, the changes aim to overlap data transfer with computation, thereby reducing overall execution time, especially when offloading to the CPU.

Highlights

  • IntermediatesCache Prefetching: Introduced an optional prefetching mechanism in the IntermediatesCache via a new iter_prefetch() method. This method iterates over batches while asynchronously prefetching the next batch in a background thread, aiming to reduce wall-clock time during offloading.
  • AWQ Offloading Integration: Integrated the new iter_prefetch() method into the AWQ _run_samples() function. When an offload_device is specified, AWQ now utilizes prefetching to overlap CPU-to-device data loading with the forward pass over cached parent arguments, improving performance during smoothing.
  • New Tests for Prefetching: Added two new unit tests to validate the iter_prefetch() functionality: one to ensure it yields nothing for an empty cache and another to confirm it produces the same batch contents as the standard iter() method.


Changelog
  • src/llmcompressor/modifiers/awq/base.py
    • Modified _run_samples to conditionally use cache.iter_prefetch() when self.offload_device is set, otherwise defaults to the standard cache iterator.
  • src/llmcompressor/pipelines/cache.py
    • Imported ThreadPoolExecutor for background task management.
    • Added iter_prefetch method to IntermediatesCache to enable asynchronous prefetching of batches.
  • tests/llmcompressor/pipelines/test_cache.py
    • Added test_iter_prefetch_empty_cache to verify behavior with an empty cache.
    • Added test_iter_prefetch_matches_iter to confirm functional equivalence with the standard iterator.
    • Relocated the deep_equal helper function to support the new prefetching tests.
Activity
  • No human activity has occurred on this pull request yet.

Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces an optional prefetching mechanism to the IntermediatesCache to optimize performance by overlapping CPU-to-device onload with the forward pass, particularly beneficial for AWQ when offloading to CPU. The new iter_prefetch() method in IntermediatesCache uses a background thread to prefetch the next batch, and the _run_samples() method in AWQModifier has been updated to leverage this feature conditionally. New unit tests have been added to ensure the correctness of the iter_prefetch() method, covering both empty cache scenarios and ensuring consistency with the regular iter() method. The changes are well-implemented and directly address the stated goal of reducing wall-clock time.

module(**batch_kwargs) for batch_kwargs in self._parent_args_cache[module]
]
cache = self._parent_args_cache[module]
# When offloading, prefetch overlaps CPU->device onload with forward pass.
Contributor


medium

The comment on this line exceeds the recommended line length of 79 characters, as per PEP 8. Please consider rephrasing or splitting it into multiple lines for better readability.

Suggested change
# When offloading, prefetch overlaps CPU->device onload with forward pass.
# Prefetching overlaps CPU->device onload with forward pass when offloading.

@HDCharles
Collaborator

HDCharles commented Feb 28, 2026

Looks good. We probably want the same gating as the other PR, and probably want to unify the two to use the same util and make the tests specific to the shared functionality.

Collaborator

@HDCharles HDCharles left a comment


see comment

@mergify
Contributor

mergify bot commented Feb 28, 2026

The quality checks have failed. Please run make style and make quality under
the root directory to address the lint failures. You will need to install the
dev optional install to get the required linting packages:
https://github.com/vllm-project/llm-compressor/blob/main/CONTRIBUTING.md

…loading

- IntermediatesCache.iter_prefetch() overlaps onload of next batch with
  consumption of current batch via a background thread
- AWQ _run_samples uses iter_prefetch when offload_device is set to
  overlap CPU->device transfer with module forward passes
- Add test_iter_prefetch_matches_iter to verify prefetch yields same results as iter

Signed-off-by: Avishek Goswami <avishek.goswami@ibm.com>
@GOavi101 GOavi101 force-pushed the feature/intermediates-cache-prefetch branch from 457af3e to 9128c2b Compare March 5, 2026 18:06
@GOavi101 GOavi101 requested a review from HDCharles March 5, 2026 18:06
@mergify mergify bot removed the quality-failed label Mar 5, 2026
assert batch_dicts_equal(b_iter, b_prefetch), f"batch {i} differs"


def deep_equal(a, b) -> bool:
Collaborator

@HDCharles HDCharles Mar 5, 2026


why is this moved? better to leave it where it was

Contributor Author


sure

Collaborator


This is still moved and adds a bunch of line changes that will alter the blame and history for no reason

@GOavi101 GOavi101 force-pushed the feature/intermediates-cache-prefetch branch from e125440 to 9128c2b Compare March 10, 2026 09:26
@mergify
Contributor

mergify bot commented Mar 10, 2026

The quality checks have failed. Please run make style and make quality under
the root directory to address the lint failures. You will need to install the
dev optional install to get the required linting packages:
https://github.com/vllm-project/llm-compressor/blob/main/CONTRIBUTING.md

…tate

Signed-off-by: Avishek Goswami <avishek.goswami@ibm.com>
@GOavi101 GOavi101 force-pushed the feature/intermediates-cache-prefetch branch from fe3d424 to 595fd90 Compare March 10, 2026 09:27
@mergify mergify bot removed the quality-failed label Mar 10, 2026
@GOavi101 GOavi101 requested a review from HDCharles March 10, 2026 09:27
@HDCharles HDCharles added the ready When a PR is ready for review label Mar 10, 2026
]


def deep_equal(a, b) -> bool:
Collaborator


once you get rid of these line changes, we'll be good to land

Collaborator

@HDCharles HDCharles left a comment


need to still fix the deep equal thing, otherwise this looks good!

Collaborator

@kylesayrs kylesayrs left a comment


The speedup from these changes needs to be tested before this can be merged. There are a lot of ways in which this can look good but due to implementation details not achieve the desired outcome.

I'm also not 100% sure I understand the theory here. According to the torch.Tensor.to documentation:

In general, the transfer is blocking on the device side (even if it isn’t on the host side): the copy on the device cannot occur while another operation is being executed

https://docs.pytorch.org/tutorials/intermediate/pinmem_nonblock.html#asynchronous-vs-synchronous-operations-with-non-blocking-true-cuda-cudamemcpyasync

Collaborator

@brian-dellabetta brian-dellabetta left a comment


Looks great! Thanks for adding this. I have one nit on variable naming convention, otherwise LGTM

else:
session.state.loss_masks = None

use_prefetch = getattr(dataset_args, "sequential_prefetch", False)
Collaborator


nit -- I prefer we use a single config var name for this for consistency

Collaborator

@kylesayrs kylesayrs left a comment


I think this just needs validation before landing.

We should also consider

  1. Having a dedicated "prefetch" thread
  2. Prefetching multiple samples (lookahead 5)
  3. Whether this can be accomplished using torch.Tensor.to(non_blocking=True)
  4. Whether a similar technique can be done when storing activations (similar to non_blocking=True)
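Suggestions 1 and 2 could be combined in a dedicated prefetch thread feeding a bounded queue (a hypothetical sketch, not the PR's implementation; prefetch_iter_n, onload, and lookahead are illustrative names):

```python
import queue
import threading


def prefetch_iter_n(batches, onload, lookahead=5):
    """Like a plain loop over onload(batch), but a dedicated thread keeps
    up to `lookahead` onloaded batches queued ahead of the consumer."""
    q = queue.Queue(maxsize=lookahead)
    _END = object()  # sentinel marking the end of the stream

    def worker():
        # Error handling omitted for brevity: a production version should
        # also forward worker exceptions to the consumer.
        for batch in batches:
            q.put(onload(batch))  # blocks once `lookahead` batches are queued
        q.put(_END)

    t = threading.Thread(target=worker, daemon=True)
    t.start()
    while (item := q.get()) is not _END:
        yield item
    t.join()
```

The bounded queue caps how much onloaded data sits on the device at once, which matters when the point of offloading was to save device memory in the first place.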


Labels

ready When a PR is ready for review

Projects

None yet

Development

Successfully merging this pull request may close these issues.

add prefetch to intermediates cache

4 participants