
Feature/intermediates cache prefetch#2392

Open
GOavi101 wants to merge 4 commits into vllm-project:main from GOavi101:feature/intermediates-cache-prefetch

Conversation

@GOavi101
Contributor

@GOavi101 GOavi101 commented Feb 22, 2026

Adds optional prefetch to the intermediates cache and wires it into AWQ when offloading.
IntermediatesCache
The new method iter_prefetch() iterates over batches like iter(), but prefetches the next batch in a background thread so onload from the offload device overlaps with consumption of the current batch, reducing wall-clock time when offloading to CPU.
AWQ
When offload_device is set, _run_samples() uses cache.iter_prefetch() instead of the plain cache iterator, so the CPU→device onload overlaps with the forward pass over cached parent args during smoothing.
Tests
Two tests were added: one verifying that iter_prefetch() yields the same batches as iter(), and one verifying that prefetch on an empty cache yields nothing. No new public API; prefetch is used automatically when AWQ offloads.

Fix: #2374
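The pattern behind iter_prefetch() can be sketched generically (a minimal illustration; prefetch_iter and onload are hypothetical names standing in for the cache iterator and the per-batch device transfer, not the actual llm-compressor API):

```python
from concurrent.futures import ThreadPoolExecutor


def prefetch_iter(batches, onload):
    """Yield onload(batch) for each batch, running onload for the *next*
    batch in a background thread while the caller consumes the current one."""
    it = iter(batches)
    with ThreadPoolExecutor(max_workers=1) as pool:
        try:
            future = pool.submit(onload, next(it))
        except StopIteration:
            return  # empty cache: yield nothing
        for nxt in it:
            current = future.result()          # wait for the in-flight onload
            future = pool.submit(onload, nxt)  # start loading the next batch
            yield current                      # caller overlaps with background load
        yield future.result()
```

With onload performing e.g. a per-batch tensor.to(device) call, the transfer for batch i+1 runs in the background thread while the caller's forward pass consumes batch i.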

…loading

- IntermediatesCache.iter_prefetch() overlaps onload of next batch with
  consumption of current batch via a background thread
- AWQ _run_samples uses iter_prefetch when offload_device is set to
  overlap CPU->device transfer with module forward passes
- Add test_iter_prefetch_matches_iter to verify prefetch yields same results as iter

Signed-off-by: Avishek Goswami <avishek.goswami@ibm.com>
@github-actions

👋 Hi! Thank you for contributing to llm-compressor. Please add the ready label when the PR is ready for review.

Note: This is required to complete the testing suite, please only add the label once the PR is code complete and local testing has been performed.

@gemini-code-assist
Contributor

Summary of Changes

Hello @GOavi101, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request enhances the performance of AWQ (Activation-aware Weight Quantization) offloading by implementing a prefetching mechanism for the intermediates cache. By loading the next batch of data in a background thread while the current batch is being processed, the changes aim to overlap data transfer with computation, thereby reducing overall execution time, especially when offloading to the CPU.

Highlights

  • IntermediatesCache Prefetching: Introduced an optional prefetching mechanism in the IntermediatesCache via a new iter_prefetch() method. This method iterates over batches while asynchronously prefetching the next batch in a background thread, aiming to reduce wall-clock time during offloading.
  • AWQ Offloading Integration: Integrated the new iter_prefetch() method into the AWQ _run_samples() function. When an offload_device is specified, AWQ now utilizes prefetching to overlap CPU-to-device data loading with the forward pass over cached parent arguments, improving performance during smoothing.
  • New Tests for Prefetching: Added two new unit tests to validate the iter_prefetch() functionality: one to ensure it yields nothing for an empty cache and another to confirm it produces the same batch contents as the standard iter() method.


Changelog
  • src/llmcompressor/modifiers/awq/base.py
    • Modified _run_samples to conditionally use cache.iter_prefetch() when self.offload_device is set, otherwise defaults to the standard cache iterator.
  • src/llmcompressor/pipelines/cache.py
    • Imported ThreadPoolExecutor for background task management.
    • Added iter_prefetch method to IntermediatesCache to enable asynchronous prefetching of batches.
  • tests/llmcompressor/pipelines/test_cache.py
    • Added test_iter_prefetch_empty_cache to verify behavior with an empty cache.
    • Added test_iter_prefetch_matches_iter to confirm functional equivalence with the standard iterator.
    • Relocated the deep_equal helper function to support the new prefetching tests.
Activity
  • No human activity has occurred on this pull request yet.

Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces an optional prefetching mechanism to the IntermediatesCache to optimize performance by overlapping CPU-to-device onload with the forward pass, particularly beneficial for AWQ when offloading to CPU. The new iter_prefetch() method in IntermediatesCache uses a background thread to prefetch the next batch, and the _run_samples() method in AWQModifier has been updated to leverage this feature conditionally. New unit tests have been added to ensure the correctness of the iter_prefetch() method, covering both empty cache scenarios and ensuring consistency with the regular iter() method. The changes are well-implemented and directly address the stated goal of reducing wall-clock time.

module(**batch_kwargs) for batch_kwargs in self._parent_args_cache[module]
]
cache = self._parent_args_cache[module]
# When offloading, prefetch overlaps CPU->device onload with forward pass.
Contributor


medium

The comment on this line exceeds the recommended line length of 79 characters, as per PEP 8. Please consider rephrasing or splitting it into multiple lines for better readability.

Suggested change
# When offloading, prefetch overlaps CPU->device onload with forward pass.
# Prefetching overlaps CPU->device onload with forward pass when offloading.

@HDCharles
Collaborator

HDCharles commented Feb 28, 2026

Looks good. We probably want the same gating as the other PR, and probably want to unify the two to use the same util and make the tests specific to the shared functionality.

Collaborator

@HDCharles HDCharles left a comment


see comment

@mergify
Contributor

mergify bot commented Feb 28, 2026

The quality checks have failed. Please run make style and make quality under
the root directory to address the lint failures. You will need to install the
dev optional install to get the required linting packages:
https://github.com/vllm-project/llm-compressor/blob/main/CONTRIBUTING.md

…loading

- IntermediatesCache.iter_prefetch() overlaps onload of next batch with
  consumption of current batch via a background thread
- AWQ _run_samples uses iter_prefetch when offload_device is set to
  overlap CPU->device transfer with module forward passes
- Add test_iter_prefetch_matches_iter to verify prefetch yields same results as iter

Signed-off-by: Avishek Goswami <avishek.goswami@ibm.com>
@GOavi101 GOavi101 force-pushed the feature/intermediates-cache-prefetch branch from 457af3e to 9128c2b Compare March 5, 2026 18:06
@GOavi101 GOavi101 requested a review from HDCharles March 5, 2026 18:06
@mergify mergify bot removed the quality-failed label Mar 5, 2026
assert batch_dicts_equal(b_iter, b_prefetch), f"batch {i} differs"


def deep_equal(a, b) -> bool:
Collaborator

@HDCharles HDCharles Mar 5, 2026


why is this moved? better to leave it where it was

Contributor Author


sure

Collaborator


This is still moved and adds a bunch of line changes that will alter the blame and history for no reason

@GOavi101 GOavi101 force-pushed the feature/intermediates-cache-prefetch branch from e125440 to 9128c2b Compare March 10, 2026 09:26
@mergify
Contributor

mergify bot commented Mar 10, 2026

The quality checks have failed. Please run make style and make quality under
the root directory to address the lint failures. You will need to install the
dev optional install to get the required linting packages:
https://github.com/vllm-project/llm-compressor/blob/main/CONTRIBUTING.md

…tate

Signed-off-by: Avishek Goswami <avishek.goswami@ibm.com>
@GOavi101 GOavi101 force-pushed the feature/intermediates-cache-prefetch branch from fe3d424 to 595fd90 Compare March 10, 2026 09:27
@mergify mergify bot removed the quality-failed label Mar 10, 2026
@GOavi101 GOavi101 requested a review from HDCharles March 10, 2026 09:27
@HDCharles HDCharles added the ready When a PR is ready for review label Mar 10, 2026
]


def deep_equal(a, b) -> bool:
Collaborator


once you get rid of these line changes, we'll be good to land

Collaborator

@HDCharles HDCharles left a comment


need to still fix the deep equal thing, otherwise this looks good!

Collaborator

@kylesayrs kylesayrs left a comment


The speedup from these changes needs to be tested before this can be merged. There are a lot of ways in which this can look good but due to implementation details not achieve the desired outcome.

I'm also not 100% sure I understand the theory here. According to the torch.Tensor.to documentation:

In general, the transfer is blocking on the device side (even if it isn’t on the host side): the copy on the device cannot occur while another operation is being executed

https://docs.pytorch.org/tutorials/intermediate/pinmem_nonblock.html#asynchronous-vs-synchronous-operations-with-non-blocking-true-cuda-cudamemcpyasync

Collaborator

@brian-dellabetta brian-dellabetta left a comment


Looks great! Thanks for adding this. I have one nit on variable naming convention, otherwise LGTM

else:
session.state.loss_masks = None

use_prefetch = getattr(dataset_args, "sequential_prefetch", False)
Collaborator


nit -- I prefer we use a single config var name for this for consistency

Collaborator

@kylesayrs kylesayrs left a comment


I think this just needs validation before landing.

We should also consider

  1. Having a dedicated "prefetch" thread
  2. Prefetching multiple samples (lookahead 5)
  3. Whether this can be accomplished using torch.Tensor.to(non_blocking=True)
  4. Whether a similar technique can be done when storing activations (similar to non_blocking=True)
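Suggestions 1 and 2 could be combined in a dedicated prefetch thread feeding a bounded queue (a hypothetical sketch, not the PR's implementation; prefetch_iter_n, onload, and lookahead are illustrative names):

```python
import queue
import threading


def prefetch_iter_n(batches, onload, lookahead=5):
    """Like a plain loop over onload(batch), but a dedicated thread keeps
    up to `lookahead` onloaded batches queued ahead of the consumer."""
    q = queue.Queue(maxsize=lookahead)
    _END = object()  # sentinel marking the end of the stream

    def worker():
        # Error handling omitted for brevity: a production version should
        # also forward worker exceptions to the consumer.
        for batch in batches:
            q.put(onload(batch))  # blocks once `lookahead` batches are queued
        q.put(_END)

    t = threading.Thread(target=worker, daemon=True)
    t.start()
    while (item := q.get()) is not _END:
        yield item
    t.join()
```

The bounded queue caps how much onloaded data sits on the device at once, which matters when the point of offloading was to save device memory in the first place.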


Labels

ready When a PR is ready for review

Projects

None yet

Development

Successfully merging this pull request may close these issues.

add prefetch to intermediates cache

4 participants