Rolling window approach #278

Rudra150304 · 2026-04-19T05:46:06Z

I was reading AirLLM’s layer streaming approach and had a question:

For cases where the model only moderately exceeds VRAM capacity, has anyone explored using remaining free VRAM as a rolling cache for future layers?

Example: reserve memory for KV cache/activations first, then keep as many upcoming layers resident as possible and slide the window forward during execution.
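To make the example concrete, here is a minimal sketch of the bookkeeping, not AirLLM code. It assumes all layers are the same size and that the KV cache/activation reserve is known up front; `plan_window` and `rolling_schedule` are hypothetical names used only for illustration.

```python
from collections import deque


def plan_window(total_vram, reserved, layer_size):
    """How many layers fit in VRAM once the KV cache/activation reserve is set aside.

    All three arguments use the same unit (e.g. bytes). Assumes uniform layer size.
    """
    free = total_vram - reserved
    if free < layer_size:
        raise ValueError("not enough free VRAM to hold even one layer")
    return free // layer_size


def rolling_schedule(num_layers, window):
    """Yield (layer_to_run, resident_layers_after_step) for one forward pass.

    The first `window` layers are assumed to be resident before the pass starts.
    After a layer runs, its slot is released and the next unloaded layer takes it.
    """
    window = min(window, num_layers)
    resident = deque(range(window))
    next_to_load = window
    for layer in range(num_layers):
        resident.popleft()  # the layer that just ran is evicted
        if next_to_load < num_layers:
            resident.append(next_to_load)  # freed slot goes to the next layer
            next_to_load += 1
        yield layer, list(resident)


if __name__ == "__main__":
    w = plan_window(total_vram=24, reserved=8, layer_size=3)
    for layer, resident in rolling_schedule(num_layers=8, window=w):
        print(f"ran layer {layer}, resident afterwards: {resident}")
```

This only covers a single forward pass. For token-by-token decoding the window would have to wrap back around to layer 0, which the sketch does not model.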

Intuition: this may reduce transfer stalls compared to minimal layer residency, especially for near-fit models.
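A toy timing model of that intuition, under assumptions that are mine rather than anything measured: one copy stream that loads layers in order, a fixed compute time and transfer time per layer, the first `window` layers already resident, and a slot that frees as soon as its layer finishes. `simulate_stall` is a hypothetical helper, not an AirLLM function.

```python
def simulate_stall(num_layers, window, compute_s, transfer_s):
    """Total time compute spends waiting on weight transfers in one forward pass."""
    loaded_at = [0.0] * num_layers  # when each layer's weights are ready in VRAM
    done_at = [0.0] * num_layers    # when each layer finishes computing
    copy_free = 0.0                 # when the copy stream is next idle
    clock = 0.0                     # when the compute stream is next idle
    stall = 0.0
    for i in range(num_layers):
        if i >= window:
            # The transfer needs a free slot (layer i - window has finished)
            # and an idle copy stream.
            start = max(copy_free, done_at[i - window])
            loaded_at[i] = start + transfer_s
            copy_free = loaded_at[i]
        start_compute = max(clock, loaded_at[i])
        stall += start_compute - clock
        clock = start_compute + compute_s
        done_at[i] = clock
    return stall


if __name__ == "__main__":
    for w in (1, 2, 4):
        s = simulate_stall(num_layers=4, window=w, compute_s=1.0, transfer_s=3.0)
        print(f"window={w}: stall={s}s")
```

In this model a larger window lets transfers overlap with compute on the already-resident layers, so the stall shrinks as the window grows. It ignores allocator behaviour and repeated passes, so it says nothing about whether the effect survives in practice.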

Curious if this was tested already or if allocator / bandwidth constraints make it less useful in practice.

Replies: 0 comments