Rolling window approach #278
Rudra150304
started this conversation in
Ideas
Replies: 0 comments
I was reading AirLLM’s layer streaming approach and had a question:
For cases where the model only moderately exceeds VRAM capacity, has anyone explored using remaining free VRAM as a rolling cache for future layers?
Example: reserve memory for the KV cache and activations first, then keep as many upcoming layers resident as fit in the remaining VRAM, sliding the window forward as each layer finishes executing.
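To make the idea concrete, here is a rough sketch of the bookkeeping I have in mind. It is plain Python with no GPU calls. The names `window_size` and `rolling_schedule` are hypothetical and not part of AirLLM, and it assumes every layer has the same size, which real models only approximate.

```python
from collections import deque

GIB = 1024 ** 3


def window_size(total_vram, reserved, layer_bytes):
    """Whole layers that fit after reserving KV cache/activation memory.

    Assumption: all layers are the same size (layer_bytes).
    """
    free = total_vram - reserved
    if free < layer_bytes:
        raise ValueError("not enough free VRAM for even one layer")
    return free // layer_bytes


def rolling_schedule(num_layers, window):
    """Yield (layer_to_run, resident_layers) for one forward pass.

    The window starts with the first `window` layers resident. After a
    layer runs it is evicted and the next not-yet-loaded layer takes
    its slot, so the window slides forward by one.
    """
    window = min(window, num_layers)
    resident = deque(range(window))
    next_to_load = window
    for layer in range(num_layers):
        yield layer, list(resident)
        resident.popleft()  # evict the layer that just ran
        if next_to_load < num_layers:
            resident.append(next_to_load)  # prefetch into the freed slot
            next_to_load += 1


if __name__ == "__main__":
    # Hypothetical numbers: 24 GiB card, 4 GiB reserved, 1 GiB per layer.
    w = window_size(24 * GIB, 4 * GIB, 1 * GIB)
    print("window:", w)
    for layer, resident in rolling_schedule(num_layers=5, window=2):
        print(layer, resident)
```

In a real implementation the eviction and prefetch would be asynchronous host-to-device copies; here they are only list operations that show the order in which layers enter and leave the window.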
Intuition: this may reduce transfer stalls compared to minimal layer residency, especially for near-fit models.
Curious if this was tested already or if allocator / bandwidth constraints make it less useful in practice.
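For anyone who wants to poke at the bandwidth question, below is a toy timing model, not a benchmark. It assumes uniform per-layer load and compute times, a single copy stream that overlaps with compute, a window that is already filled before timing starts, and one forward pass. It ignores allocator fragmentation entirely. `simulate_stall` is a hypothetical helper, and the conclusions may not carry over to a cold start or to repeated passes during token generation.

```python
def simulate_stall(num_layers, window, t_load, t_compute):
    """Total time compute spends waiting on layer transfers.

    Assumptions (toy model): the first `window` layers are resident at
    t=0, one copy stream loads layers in order, and a slot becomes free
    only when the layer occupying it has finished computing.
    """
    window = min(window, num_layers)
    load_end = [0.0] * num_layers     # preloaded layers are ready at t=0
    compute_end = [0.0] * num_layers
    copy_free = 0.0                   # when the copy stream is next idle
    prev_end = 0.0                    # when the previous layer finished
    stall = 0.0
    for i in range(num_layers):
        if i >= window:
            # Layer i reuses the slot of layer i - window.
            start = max(copy_free, compute_end[i - window])
            load_end[i] = start + t_load
            copy_free = load_end[i]
        begin = max(prev_end, load_end[i])
        stall += begin - prev_end     # idle time waiting for the copy
        compute_end[i] = begin + t_compute
        prev_end = compute_end[i]
    return stall


if __name__ == "__main__":
    # Hypothetical timings: loading a layer takes twice as long as running it.
    for w in (1, 2, 3, 4):
        print(w, simulate_stall(num_layers=4, window=w, t_load=2, t_compute=1))
```

Under these assumptions the stall time shrinks as the window grows, mostly because more layers are resident before the pass begins. Whether that holds on real hardware is exactly what I am asking about.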