Description
Currently `MLXLMCommon` has some basic support for a cache, however it isn't persisted across calls to `generate()`.

Even though it appears there could be a way to pass a `KVCache` to `generate()`, the cache would ultimately have to cross a `Sendable` boundary if the app is to manage it. That isn't possible, since `MLXArray` is not `Sendable`, and it also isn't desirable or necessary.
A prompt cache could instead be managed by the `ModelContainer` actor and stored in its context as `ModelContext.promptCache`. Note that the prompt cache is an array of `KVCache`. In `mlx_lm`, the `PromptCache` object also stores the token ids of the cached prompt and the model key, used to check whether the model has changed. We could implement a similar struct:
```swift
public struct PromptCache {
    public let cache: [KVCache]
    public let modelKey: String
    public let tokens: MLXArray
}
```
The `PromptCache` struct could also have functions for trimming.
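For example, a prefix-comparison helper could decide how much of the cache is reusable and how far the KV caches would need to be trimmed. This is only a sketch; the `commonPrefixCount` name is an assumption, not an existing API:

```swift
extension PromptCache {
    /// Number of leading tokens that `prompt` shares with the cached
    /// tokens. A count shorter than the cached length means the KV
    /// caches would need trimming back to that length before reuse
    /// (as mlx_lm does with trim_prompt_cache).
    public func commonPrefixCount(with prompt: MLXArray) -> Int {
        let cached = tokens.asArray(Int.self)
        let incoming = prompt.asArray(Int.self)
        var i = 0
        while i < min(cached.count, incoming.count), cached[i] == incoming[i] {
            i += 1
        }
        return i
    }
}
```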
Functions analogous to `mlx_lm`'s `get_prompt_cache` could go in the `ModelContainer` actor.
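A rough sketch of how that could look, assuming the proposed `ModelContext.promptCache` property, a hypothetical `newCache()` factory on the model, and a `commonPrefixCount(with:)` prefix-comparison helper on `PromptCache` (all of these names are assumptions, not existing API):

```swift
extension ModelContainer {
    /// Analogue of mlx_lm's get_prompt_cache: returns the KV caches to
    /// feed into generate() plus the number of prompt tokens already
    /// covered by the cache. `context.promptCache`, `newCache()`, and
    /// `commonPrefixCount(with:)` are hypothetical names for this sketch.
    public func promptCache(for prompt: MLXArray, modelKey: String) -> ([KVCache], alreadyProcessed: Int) {
        if let cached = context.promptCache, cached.modelKey == modelKey {
            let prefix = cached.commonPrefixCount(with: prompt)
            if prefix == cached.tokens.shape[0] {
                // The new prompt extends the cached one: reuse the cache
                // as-is and only process the remaining suffix.
                return (cached.cache, prefix)
            }
            // Prompts diverged: trim back to the common prefix (if the
            // cache types support it) or fall through and rebuild.
        }
        // Model changed or no usable cache: start fresh.
        let fresh = PromptCache(
            cache: context.model.newCache(), // hypothetical factory
            modelKey: modelKey,
            tokens: prompt)
        context.promptCache = fresh
        return (fresh.cache, 0)
    }
}
```

Keeping this on the actor means the `MLXArray`s in the cache never cross an isolation boundary, which sidesteps the `Sendable` problem described above.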
I'm currently having a go at implementing this. Interested in any suggestions on the best approach.