- Porting and implementing models
- MLXLMCommon -- common API for LLM and VLM
- MLXLLM -- large language model example implementations
- MLXVLM -- vision language model example implementations
MLXLMCommon contains types and code that are generic across many kinds of language models, from LLMs to VLMs:
- Evaluation
- KVCache
- Loading
- UserInput
A model is typically loaded using a `ModelFactory` and a `ModelConfiguration`:
```swift
// e.g. VLMModelFactory.shared
let modelFactory: ModelFactory

// e.g. VLMRegistry.paligemma3bMix448_8bit
let modelConfiguration: ModelConfiguration

let container = try await modelFactory.loadContainer(configuration: modelConfiguration)
```
The `container` provides an isolation context (an `actor`) to run inference in the model.
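All interaction with the model then goes through that actor. As a minimal sketch, assuming `perform` hands the closure the loaded `ModelContext`, you could read the parameter count inside the isolation context like this:

```swift
// Sketch: run work inside the container's isolation context.
// The closure receives the ModelContext bundled by the factory.
let numParams = await container.perform { context in
    context.model.numParameters()
}
print("Loaded model with \(numParams) parameters")
```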
Predefined `ModelConfiguration` instances are provided as static variables on the `ModelRegistry` types, or they can be created:
```swift
let modelConfiguration = ModelConfiguration(id: "mlx-community/paligemma-3b-mix-448-8bit")
```
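If the weights are already on disk, a configuration can also point at a local directory. This is a sketch only and assumes a directory-based `ModelConfiguration` initializer; check the `ModelConfiguration` source for the exact signature:

```swift
import Foundation

// Sketch (assumed initializer): load from a local directory containing
// the weights and config files instead of downloading from the hub.
let localConfiguration = ModelConfiguration(
    directory: URL(fileURLWithPath: "/path/to/model-directory")
)
```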
The flow inside the `ModelFactory` goes like this:
```swift
public class VLMModelFactory: ModelFactory {

    public func _load(
        hub: HubApi, configuration: ModelConfiguration,
        progressHandler: @Sendable @escaping (Progress) -> Void
    ) async throws -> ModelContext {
        // download the weight and config using HubApi
        // load the base configuration
        // using the typeRegistry create a model (random weights)
        // load the weights, apply quantization as needed, update the model
        // calls model.sanitize() for weight preparation
        // load the tokenizer
        // (vlm) load the processor configuration, create the processor
    }
}
```
Callers with specialized requirements can use these individual components to manually load models, if needed.
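For example, a caller can supply its own `HubApi` instance and observe download progress. A minimal sketch, assuming `loadContainer` exposes the same `hub` and `progressHandler` parameters as `_load` above:

```swift
import Hub

// Sketch: pass an explicit HubApi (e.g. configured with a custom download
// location) and report progress while the container loads.
// Assumes loadContainer mirrors the _load parameters shown above.
let hub = HubApi()
let container = try await modelFactory.loadContainer(
    hub: hub, configuration: modelConfiguration
) { progress in
    print("Download progress: \(Int(progress.fractionCompleted * 100))%")
}
```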
- Load the Model
- UserInput
- LMInput
- generate()
- NaiveStreamingDetokenizer
- TokenIterator
Once a model is loaded, you can evaluate a prompt or a series of messages. Minimally you need to prepare the user input:
```swift
let prompt = "Describe the image in English"
var input = UserInput(prompt: prompt, images: image.map { .url($0) })
input.processing.resize = .init(width: 256, height: 256)
```
This example shows adding some images and processing instructions -- if the model accepts text only, these parts can be omitted. The inference calls are the same.
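For a text-only model the input reduces to just the prompt (a minimal sketch, assuming the image and processing parameters default to empty):

```swift
// Sketch: text-only input -- no images or processing instructions needed.
let input = UserInput(prompt: "Summarize the plot of Hamlet in two sentences")
```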
Assuming you are using a `ModelContainer` (an actor that holds a `ModelContext`, the bundled set of types that implement a model), the first step is to convert the `UserInput` into the `LMInput` (LanguageModel Input):
```swift
let generateParameters: GenerateParameters
let input: UserInput

let result = try await modelContainer.perform { [input] context in
    let input = try context.processor.prepare(input: input)
```
Given that `input`, we can call `generate()` to produce a stream of tokens. In this example we use a `NaiveStreamingDetokenizer` to assist in converting the stream of tokens into text and print it. The stream is stopped after we hit a maximum number of tokens:
```swift
    var detokenizer = NaiveStreamingDetokenizer(tokenizer: context.tokenizer)

    return try MLXLMCommon.generate(
        input: input, parameters: generateParameters, context: context
    ) { tokens in

        if let last = tokens.last {
            detokenizer.append(token: last)
        }

        if let new = detokenizer.next() {
            print(new, terminator: "")
            fflush(stdout)
        }

        if tokens.count >= maxTokens {
            return .stop
        } else {
            return .more
        }
    }
}
```
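The value returned from `perform` is then available to the caller. A short sketch of using it, assuming the result of `generate()` carries the decoded output text and a tokens-per-second figure:

```swift
// Sketch: use the result returned from the perform closure.
// Assumes the generate result exposes `output` and `tokensPerSecond`.
print()
print("Tokens/second: \(result.tokensPerSecond)")
print(result.output)
```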