title = "SIP 015 - "Large Language Model Support" template = "main" date = "2023-09-04T23:09:00Z"
Summary: Provide a generic interface for access to large language model resources.
Owner(s): [email protected]
Created: Sept 04, 2023
In 2023, Artificial Intelligence (AI) emerged as one of the most impactful technologies across academic research, venture-backed startups, and large enterprises, particularly with the increased interest in large language models (LLMs). The nature of AI and LLM workloads on already-trained models lends itself very naturally to a serverless-style architecture.
Given this, this document proposes a new feature for Spin, describes that feature at a high level, and outlines an implementation for its initial release in Spin.
It also describes how this feature will work out of the box with Spin and some of the caveats involved, given that laptops and other commodity hardware aren't always ideal for running LLMs reliably.
In order to support LLM workloads, the following need to be added to Spin:

- A WIT file that defines the `llm` interface, including APIs for the two most common LLM tasks: inferencing and embeddings.
- SDK implementations for various programming languages.
- A default LLM implementation that runs cross-platform and with minimal setup by the user.
The WIT interface centers around two functions: `infer` and `generate-embeddings`. All other types defined in the interface are in service of one or both of these functions.

The `infer` function is used to generate text based on a string prompt, while the `generate-embeddings` function is meant to generate the "embeddings" (i.e., a numerical representation) of a given list of text inputs.
```
// An interface dedicated to performing inferencing for Large Language Models.
interface llm {
  /// A Large Language Model.
  type inferencing-model = string

  /// Inference request parameters
  record inferencing-params {
    /// The maximum tokens that should be inferred.
    ///
    /// Note: the backing implementation may return fewer tokens.
    max-tokens: u32,
    /// The amount the model should avoid repeating tokens.
    repeat-penalty: float32,
    /// The number of tokens the model should apply the repeat penalty to.
    repeat-penalty-last-n-token-count: u32,
    /// The randomness with which the next token is selected.
    temperature: float32,
    /// The number of possible next tokens the model will choose from.
    top-k: u32,
    /// The probability total of next tokens the model will choose from.
    top-p: float32
  }

  /// The set of errors which may be raised by functions in this interface
  variant error {
    model-not-supported,
    runtime-error(string),
    invalid-input(string)
  }

  /// An inferencing result
  record inferencing-result {
    /// The text generated by the model
    text: string,
    /// Usage information about the inferencing request
    usage: inferencing-usage
  }

  /// Usage information related to the inferencing result
  record inferencing-usage {
    /// Number of tokens in the prompt
    prompt-token-count: u32,
    /// Number of tokens generated by the inferencing operation
    generated-token-count: u32
  }

  /// Perform inferencing using the provided model and prompt with the given optional params
  infer: func(model: inferencing-model, prompt: string, params: option<inferencing-params>) -> result<inferencing-result, error>

  /// The model used for generating embeddings
  type embedding-model = string

  /// Generate embeddings for the supplied list of text
  generate-embeddings: func(model: embedding-model, text: list<string>) -> result<embeddings-result, error>

  /// Result of generating embeddings
  record embeddings-result {
    /// The embeddings generated by the request
    embeddings: list<list<float32>>,
    /// Usage related to the embeddings generation request
    usage: embeddings-usage
  }

  /// Usage related to an embeddings generation request
  record embeddings-usage {
    /// Number of tokens in the prompt
    prompt-token-count: u32,
  }
}
```
There are a few things to note about this interface:
- The `inferencing-model` and `embedding-model` types are aliases for strings. This is to maximize future compatibility as users want to support more and different types of models. Below we'll discuss which models are expected to be supported out of the box by Spin and how the SDKs will help users use these models without having to remember magic strings (a rough sketch follows this list).
- The `inferencing-result` record contains a `text` field of type `string`. Ideally this would be a `stream` that would allow streaming inferencing results back to the user, but streaming support is not yet ready for use, so we leave that as a possible future backwards-incompatible change.
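To make the SDK note above concrete, here is a rough, hypothetical Rust sketch of what a guest-facing binding over this interface could look like. The module layout, the constants `LLAMA2_CHAT` and `ALL_MINILM_L6_V2`, and the stubbed function bodies are illustrative assumptions that simply mirror the WIT definitions above; they are not a finalized Spin SDK surface.

```rust
// Hypothetical guest-side sketch mirroring the WIT `llm` interface.
// Names, module layout, and stub bodies are illustrative only.
mod llm {
    /// Well-known model names exposed as constants so users don't have to
    /// remember magic strings (at the WIT level a model is just a string).
    pub const LLAMA2_CHAT: &str = "llama2-chat";
    pub const ALL_MINILM_L6_V2: &str = "all-minilm-l6-v2"; // assumed embedding model name

    /// Mirrors the `error` variant in the WIT interface.
    #[derive(Debug)]
    pub enum Error {
        ModelNotSupported,
        RuntimeError(String),
        InvalidInput(String),
    }

    /// Mirrors `inferencing-result` and `inferencing-usage`, flattened for brevity.
    pub struct InferencingResult {
        pub text: String,
        pub prompt_token_count: u32,
        pub generated_token_count: u32,
    }

    /// In a real SDK these would be generated bindings over the WIT interface;
    /// here they are stubs so the example stands on its own.
    pub fn infer(model: &str, prompt: &str) -> Result<InferencingResult, Error> {
        let _ = (model, prompt);
        unimplemented!("provided by the Spin host via the `llm` WIT interface")
    }

    pub fn generate_embeddings(model: &str, text: &[String]) -> Result<Vec<Vec<f32>>, Error> {
        let _ = (model, text);
        unimplemented!("provided by the Spin host via the `llm` WIT interface")
    }
}

fn example() -> Result<(), llm::Error> {
    // Inferencing: generate text from a prompt.
    let completion = llm::infer(llm::LLAMA2_CHAT, "Tell me a joke about WebAssembly")?;
    println!("{} ({} tokens)", completion.text, completion.generated_token_count);

    // Embeddings: turn a list of strings into vectors of floats.
    let vectors = llm::generate_embeddings(
        llm::ALL_MINILM_L6_V2,
        &["serverless".to_string(), "wasm".to_string()],
    )?;
    println!("embedding dimension: {}", vectors[0].len());
    Ok(())
}
```

The main point of such a wrapper is the constants: model names remain plain strings at the WIT level, while the SDK gives the well-known ones memorable names.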
The WIT interface has no strong opinions about which models are supported. Any string value may be passed to the `infer` and `generate-embeddings` functions, and it is up to the implementation to return `error::model-not-supported` if the implementation does not support whatever is passed as the model to either of these functions.
Spin will support models by expecting them to be loadable from a well-known place on disk relative to the project directory. Models are expected to be located in the `.spin/ai-models` directory inside the project directory, under the name of the model passed to `infer` or `generate-embeddings`.

For example, for inferencing, if the user passes `foo-bar-baz` as the model to `infer`, Spin will look for a file at the path `.spin/ai-models/foo-bar-baz`.
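As a rough illustration of this lookup, host-side resolution for inferencing models might look something like the sketch below. The function name, error type, and overall structure are assumptions for illustration; only the `.spin/ai-models/<model>` path convention and the `model-not-supported` behavior come from this proposal.

```rust
use std::path::{Path, PathBuf};

/// Mirrors (part of) the `error` variant in the WIT interface.
enum LlmError {
    ModelNotSupported,
}

/// Resolve an inferencing model name to a file on disk, following the
/// `.spin/ai-models/<model>` convention described above.
/// Hypothetical helper; not part of the proposal itself.
fn resolve_inferencing_model(project_dir: &Path, model: &str) -> Result<PathBuf, LlmError> {
    let path = project_dir.join(".spin").join("ai-models").join(model);
    if path.is_file() {
        Ok(path)
    } else {
        // Anything the implementation cannot load must surface as
        // `model-not-supported` to the guest.
        Err(LlmError::ModelNotSupported)
    }
}
```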
To begin with, only models with the "llama" architecture will be supported. However, this can be relaxed in the future.
Embeddings models are slightly more complicated, in that both a `tokenizer.json` and a `model.safetensors` file are expected to be located in a directory named after the model. For example, for the `foo-bar-baz` model, Spin will look in the `.spin/ai-models/foo-bar-baz` directory for `tokenizer.json` and `model.safetensors`.
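Following the same pattern, and purely as an illustration, resolving an embeddings model might check for both expected files before loading anything. The struct and function below are hypothetical; only the directory layout is what the proposal specifies.

```rust
use std::path::{Path, PathBuf};

/// Files that make up an embeddings model on disk (hypothetical helper type).
struct EmbeddingModelFiles {
    tokenizer: PathBuf, // .spin/ai-models/<model>/tokenizer.json
    weights: PathBuf,   // .spin/ai-models/<model>/model.safetensors
}

/// Look up an embeddings model directory and verify both files are present;
/// a missing file should surface as `model-not-supported`, as with inferencing.
fn resolve_embedding_model(project_dir: &Path, model: &str) -> Option<EmbeddingModelFiles> {
    let dir = project_dir.join(".spin").join("ai-models").join(model);
    let tokenizer = dir.join("tokenizer.json");
    let weights = dir.join("model.safetensors");
    if tokenizer.is_file() && weights.is_file() {
        Some(EmbeddingModelFiles { tokenizer, weights })
    } else {
        None
    }
}
```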
In order to work towards the principle that Spin components cannot communicate with the outside world without the user explicitly opting in to such behavior, the user is required to specify which models can be used by a component by adding an `ai_models` key to their `spin.toml`:
```toml
[[component]]
id = "ai"
source = "ai.wasm"
ai_models = ["llama2-chat"]
```
The user will receive an error if they use a model that has not been allowed.
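A minimal sketch of how this allow list might be enforced at call time is shown below, assuming the `ai_models` key has already been parsed into a per-component configuration. The type and function names are illustrative; only the `ai_models` key and the error-on-disallowed-model behavior come from the proposal.

```rust
/// Hypothetical per-component configuration parsed from `spin.toml`.
struct ComponentConfig {
    /// Value of the `ai_models` key, e.g. ["llama2-chat"].
    ai_models: Vec<String>,
}

/// Reject any model the component has not been explicitly allowed to use.
fn check_model_allowed(config: &ComponentConfig, model: &str) -> Result<(), String> {
    if config.ai_models.iter().any(|m| m == model) {
        Ok(())
    } else {
        Err(format!(
            "component is not allowed to use AI model '{model}'; \
             add it to the `ai_models` list in spin.toml"
        ))
    }
}
```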
In the future we may wish to extend this feature in the following ways. Note that this list is not meant to be exhaustive but represents the current thinking about how this feature may evolve over time.
- More control over where models are expected to live, including control of this using the `runtime-config` feature.
- Support for more inferencing model architectures beyond "llama".