title = "SIP 015 - Large Language Model Support"
template = "main"
date = "2023-09-04T23:09:00Z"


Summary: Provide a generic interface for access to large language model resources.

Owner(s): [email protected]

Created: Sept 04, 2023

Background

In 2023, Artificial Intelligence (AI) emerged as one of the most impactful technologies across academic research, venture-backed startups, and large enterprises, particularly with the increased interest in large language models (LLMs). The nature of AI and LLM workloads on already-trained models lends itself very naturally to a serverless-style architecture.

Given this, this document proposes a new feature for Spin: it describes the feature at a high level and proposes an implementation for its initial release in Spin.

This document also describes how the feature will work out of the box with Spin, along with some of its caveats, given that laptops and other commodity hardware are not always ideal for running LLMs reliably.

Proposal

In order to support LLM workloads, the following need to be added to Spin:

  • A WIT file that defines the llm interface, including APIs for the two most common LLM tasks: inferencing and embedding generation.
  • SDK implementations for various programming languages
  • A default LLM implementation that runs cross-platform and requires minimal setup by the user.

Interface (.wit)

The WIT interface centers around two functions: infer and generate-embeddings. All other types defined in the interface are in service of one or both of these functions.

The infer function is used to generate text based on a string prompt, while the generate-embeddings function is meant to generate the “embeddings” (i.e., numerical representations) of a given list of text inputs.

// An interface dedicated to performing inferencing for Large Language Models.
interface llm {
	/// A Large Language Model.
	type inferencing-model = string

	/// Inference request parameters
	record inferencing-params {
		/// The maximum tokens that should be inferred.
		///
		/// Note: the backing implementation may return fewer tokens.
		max-tokens: u32,
		/// The amount the model should avoid repeating tokens.
		repeat-penalty: float32,
		/// The number of tokens the model should apply the repeat penalty to.
		repeat-penalty-last-n-token-count: u32,
		/// The randomness with which the next token is selected.
		temperature: float32,
		/// The number of possible next tokens the model will choose from.
		top-k: u32,
		/// The probability total of next tokens the model will choose from.
		top-p: float32
	}

	/// The set of errors which may be raised by functions in this interface
	variant error {
		model-not-supported,
		runtime-error(string),
		invalid-input(string)
	}

	/// An inferencing result
	record inferencing-result {
		/// The text generated by the model
		text: string,
		/// Usage information about the inferencing request
		usage: inferencing-usage
	}

	/// Usage information related to the inferencing result
	record inferencing-usage {
		/// Number of tokens in the prompt
		prompt-token-count: u32,
		/// Number of tokens generated by the inferencing operation
		generated-token-count: u32
	}

	/// Perform inferencing using the provided model and prompt with the given optional params
	infer: func(model: inferencing-model, prompt: string, params: option<inferencing-params>) -> result<inferencing-result, error>

	/// The model used for generating embeddings
	type embedding-model = string

	/// Generate embeddings for the supplied list of text
	generate-embeddings: func(model: embedding-model, text: list<string>) -> result<embeddings-result, error>

	/// Result of generating embeddings
	record embeddings-result {
		/// The embeddings generated by the request
		embeddings: list<list<float32>>,
		/// Usage related to the embeddings generation request
		usage: embeddings-usage
	}

	/// Usage related to an embeddings generation request
	record embeddings-usage {
		/// Number of tokens in the prompt
		prompt-token-count: u32,
	}
}

There are a few things to note about this interface:

  • The inferencing-model and embedding-model types are aliases for strings. This is to maximize future compatibility as users want to support more and different types of models. Below we discuss which models are expected to be supported out of the box by Spin and how the SDKs will help users use these models without having to remember magic strings (see the sketch after this list).
  • The inferencing-result record contains a text field of type string. Ideally this would be a stream that would allow streaming inferencing results back to the user, but alas streaming support is not yet ready for use, so we leave that as a possible future backwards-incompatible change.
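
As an illustration, a guest-side Rust SDK wrapping this interface might look roughly like the sketch below. The module path, type names, and function names (spin_sdk::llm, InferencingParams, and so on) simply mirror the WIT above and are hypothetical, not a final SDK surface; the model names reuse examples from elsewhere in this document.

```rust
// Hypothetical Rust SDK sketch mirroring the WIT interface above.
// Names (spin_sdk::llm, InferencingParams, ...) are illustrative only.
use spin_sdk::llm;

fn run() -> Result<(), llm::Error> {
    // SDKs can expose well-known model names so users need not remember
    // magic strings; under the hood the model is still just a string.
    let model = "llama2-chat";

    // infer: generate text from a prompt, optionally tuning the
    // inferencing-params record defined in the WIT.
    let params = llm::InferencingParams {
        max_tokens: 128,
        repeat_penalty: 1.1,
        repeat_penalty_last_n_token_count: 64,
        temperature: 0.8,
        top_k: 40,
        top_p: 0.9,
    };
    let inference = llm::infer(model, "Say hello to Spin users", Some(params))?;
    println!(
        "{} ({} tokens generated)",
        inference.text, inference.usage.generated_token_count
    );

    // generate-embeddings: produce one vector of float32 per input string.
    let result = llm::generate_embeddings("foo-bar-baz", &["Hello, world".to_string()])?;
    println!("embedding dimensions: {}", result.embeddings[0].len());

    Ok(())
}
```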

Model Support

The WIT interface has no strong opinions about which models are supported. Any string value may be passed to the infer and generate-embeddings functions, and it is up to the implementation to return error::model-not-supported if the implementation does not support whatever is passed as the model to either of these functions.

Spin will support models by expecting them to be loadable from a well-known place on disk relative to the project directory. Models are expected to be located in the .spin/ai-models directory inside the project directory, under the name of the model passed to infer or generate-embeddings.

Inferencing

For example, for inferencing, if the user passes foo-bar-baz as the model to infer, Spin will look for a file at the path .spin/ai-models/foo-bar-baz.

To begin with, only models with the “llama” architecture will be supported. However, this restriction can be relaxed in the future.

Embeddings

Embeddings models are slightly more complicated in that both a tokenizer.json and a model.safetensors file are expected to be located in a directory named after the model. For example, for the foo-bar-baz model, Spin will look in the .spin/ai-models/foo-bar-baz directory for tokenizer.json and model.safetensors.
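
Putting the two conventions together, a project directory might look like the following. The project layout and file names are illustrative; llama2-chat and foo-bar-baz reuse example model names from elsewhere in this document.

```
my-app/                          # hypothetical project directory
├── spin.toml
├── ai.wasm
└── .spin/
    └── ai-models/
        ├── llama2-chat          # inferencing model: a single "llama"-architecture model file
        └── foo-bar-baz/         # embeddings model: a directory named after the model
            ├── tokenizer.json
            └── model.safetensors
```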

spin.toml Changes

In order to work towards the principle that Spin components cannot communicate with the outside world without the user explicitly opting in to such behavior, the user is required to specify which models can be used by a component by adding an ai_models key to their spin.toml:

[[component]]
id = "ai"
source = "ai.wasm"
ai_models = ["llama2-chat"]

The user will receive an error if they use a model that has not been allowed.
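
A guest component can surface these failures by matching on the error variant defined in the WIT. The sketch below is hypothetical: the Rust names simply mirror the interface, and which variant the host uses when a model is not in the ai_models allow-list is left as an implementation detail of this proposal.

```rust
// Hypothetical handling of the WIT error variant; names are illustrative.
use spin_sdk::llm::{self, Error};

fn infer_or_fallback(prompt: &str) -> String {
    match llm::infer("llama2-chat", prompt, None) {
        Ok(result) => result.text,
        // The host may reject a model it does not support (or, under this
        // proposal, one the component has not been granted access to).
        Err(Error::ModelNotSupported) => "model not available".to_string(),
        Err(Error::RuntimeError(msg)) | Err(Error::InvalidInput(msg)) => {
            format!("llm error: {}", msg)
        }
    }
}
```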

Future Possibilities

In the future we may wish to extend this feature in the following ways. Note that this list is not meant to be exhaustive but represents the current thinking about how this feature may evolve over time.

  • More control over where models are expected to live, including configuring this via the runtime-config feature.
  • Support for more inferencing model architectures beyond “llama”