---
description: Instructions for using models hosted on a filesystem with Spice.
---

# Filesystem Hosted Models

To use a model hosted on a filesystem, specify the path to the model file or folder in the `from` field:

```yaml
models:
  - from: file://models/llms/llama3.2-1b-instruct/
    name: llama3
    params:
      model_type: llama
```

Supported formats include GGUF, GGML, and SafeTensor for large language models (LLMs) and ONNX for traditional machine learning (ML) models.

## Configuration

### `from`

An absolute or relative path to the model file or folder:

```yaml
from: file://absolute/path/models/llms/llama3.2-1b-instruct/
from: file:models/llms/llama3.2-1b-instruct/
```

### `params` (optional)

| Param | Description |
| --- | --- |
| `model_type` | The architecture to load the model as. Supported values: `mistral`, `gemma`, `mixtral`, `llama`, `phi2`, `phi3`, `qwen2`, `gemma2`, `starcoder2`, `phi3.5moe`, `deepseekv2`, `deepseek`. |
| `tools` | Which tools should be made available to the model. Set to `auto` to use all available tools. |
| `system_prompt` | An additional system prompt used for all chat completions to this model. |
| `chat_template` | Customizes the transformation of OpenAI chat messages into a character stream for the model. See Overriding the Chat Template. |

See Large Language Models for additional configuration options.
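
For example, these parameters can be combined in a single model definition. The following is a minimal sketch using the same Llama model path as above; the `system_prompt` text is an illustrative placeholder.

```yaml
models:
  - from: file://models/llms/llama3.2-1b-instruct/
    name: llama3
    params:
      model_type: llama
      # Expose all available tools to the model.
      tools: auto
      # Hypothetical system prompt; replace with instructions suited to your use case.
      system_prompt: |
        You are a helpful assistant. Answer concisely.
```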

### `files` (optional)

The `files` field specifies additional files required by the model, such as tokenizer and configuration files.

```yaml
- name: local-model
  from: file://models/llms/llama3.2-1b-instruct/model.safetensors
  files:
    - path: models/llms/llama3.2-1b-instruct/tokenizer.json
    - path: models/llms/llama3.2-1b-instruct/tokenizer_config.json
    - path: models/llms/llama3.2-1b-instruct/config.json
```

## Examples

### Loading a GGML Model

```yaml
models:
  - from: file://absolute/path/to/my/model.ggml
    name: local_ggml_model
    files:
      - path: models/llms/ggml/tokenizer.json
      - path: models/llms/ggml/tokenizer_config.json
      - path: models/llms/ggml/config.json
```

### Loading a SafeTensor Model

```yaml
models:
  - name: safety
    from: file:models/llms/llama3.2-1b-instruct/model.safetensors
    files:
      - path: models/llms/llama3.2-1b-instruct/tokenizer.json
      - path: models/llms/llama3.2-1b-instruct/tokenizer_config.json
      - path: models/llms/llama3.2-1b-instruct/config.json
```

### Loading an LLM from a Directory

```yaml
models:
  - name: llama3
    from: file:models/llms/llama3.2-1b-instruct/
```

Note: The folder provided should contain all the expected files, such as the model weights, `tokenizer.json`, `tokenizer_config.json`, and `config.json` (see the examples above).

### Loading an ONNX Model

```yaml
models:
  - from: file://absolute/path/to/my/model.onnx
    name: local_fs_model
```

### Loading a GGUF Model

```yaml
models:
  - from: file://absolute/path/to/my/model.gguf
    name: local_gguf_model
```

## Overriding the Chat Template

Chat templates convert OpenAI-compatible chat messages (see format) and other components of a request into a stream of characters for the language model. Templates follow Jinja templating syntax.

Further details on chat templates can be found here.

```yaml
models:
  - name: local_model
    from: file:path/to/my/model.gguf
    params:
      chat_template: |
        {% set loop_messages = messages %}
        {% for message in loop_messages %}
          {% set content = '<|start_header_id|>' + message['role'] + '<|end_header_id|>\n\n' + message['content'] | trim + '<|eot_id|>' %}
          {{ content }}
        {% endfor %}
        {% if add_generation_prompt %}
          {{ '<|start_header_id|>assistant<|end_header_id|>\n\n' }}
        {% endif %}
```

### Templating Variables

- `messages`: List of chat messages, in the OpenAI format.
- `add_generation_prompt`: Boolean flag indicating whether to add a generation prompt.
- `tools`: List of callable tools, in the OpenAI format (see the sketch below).
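
As an illustrative sketch (not taken verbatim from the Spice documentation), a template can reference `tools` and `add_generation_prompt` alongside `messages`. The `<|start_header_id|>`/`<|eot_id|>` markers follow the Llama-style example above; adjust them to match your model, and note that the tool fields assume the OpenAI tool format.

```yaml
models:
  - name: local_model
    from: file:path/to/my/model.gguf
    params:
      tools: auto
      chat_template: |
        {# List available tools (OpenAI tool format) before the conversation. #}
        {% if tools %}
          {{ 'Available tools:\n' }}
          {% for tool in tools %}
            {{ tool['function']['name'] + ': ' + tool['function']['description'] }}
          {% endfor %}
        {% endif %}
        {% for message in messages %}
          {{ '<|start_header_id|>' + message['role'] + '<|end_header_id|>\n\n' + message['content'] | trim + '<|eot_id|>' }}
        {% endfor %}
        {% if add_generation_prompt %}
          {{ '<|start_header_id|>assistant<|end_header_id|>\n\n' }}
        {% endif %}
```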

{% hint style="warning" %}

**Limitations**

- The throughput, concurrency, and latency of a locally hosted model will vary based on the underlying hardware and model size. Spice supports Apple Metal and CUDA for accelerated inference.

{% endhint %}