Merged
8 changes: 6 additions & 2 deletions .github/workflows/test_server.yml
@@ -36,10 +36,14 @@ jobs:
          python -m pip install --upgrade pip
          python -m pip check
          pip install -e .[llm-oga-cpu]
          lemonade-server-dev pull Qwen2.5-0.5B-Instruct-CPU
      - name: Run server tests (network online mode)
        shell: bash -el {0}
        run: |
          python test/lemonade/server.py
      - name: Run server tests (offline mode)
        shell: bash -el {0}
        run: |
          python test/lemonade/server.py --offline


13 changes: 13 additions & 0 deletions docs/lemonade/server_integration.md
@@ -70,6 +70,17 @@ https://github.com/onnx/turnkeyml/releases/download/v6.0.0/Lemonade_Server_Installer.exe

Please note that the Server Installer is only available on Windows. Apps that integrate with our server on a Linux machine must install Lemonade from source as described [here](https://github.com/onnx/turnkeyml/blob/main/docs/lemonade/getting_started.md#from-source-code).

### Installing Additional Models

Lemonade Server installations always come with at least one LLM installed. If you want to install additional models on behalf of your users, the following tools are available:

- Discovering which LLMs are available:
- [A human-readable list of supported models](https://github.com/onnx/turnkeyml/blob/main/docs/lemonade/server_models.md).
- [A JSON file with the list of supported models](https://github.com/onnx/turnkeyml/tree/main/src/lemonade_server/server_models.json) is included in every Lemonade Server installation.
- Installing LLMs:
- [The `pull` endpoint in the server](https://github.com/onnx/turnkeyml/blob/main/docs/lemonade/server_spec.md#get-apiv0pull-).
- `lemonade-server pull MODEL` on the command line interface.
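
As a sketch of how an app might combine these two tools, the snippet below filters a hand-written excerpt shaped like `server_models.json` and builds the JSON body for the `pull` endpoint. The dictionary schema here is an assumption for illustration; consult the copy of `server_models.json` shipped with your installation for the real structure.

```python
import json

# Hand-written excerpt in the assumed shape of server_models.json;
# the real file is included in every Lemonade Server installation.
SERVER_MODELS = {
    "Qwen2.5-0.5B-Instruct-CPU": {"recipe": "oga-cpu", "reasoning": False},
    "Llama-3.2-1B-Instruct-Hybrid": {"recipe": "oga-hybrid", "reasoning": False},
}

def models_for_recipe(models, recipe):
    """Names of models that run on the given recipe, e.g. 'oga-cpu'."""
    return [name for name, info in models.items() if info["recipe"] == recipe]

def pull_request_body(model_name):
    """JSON body an app would POST to the server's pull endpoint."""
    return json.dumps({"model_name": model_name})

cpu_models = models_for_recipe(SERVER_MODELS, "oga-cpu")
print(cpu_models)  # ['Qwen2.5-0.5B-Instruct-CPU']
print(pull_request_body(cpu_models[0]))
```

The same payload shape works for the CLI path by substituting the model name into `lemonade-server pull MODEL`.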

## Stand-Alone Server Integration

Some apps might prefer to be responsible for installing and managing Lemonade Server on behalf of the user. This part of the guide covers installing and running Lemonade Server so that your users don't have to do it themselves.
@@ -94,6 +105,8 @@ lemonade-server --port 8123

You can also run the server as a background process using a subprocess or any preferred method.

To stop the server, you may use the `lemonade-server stop` command, or simply terminate the process you created by keeping track of its PID. Please do not run the `lemonade-server stop` command if your application has not started the server, as the server may be used by other applications.
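One way to follow the PID-tracking advice above is to hold on to the process handle your app created. This is a minimal sketch; the placeholder command stands in for the real `lemonade-server --port 8123` invocation so the example is self-contained.

```python
import subprocess
import sys

def start_background(cmd):
    """Launch a command as a background process and return the handle,
    which keeps its PID for later shutdown."""
    return subprocess.Popen(cmd)

def stop_background(proc, timeout=10):
    """Terminate only the process this app started, escalating to kill
    if it does not exit within the timeout."""
    proc.terminate()
    try:
        proc.wait(timeout=timeout)
    except subprocess.TimeoutExpired:
        proc.kill()
        proc.wait()

# Placeholder command for illustration; an app would pass its real
# server invocation, e.g. ["lemonade-server", "--port", "8123"].
proc = start_background([sys.executable, "-c", "import time; time.sleep(60)"])
print("started PID", proc.pid)
stop_background(proc)
print("server process stopped")
```

Terminating only the handle you created avoids the shared-server problem that `lemonade-server stop` can cause when other applications are using the server.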

### Silent Installation

Silent installation runs `Lemonade_Server_Installer.exe` without a GUI and automatically accepts all prompts.
125 changes: 116 additions & 9 deletions docs/lemonade/server_models.md
@@ -6,17 +6,22 @@
## Naming Convention
The format of each Lemonade name is a combination of the name in the base checkpoint and the backend where the model will run. So, if the base checkpoint is `meta-llama/Llama-3.2-1B-Instruct`, and it has been optimized to run on Hybrid, the resulting name is `Llama-3.2-1B-Instruct-Hybrid`.
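
The convention can be sketched as a one-line transformation. This is illustrative only; the published names are curated by hand, so a few (e.g. quantized checkpoints) do not match mechanically.

```python
def lemonade_name(checkpoint, backend):
    """Join the base checkpoint's model name with the backend suffix."""
    base = checkpoint.split("/")[-1]  # drop the org prefix, e.g. "meta-llama/"
    return f"{base}-{backend}"

print(lemonade_name("meta-llama/Llama-3.2-1B-Instruct", "Hybrid"))
# Llama-3.2-1B-Instruct-Hybrid
```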

## Installing Additional Models

Once you've installed Lemonade Server, you can install any model on this list using the `lemonade-server pull` command.

Example:

```bash
lemonade-server pull Qwen2.5-0.5B-Instruct-CPU
```

> Note: `lemonade-server` is a utility that is added to your PATH when you install Lemonade Server with the GUI installer.
> If you are using Lemonade Server from a Python environment, use the `lemonade-server-dev pull` command instead.

## Supported Models

### Hybrid

<details>
<summary>Llama-3.2-1B-Instruct-Hybrid</summary>
@@ -84,3 +89,105 @@ The format of each Lemonade name is a combination of the name in the base checkpoint

</details>

<details>
<summary>Mistral-7B-v0.3-Instruct-Hybrid</summary>

| Key | Value |
| --- | ----- |
| Checkpoint | [amd/Mistral-7B-Instruct-v0.3-awq-g128-int4-asym-fp16-onnx-hybrid](https://huggingface.co/amd/Mistral-7B-Instruct-v0.3-awq-g128-int4-asym-fp16-onnx-hybrid) |
| Recipe | oga-hybrid |
| Reasoning | False |

</details>

<details>
<summary>Llama-3.1-8B-Instruct-Hybrid</summary>

| Key | Value |
| --- | ----- |
| Checkpoint | [amd/Llama-3.1-8B-Instruct-awq-asym-uint4-g128-lmhead-onnx-hybrid](https://huggingface.co/amd/Llama-3.1-8B-Instruct-awq-asym-uint4-g128-lmhead-onnx-hybrid) |
| Recipe | oga-hybrid |
| Reasoning | False |

</details>


### CPU

<details>
<summary>Qwen2.5-0.5B-Instruct-CPU</summary>

| Key | Value |
| --- | ----- |
| Checkpoint | [amd/Qwen2.5-0.5B-Instruct-quantized_int4-float16-cpu-onnx](https://huggingface.co/amd/Qwen2.5-0.5B-Instruct-quantized_int4-float16-cpu-onnx) |
| Recipe | oga-cpu |
| Reasoning | False |

</details>

<details>
<summary>Llama-3.2-1B-Instruct-CPU</summary>

| Key | Value |
| --- | ----- |
| Checkpoint | [amd/Llama-3.2-1B-Instruct-awq-uint4-float16-cpu-onnx](https://huggingface.co/amd/Llama-3.2-1B-Instruct-awq-uint4-float16-cpu-onnx) |
| Recipe | oga-cpu |
| Reasoning | False |

</details>

<details>
<summary>Llama-3.2-3B-Instruct-CPU</summary>

| Key | Value |
| --- | ----- |
| Checkpoint | [amd/Llama-3.2-3B-Instruct-awq-uint4-float16-cpu-onnx](https://huggingface.co/amd/Llama-3.2-3B-Instruct-awq-uint4-float16-cpu-onnx) |
| Recipe | oga-cpu |
| Reasoning | False |

</details>

<details>
<summary>Phi-3-Mini-Instruct-CPU</summary>

| Key | Value |
| --- | ----- |
| Checkpoint | [amd/Phi-3-mini-4k-instruct_int4_float16_onnx_cpu](https://huggingface.co/amd/Phi-3-mini-4k-instruct_int4_float16_onnx_cpu) |
| Recipe | oga-cpu |
| Reasoning | False |

</details>

<details>
<summary>Qwen-1.5-7B-Chat-CPU</summary>

| Key | Value |
| --- | ----- |
| Checkpoint | [amd/Qwen1.5-7B-Chat_uint4_asym_g128_float16_onnx_cpu](https://huggingface.co/amd/Qwen1.5-7B-Chat_uint4_asym_g128_float16_onnx_cpu) |
| Recipe | oga-cpu |
| Reasoning | False |

</details>

<details>
<summary>DeepSeek-R1-Distill-Llama-8B-CPU</summary>

| Key | Value |
| --- | ----- |
| Checkpoint | [amd/DeepSeek-R1-Distill-Llama-8B-awq-asym-uint4-g128-lmhead-onnx-cpu](https://huggingface.co/amd/DeepSeek-R1-Distill-Llama-8B-awq-asym-uint4-g128-lmhead-onnx-cpu) |
| Recipe | oga-cpu |
| Reasoning | True |

</details>

<details>
<summary>DeepSeek-R1-Distill-Qwen-7B-CPU</summary>

| Key | Value |
| --- | ----- |
| Checkpoint | [amd/DeepSeek-R1-Distill-Qwen-7B-awq-asym-uint4-g128-lmhead-onnx-cpu](https://huggingface.co/amd/DeepSeek-R1-Distill-Qwen-7B-awq-asym-uint4-g128-lmhead-onnx-cpu) |
| Recipe | oga-cpu |
| Reasoning | True |

</details>

35 changes: 33 additions & 2 deletions docs/lemonade/server_spec.md
@@ -23,6 +23,7 @@ They focus on enabling client applications by extending existing cloud-focused A
- Unload models to save memory space.

The additional endpoints under development are:
- POST `/api/v0/pull` - Install a model
- POST `/api/v0/load` - Load a model
- POST `/api/v0/unload` - Unload a model
- POST `/api/v0/params` - Set generation parameters
@@ -250,9 +251,40 @@ curl http://localhost:8000/api/v0/models

## Additional Endpoints

### `POST /api/v0/pull` <sub>![Status](https://img.shields.io/badge/status-fully_available-green)</sub>

Install a model by downloading it and registering it with Lemonade Server.

#### Parameters

| Parameter | Required | Description |
|-----------|----------|-------------|
| `model_name` | Yes | [Lemonade Server model name](https://github.com/onnx/turnkeyml/blob/main/docs/lemonade/server_models.md) to install. |

Example request:

```bash
curl http://localhost:8000/api/v0/pull \
  -H "Content-Type: application/json" \
  -d '{
    "model_name": "Qwen2.5-0.5B-Instruct-CPU"
  }'
```

Response format:

```json
{
  "status": "success",
  "message": "Installed model: Qwen2.5-0.5B-Instruct-CPU"
}
```

In case of an error, the status will be `error` and the message will contain the error message.
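
A client can branch on that `status` field before trusting the result. Below is a minimal client-side sketch; the helper name is ours, not part of the API.

```python
import json

def parse_pull_response(body):
    """Return the success message, or raise if the server reported an error."""
    data = json.loads(body)
    if data.get("status") == "error":
        raise RuntimeError(data.get("message", "unknown error"))
    return data["message"]

ok = '{"status":"success","message":"Installed model: Qwen2.5-0.5B-Instruct-CPU"}'
print(parse_pull_response(ok))  # Installed model: Qwen2.5-0.5B-Instruct-CPU
```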

### `POST /api/v0/load` <sub>![Status](https://img.shields.io/badge/status-fully_available-green)</sub>

Explicitly load a model into memory. This is useful to ensure that the model is loaded before you make a request. Installs the model if necessary.

#### Parameters

@@ -321,7 +353,6 @@ Response format:

In case of an error, the status will be `error` and the message will contain the error message.


### `POST /api/v0/unload` <sub>![Status](https://img.shields.io/badge/status-partially_available-red)</sub>

Explicitly unload a model from memory. This is useful to free up memory while still leaving the server process running (which takes minimal resources but a few seconds to start).
1 change: 1 addition & 0 deletions examples/lemonade/server/README.md
@@ -12,6 +12,7 @@ This allows the same application to leverage local LLMs instead of relying on Op
| [Continue](https://www.continue.dev/) | [How to use Lemonade LLMs as a coding assistant in Continue](continue.md) | _coming soon_ |
| [Microsoft AI Toolkit](https://learn.microsoft.com/en-us/windows/ai/toolkit/) | [Experimenting with Lemonade LLMs in VS Code using Microsoft's AI Toolkit](ai-toolkit.md) | _coming soon_ |
| [CodeGPT](https://codegpt.co/) | [How to use Lemonade LLMs as a coding assistant in CodeGPT](codeGPT.md) | _coming soon_ |
| [MindCraft](mindcraft.md) | [How to use Lemonade LLMs as a Minecraft agent](mindcraft.md) | _coming soon_ |
| [wut](https://github.com/shobrook/wut) | [Terminal assistant that uses Lemonade LLMs to explain errors](wut.md) | _coming soon_ |
| [AnythingLLM](https://anythingllm.com/) | [Running agents locally with Lemonade and AnythingLLM](anythingLLM.md) | _coming soon_ |
| [lm-eval](https://github.com/EleutherAI/lm-evaluation-harness) | [A unified framework to test generative language models on a large number of different evaluation tasks.](lm-eval.md) | _coming soon_
2 changes: 1 addition & 1 deletion examples/lemonade/server/ai-toolkit.md
@@ -37,7 +37,7 @@ The AI Toolkit now supports "Bring Your Own Model" functionality, allowing you t
http://localhost:8000/api/v0/chat/completions
```
5. When prompted to "Enter the exact model name as in the API" select a model (e.g., `Phi-3-Mini-Instruct-Hybrid`)
- Note: You can get a list of all models available [here](https://github.com/onnx/turnkeyml/blob/main/docs/lemonade/server_models.md).
6. Select the same name as the display model name.
7. Skip the HTTP authentication step by pressing "Enter".
