Merged
8 changes: 6 additions & 2 deletions .github/workflows/test_server.yml
@@ -36,10 +36,14 @@ jobs:
          python -m pip install --upgrade pip
          python -m pip check
          pip install -e .[llm-oga-cpu]
          lemonade-server-dev pull Qwen2.5-0.5B-Instruct-CPU
      - name: Run server tests (network online mode)
        shell: bash -el {0}
        run: |
          python test/lemonade/server.py
      - name: Run server tests (offline mode)
        shell: bash -el {0}
        run: |
          python test/lemonade/server.py --offline


13 changes: 13 additions & 0 deletions docs/lemonade/server_integration.md
@@ -70,6 +70,17 @@ https://github.com/onnx/turnkeyml/releases/download/v6.0.0/Lemonade_Server_Installer.exe

Please note that the Server Installer is only available on Windows. Apps that integrate with our server on a Linux machine must install Lemonade from source as described [here](https://github.com/onnx/turnkeyml/blob/main/docs/lemonade/getting_started.md#from-source-code).

### Installing Additional Models

Lemonade Server installations always come with at least one LLM installed. If you want to install additional models on behalf of your users, the following tools are available:

- Discovering which LLMs are available:
- [A human-readable list of supported models](https://github.com/onnx/turnkeyml/blob/main/docs/lemonade/server_models.md).
- [A JSON file with the list of supported models](https://github.com/onnx/turnkeyml/tree/main/src/lemonade_server/server_models.json) is included in every Lemonade Server installation.
- Installing LLMs:
- [The `pull` endpoint in the server](https://github.com/onnx/turnkeyml/blob/main/docs/lemonade/server_spec.md#get-apiv0pull-).
- `lemonade-server pull MODEL` on the command line interface.
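
As a sketch of how an app might combine these two tools, the snippet below filters a hand-written excerpt shaped like `server_models.json` and builds the JSON body for the `pull` endpoint. The dictionary schema here is an assumption for illustration; consult the copy of `server_models.json` shipped with your installation for the real structure.

```python
import json

# Hand-written excerpt in the assumed shape of server_models.json;
# the real file is included in every Lemonade Server installation.
SERVER_MODELS = {
    "Qwen2.5-0.5B-Instruct-CPU": {"recipe": "oga-cpu", "reasoning": False},
    "Llama-3.2-1B-Instruct-Hybrid": {"recipe": "oga-hybrid", "reasoning": False},
}

def models_for_recipe(models, recipe):
    """Names of models that run on the given recipe, e.g. 'oga-cpu'."""
    return [name for name, info in models.items() if info["recipe"] == recipe]

def pull_request_body(model_name):
    """JSON body an app would POST to the server's pull endpoint."""
    return json.dumps({"model_name": model_name})

cpu_models = models_for_recipe(SERVER_MODELS, "oga-cpu")
print(cpu_models)  # ['Qwen2.5-0.5B-Instruct-CPU']
print(pull_request_body(cpu_models[0]))
```

The same payload shape works for the CLI path by substituting the model name into `lemonade-server pull MODEL`.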

## Stand-Alone Server Integration

Some apps might prefer to be responsible for installing and managing Lemonade Server on behalf of the user. This part of the guide covers installing and running Lemonade Server so that your users don't have to do it themselves.
@@ -94,6 +105,8 @@ lemonade-server --port 8123

You can also run the server as a background process using a subprocess or any preferred method.

To stop the server, you may use the `lemonade-server stop` command, or simply terminate the process you created by keeping track of its PID. Please do not run the `lemonade-server stop` command if your application has not started the server, as the server may be used by other applications.
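One way to follow the PID-tracking advice above is to hold on to the process handle your app created. This is a minimal sketch; the placeholder command stands in for the real `lemonade-server --port 8123` invocation so the example is self-contained.

```python
import subprocess
import sys

def start_background(cmd):
    """Launch a command as a background process and return the handle,
    which keeps its PID for later shutdown."""
    return subprocess.Popen(cmd)

def stop_background(proc, timeout=10):
    """Terminate only the process this app started, escalating to kill
    if it does not exit within the timeout."""
    proc.terminate()
    try:
        proc.wait(timeout=timeout)
    except subprocess.TimeoutExpired:
        proc.kill()
        proc.wait()

# Placeholder command for illustration; an app would pass its real
# server invocation, e.g. ["lemonade-server", "--port", "8123"].
proc = start_background([sys.executable, "-c", "import time; time.sleep(60)"])
print("started PID", proc.pid)
stop_background(proc)
print("server process stopped")
```

Terminating only the handle you created avoids the shared-server problem that `lemonade-server stop` can cause when other applications are using the server.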

### Silent Installation

Silent installation runs `Lemonade_Server_Installer.exe` without a GUI and automatically accepts all prompts.
125 changes: 116 additions & 9 deletions docs/lemonade/server_models.md
@@ -6,17 +6,22 @@
## Naming Convention
The format of each Lemonade name is a combination of the name in the base checkpoint and the backend where the model will run. So, if the base checkpoint is `meta-llama/Llama-3.2-1B-Instruct`, and it has been optimized to run on Hybrid, the resulting name is `Llama-3.2-1B-Instruct-Hybrid`.
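
The convention can be sketched as a one-line transformation. This is illustrative only; the published names are curated by hand, so a few (e.g. quantized checkpoints) do not match mechanically.

```python
def lemonade_name(checkpoint, backend):
    """Join the base checkpoint's model name with the backend suffix."""
    base = checkpoint.split("/")[-1]  # drop the org prefix, e.g. "meta-llama/"
    return f"{base}-{backend}"

print(lemonade_name("meta-llama/Llama-3.2-1B-Instruct", "Hybrid"))
# Llama-3.2-1B-Instruct-Hybrid
```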

## Installing Additional Models

Once you've installed Lemonade Server, you can install any model on this list using the `lemonade-server pull` command.

Example:

```bash
lemonade-server pull Qwen2.5-0.5B-Instruct-CPU
```

> Note: `lemonade-server` is a utility that is added to your PATH when you install Lemonade Server with the GUI installer.
> If you are using Lemonade Server from a Python environment, use the `lemonade-server-dev pull` command instead.

## Supported Models

### Hybrid

<details>
<summary>Llama-3.2-1B-Instruct-Hybrid</summary>
@@ -84,3 +89,105 @@ The format of each Lemonade name is a combination of the name in the base checkpoint

</details>

<details>
<summary>Mistral-7B-v0.3-Instruct-Hybrid</summary>

| Key | Value |
| --- | ----- |
| Checkpoint | [amd/Mistral-7B-Instruct-v0.3-awq-g128-int4-asym-fp16-onnx-hybrid](https://huggingface.co/amd/Mistral-7B-Instruct-v0.3-awq-g128-int4-asym-fp16-onnx-hybrid) |
| Recipe | oga-hybrid |
| Reasoning | False |

</details>

<details>
<summary>Llama-3.1-8B-Instruct-Hybrid</summary>

| Key | Value |
| --- | ----- |
| Checkpoint | [amd/Llama-3.1-8B-Instruct-awq-asym-uint4-g128-lmhead-onnx-hybrid](https://huggingface.co/amd/Llama-3.1-8B-Instruct-awq-asym-uint4-g128-lmhead-onnx-hybrid) |
| Recipe | oga-hybrid |
| Reasoning | False |

</details>


### CPU

<details>
<summary>Qwen2.5-0.5B-Instruct-CPU</summary>

| Key | Value |
| --- | ----- |
| Checkpoint | [amd/Qwen2.5-0.5B-Instruct-quantized_int4-float16-cpu-onnx](https://huggingface.co/amd/Qwen2.5-0.5B-Instruct-quantized_int4-float16-cpu-onnx) |
| Recipe | oga-cpu |
| Reasoning | False |

</details>

<details>
<summary>Llama-3.2-1B-Instruct-CPU</summary>

| Key | Value |
| --- | ----- |
| Checkpoint | [amd/Llama-3.2-1B-Instruct-awq-uint4-float16-cpu-onnx](https://huggingface.co/amd/Llama-3.2-1B-Instruct-awq-uint4-float16-cpu-onnx) |
| Recipe | oga-cpu |
| Reasoning | False |

</details>

<details>
<summary>Llama-3.2-3B-Instruct-CPU</summary>

| Key | Value |
| --- | ----- |
| Checkpoint | [amd/Llama-3.2-3B-Instruct-awq-uint4-float16-cpu-onnx](https://huggingface.co/amd/Llama-3.2-3B-Instruct-awq-uint4-float16-cpu-onnx) |
| Recipe | oga-cpu |
| Reasoning | False |

</details>

<details>
<summary>Phi-3-Mini-Instruct-CPU</summary>

| Key | Value |
| --- | ----- |
| Checkpoint | [amd/Phi-3-mini-4k-instruct_int4_float16_onnx_cpu](https://huggingface.co/amd/Phi-3-mini-4k-instruct_int4_float16_onnx_cpu) |
| Recipe | oga-cpu |
| Reasoning | False |

</details>

<details>
<summary>Qwen-1.5-7B-Chat-CPU</summary>

| Key | Value |
| --- | ----- |
| Checkpoint | [amd/Qwen1.5-7B-Chat_uint4_asym_g128_float16_onnx_cpu](https://huggingface.co/amd/Qwen1.5-7B-Chat_uint4_asym_g128_float16_onnx_cpu) |
| Recipe | oga-cpu |
| Reasoning | False |

</details>

<details>
<summary>DeepSeek-R1-Distill-Llama-8B-CPU</summary>

| Key | Value |
| --- | ----- |
| Checkpoint | [amd/DeepSeek-R1-Distill-Llama-8B-awq-asym-uint4-g128-lmhead-onnx-cpu](https://huggingface.co/amd/DeepSeek-R1-Distill-Llama-8B-awq-asym-uint4-g128-lmhead-onnx-cpu) |
| Recipe | oga-cpu |
| Reasoning | True |

</details>

<details>
<summary>DeepSeek-R1-Distill-Qwen-7B-CPU</summary>

| Key | Value |
| --- | ----- |
| Checkpoint | [amd/DeepSeek-R1-Distill-Qwen-7B-awq-asym-uint4-g128-lmhead-onnx-cpu](https://huggingface.co/amd/DeepSeek-R1-Distill-Qwen-7B-awq-asym-uint4-g128-lmhead-onnx-cpu) |
| Recipe | oga-cpu |
| Reasoning | True |

</details>

35 changes: 33 additions & 2 deletions docs/lemonade/server_spec.md
@@ -23,6 +23,7 @@ They focus on enabling client applications by extending existing cloud-focused A
- Unload models to save memory space.

The additional endpoints under development are:
- POST `/api/v0/pull` - Install a model
- POST `/api/v0/load` - Load a model
- POST `/api/v0/unload` - Unload a model
- POST `/api/v0/params` - Set generation parameters
@@ -250,9 +251,40 @@ curl http://localhost:8000/api/v0/models

## Additional Endpoints

### `POST /api/v0/pull` <sub>![Status](https://img.shields.io/badge/status-fully_available-green)</sub>

Install a model by downloading it and registering it with Lemonade Server.

#### Parameters

| Parameter | Required | Description |
|-----------|----------|-------------|
| `model_name` | Yes | [Lemonade Server model name](https://github.com/onnx/turnkeyml/blob/main/docs/lemonade/server_models.md) to install. |

Example request:

```bash
curl http://localhost:8000/api/v0/pull \
  -H "Content-Type: application/json" \
  -d '{
    "model_name": "Qwen2.5-0.5B-Instruct-CPU"
  }'
```

Response format:

```json
{
  "status": "success",
  "message": "Installed model: Qwen2.5-0.5B-Instruct-CPU"
}
```

In case of an error, the status will be `error` and the message will contain the error message.
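
A client can branch on that `status` field before trusting the result. Below is a minimal client-side sketch; the helper name is ours, not part of the API.

```python
import json

def parse_pull_response(body):
    """Return the success message, or raise if the server reported an error."""
    data = json.loads(body)
    if data.get("status") == "error":
        raise RuntimeError(data.get("message", "unknown error"))
    return data["message"]

ok = '{"status":"success","message":"Installed model: Qwen2.5-0.5B-Instruct-CPU"}'
print(parse_pull_response(ok))  # Installed model: Qwen2.5-0.5B-Instruct-CPU
```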

### `POST /api/v0/load` <sub>![Status](https://img.shields.io/badge/status-fully_available-green)</sub>

Explicitly load a model into memory. This is useful to ensure that the model is loaded before you make a request. Installs the model if necessary.

#### Parameters

@@ -321,7 +353,6 @@ Response format:

In case of an error, the status will be `error` and the message will contain the error message.


### `POST /api/v0/unload` <sub>![Status](https://img.shields.io/badge/status-partially_available-red)</sub>

Explicitly unload a model from memory. This is useful to free up memory while still leaving the server process running (which takes minimal resources but a few seconds to start).
1 change: 1 addition & 0 deletions examples/lemonade/server/README.md
@@ -12,6 +12,7 @@ This allows the same application to leverage local LLMs instead of relying on Op
| [Continue](https://www.continue.dev/) | [How to use Lemonade LLMs as a coding assistant in Continue](continue.md) | _coming soon_ |
| [Microsoft AI Toolkit](https://learn.microsoft.com/en-us/windows/ai/toolkit/) | [Experimenting with Lemonade LLMs in VS Code using Microsoft's AI Toolkit](ai-toolkit.md) | _coming soon_ |
| [CodeGPT](https://codegpt.co/) | [How to use Lemonade LLMs as a coding assistant in CodeGPT](codeGPT.md) | _coming soon_ |
| [MindCraft](mindcraft.md) | [How to use Lemonade LLMs as a Minecraft agent](mindcraft.md) | _coming soon_ |
| [wut](https://github.com/shobrook/wut) | [Terminal assistant that uses Lemonade LLMs to explain errors](wut.md) | _coming soon_ |
| [AnythingLLM](https://anythingllm.com/) | [Running agents locally with Lemonade and AnythingLLM](anythingLLM.md) | _coming soon_ |
| [lm-eval](https://github.com/EleutherAI/lm-evaluation-harness) | [A unified framework to test generative language models on a large number of different evaluation tasks.](lm-eval.md) | _coming soon_
2 changes: 1 addition & 1 deletion examples/lemonade/server/ai-toolkit.md
@@ -37,7 +37,7 @@ The AI Toolkit now supports "Bring Your Own Model" functionality, allowing you t
http://localhost:8000/api/v0/chat/completions
```
5. When prompted to "Enter the exact model name as in the API" select a model (e.g., `Phi-3-Mini-Instruct-Hybrid`)
- Note: You can get a list of all models available [here](https://github.com/onnx/turnkeyml/blob/main/docs/lemonade/server_models.md).
6. Select the same name as the display model name.
7. Skip the HTTP authentication step by pressing "Enter".
