Commit 8b577ec: Release v6.2.3 (#323)

Authored by: jeremyfowers, danielholanda, ramkrishna2910, vgodsoe, kovtcharov

Co-authored-by: Daniel Holanda <holand.daniel@gmail.com>
Co-authored-by: Krishna Sivakumar <Krishna.Sivakumar@amd.com>
Co-authored-by: Victoria Godsoe <victoria.godsoe@amd.com>
Co-authored-by: Kalin Ovtcharov <kalin@extropolis.ai>

Parent: f358003

File tree: 29 files changed, +1030 −110 lines


.github/workflows/test_server.yml

Lines changed: 4 additions & 0 deletions
```diff
@@ -37,6 +37,10 @@ jobs:
           python -m pip check
           pip install -e .[llm-oga-cpu]
           lemonade-server-dev pull Qwen2.5-0.5B-Instruct-CPU
+      - name: Run server tests (unit tests)
+        shell: bash -el {0}
+        run: |
+          python test/lemonade/server_unit.py
       - name: Run server tests (network online mode)
         shell: bash -el {0}
         run: |
```

docs/lemonade/server_models.md

Lines changed: 60 additions & 0 deletions
````diff
@@ -26,6 +26,10 @@ lemonade-server pull Qwen2.5-0.5B-Instruct-CPU
 <details>
 <summary>Llama-3.2-1B-Instruct-Hybrid</summary>
 
+```bash
+lemonade-server pull Llama-3.2-1B-Instruct-Hybrid
+```
+
 | Key | Value |
 | --- | ----- |
 | Checkpoint | [amd/Llama-3.2-1B-Instruct-awq-g128-int4-asym-fp16-onnx-hybrid](https://huggingface.co/amd/Llama-3.2-1B-Instruct-awq-g128-int4-asym-fp16-onnx-hybrid) |
@@ -37,6 +41,10 @@ lemonade-server pull Qwen2.5-0.5B-Instruct-CPU
 <details>
 <summary>Llama-3.2-3B-Instruct-Hybrid</summary>
 
+```bash
+lemonade-server pull Llama-3.2-3B-Instruct-Hybrid
+```
+
 | Key | Value |
 | --- | ----- |
 | Checkpoint | [amd/Llama-3.2-3B-Instruct-awq-g128-int4-asym-fp16-onnx-hybrid](https://huggingface.co/amd/Llama-3.2-3B-Instruct-awq-g128-int4-asym-fp16-onnx-hybrid) |
@@ -48,6 +56,10 @@ lemonade-server pull Qwen2.5-0.5B-Instruct-CPU
 <details>
 <summary>Phi-3-Mini-Instruct-Hybrid</summary>
 
+```bash
+lemonade-server pull Phi-3-Mini-Instruct-Hybrid
+```
+
 | Key | Value |
 | --- | ----- |
 | Checkpoint | [amd/Phi-3-mini-4k-instruct-awq-g128-int4-asym-fp16-onnx-hybrid](https://huggingface.co/amd/Phi-3-mini-4k-instruct-awq-g128-int4-asym-fp16-onnx-hybrid) |
@@ -59,6 +71,10 @@ lemonade-server pull Qwen2.5-0.5B-Instruct-CPU
 <details>
 <summary>Qwen-1.5-7B-Chat-Hybrid</summary>
 
+```bash
+lemonade-server pull Qwen-1.5-7B-Chat-Hybrid
+```
+
 | Key | Value |
 | --- | ----- |
 | Checkpoint | [amd/Qwen1.5-7B-Chat-awq-g128-int4-asym-fp16-onnx-hybrid](https://huggingface.co/amd/Qwen1.5-7B-Chat-awq-g128-int4-asym-fp16-onnx-hybrid) |
@@ -70,6 +86,10 @@ lemonade-server pull Qwen2.5-0.5B-Instruct-CPU
 <details>
 <summary>DeepSeek-R1-Distill-Llama-8B-Hybrid</summary>
 
+```bash
+lemonade-server pull DeepSeek-R1-Distill-Llama-8B-Hybrid
+```
+
 | Key | Value |
 | --- | ----- |
 | Checkpoint | [amd/DeepSeek-R1-Distill-Llama-8B-awq-asym-uint4-g128-lmhead-onnx-hybrid](https://huggingface.co/amd/DeepSeek-R1-Distill-Llama-8B-awq-asym-uint4-g128-lmhead-onnx-hybrid) |
@@ -81,6 +101,10 @@ lemonade-server pull Qwen2.5-0.5B-Instruct-CPU
 <details>
 <summary>DeepSeek-R1-Distill-Qwen-7B-Hybrid</summary>
 
+```bash
+lemonade-server pull DeepSeek-R1-Distill-Qwen-7B-Hybrid
+```
+
 | Key | Value |
 | --- | ----- |
 | Checkpoint | [amd/DeepSeek-R1-Distill-Qwen-7B-awq-asym-uint4-g128-lmhead-onnx-hybrid](https://huggingface.co/amd/DeepSeek-R1-Distill-Qwen-7B-awq-asym-uint4-g128-lmhead-onnx-hybrid) |
@@ -92,6 +116,10 @@ lemonade-server pull Qwen2.5-0.5B-Instruct-CPU
 <details>
 <summary>Mistral-7B-v0.3-Instruct-Hybrid</summary>
 
+```bash
+lemonade-server pull Mistral-7B-v0.3-Instruct-Hybrid
+```
+
 | Key | Value |
 | --- | ----- |
 | Checkpoint | [amd/Mistral-7B-Instruct-v0.3-awq-g128-int4-asym-fp16-onnx-hybrid](https://huggingface.co/amd/Mistral-7B-Instruct-v0.3-awq-g128-int4-asym-fp16-onnx-hybrid) |
@@ -103,6 +131,10 @@ lemonade-server pull Qwen2.5-0.5B-Instruct-CPU
 <details>
 <summary>Llama-3.1-8B-Instruct-Hybrid</summary>
 
+```bash
+lemonade-server pull Llama-3.1-8B-Instruct-Hybrid
+```
+
 | Key | Value |
 | --- | ----- |
 | Checkpoint | [amd/Llama-3.1-8B-Instruct-awq-asym-uint4-g128-lmhead-onnx-hybrid](https://huggingface.co/amd/Llama-3.1-8B-Instruct-awq-asym-uint4-g128-lmhead-onnx-hybrid) |
@@ -117,6 +149,10 @@ lemonade-server pull Qwen2.5-0.5B-Instruct-CPU
 <details>
 <summary>Qwen2.5-0.5B-Instruct-CPU</summary>
 
+```bash
+lemonade-server pull Qwen2.5-0.5B-Instruct-CPU
+```
+
 | Key | Value |
 | --- | ----- |
 | Checkpoint | [amd/Qwen2.5-0.5B-Instruct-quantized_int4-float16-cpu-onnx](https://huggingface.co/amd/Qwen2.5-0.5B-Instruct-quantized_int4-float16-cpu-onnx) |
@@ -128,6 +164,10 @@ lemonade-server pull Qwen2.5-0.5B-Instruct-CPU
 <details>
 <summary>Llama-3.2-1B-Instruct-CPU</summary>
 
+```bash
+lemonade-server pull Llama-3.2-1B-Instruct-CPU
+```
+
 | Key | Value |
 | --- | ----- |
 | Checkpoint | [amd/Llama-3.2-1B-Instruct-awq-uint4-float16-cpu-onnx](https://huggingface.co/amd/Llama-3.2-1B-Instruct-awq-uint4-float16-cpu-onnx) |
@@ -139,6 +179,10 @@ lemonade-server pull Qwen2.5-0.5B-Instruct-CPU
 <details>
 <summary>Llama-3.2-3B-Instruct-CPU</summary>
 
+```bash
+lemonade-server pull Llama-3.2-3B-Instruct-CPU
+```
+
 | Key | Value |
 | --- | ----- |
 | Checkpoint | [amd/Llama-3.2-3B-Instruct-awq-uint4-float16-cpu-onnx](https://huggingface.co/amd/Llama-3.2-3B-Instruct-awq-uint4-float16-cpu-onnx) |
@@ -150,6 +194,10 @@ lemonade-server pull Qwen2.5-0.5B-Instruct-CPU
 <details>
 <summary>Phi-3-Mini-Instruct-CPU</summary>
 
+```bash
+lemonade-server pull Phi-3-Mini-Instruct-CPU
+```
+
 | Key | Value |
 | --- | ----- |
 | Checkpoint | [amd/Phi-3-mini-4k-instruct_int4_float16_onnx_cpu](https://huggingface.co/amd/Phi-3-mini-4k-instruct_int4_float16_onnx_cpu) |
@@ -161,6 +209,10 @@ lemonade-server pull Qwen2.5-0.5B-Instruct-CPU
 <details>
 <summary>Qwen-1.5-7B-Chat-CPU</summary>
 
+```bash
+lemonade-server pull Qwen-1.5-7B-Chat-CPU
+```
+
 | Key | Value |
 | --- | ----- |
 | Checkpoint | [amd/Qwen1.5-7B-Chat_uint4_asym_g128_float16_onnx_cpu](https://huggingface.co/amd/Qwen1.5-7B-Chat_uint4_asym_g128_float16_onnx_cpu) |
@@ -172,6 +224,10 @@ lemonade-server pull Qwen2.5-0.5B-Instruct-CPU
 <details>
 <summary>DeepSeek-R1-Distill-Llama-8B-CPU</summary>
 
+```bash
+lemonade-server pull DeepSeek-R1-Distill-Llama-8B-CPU
+```
+
 | Key | Value |
 | --- | ----- |
 | Checkpoint | [amd/DeepSeek-R1-Distill-Llama-8B-awq-asym-uint4-g128-lmhead-onnx-cpu](https://huggingface.co/amd/DeepSeek-R1-Distill-Llama-8B-awq-asym-uint4-g128-lmhead-onnx-cpu) |
@@ -183,6 +239,10 @@ lemonade-server pull Qwen2.5-0.5B-Instruct-CPU
 <details>
 <summary>DeepSeek-R1-Distill-Qwen-7B-CPU</summary>
 
+```bash
+lemonade-server pull DeepSeek-R1-Distill-Qwen-7B-CPU
+```
+
 | Key | Value |
 | --- | ----- |
 | Checkpoint | [amd/DeepSeek-R1-Distill-Llama-8B-awq-asym-uint4-g128-lmhead-onnx-cpu](https://huggingface.co/amd/DeepSeek-R1-Distill-Llama-8B-awq-asym-uint4-g128-lmhead-onnx-cpu) |
````
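The tables above pair each Lemonade Server model name with its Hugging Face checkpoint, and each `<details>` block shows the matching `lemonade-server pull` command. As a quick illustration (not part of the Lemonade SDK), the sketch below copies a few of those pairs into a Python mapping and builds the documented pull command; the `MODELS` dict and `pull_command` helper are hypothetical names introduced for this example only.

```python
# Hypothetical mapping for this example: a few Lemonade Server model names
# (from the tables above) paired with their Hugging Face checkpoints.
MODELS = {
    "Llama-3.2-1B-Instruct-Hybrid":
        "amd/Llama-3.2-1B-Instruct-awq-g128-int4-asym-fp16-onnx-hybrid",
    "Qwen2.5-0.5B-Instruct-CPU":
        "amd/Qwen2.5-0.5B-Instruct-quantized_int4-float16-cpu-onnx",
    "Phi-3-Mini-Instruct-CPU":
        "amd/Phi-3-mini-4k-instruct_int4_float16_onnx_cpu",
}


def pull_command(name: str) -> str:
    """Return the CLI command documented above for downloading a model."""
    if name not in MODELS:
        raise KeyError(f"not a known Lemonade Server model: {name}")
    return f"lemonade-server pull {name}"
```

Running `pull_command("Llama-3.2-1B-Instruct-Hybrid")` yields the same command shown in that model's `<details>` block.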

docs/lemonade/server_spec.md

Lines changed: 82 additions & 0 deletions
````diff
@@ -9,6 +9,7 @@ We are also actively investigating and developing [additional endpoints](#additi
 ### OpenAI-Compatible Endpoints
 - POST `/api/v0/chat/completions` - Chat Completions (messages -> completion)
 - POST `/api/v0/completions` - Text Completions (prompt -> completion)
+- POST `/api/v0/responses` - Chat Completions (prompt|messages -> event)
 - GET `/api/v0/models` - List models available locally
 
 ### Additional Endpoints
@@ -65,6 +66,7 @@ Chat Completions API. You provide a list of messages and receive a completion. T
 | `stop` | No | Up to 4 sequences where the API will stop generating further tokens. The returned text will not contain the stop sequence. Can be a string or an array of strings. | <sub>![Status](https://img.shields.io/badge/available-green)</sub> |
 | `logprobs` | No | Include log probabilities of the output tokens. If true, returns the log probability of each output token. Defaults to false. | <sub>![Status](https://img.shields.io/badge/not_available-red)</sub> |
 | `temperature` | No | What sampling temperature to use. | <sub>![Status](https://img.shields.io/badge/available-green)</sub> |
+| `tools` | No | A list of tools the model may call. Only available when `stream` is set to `False`. | <sub>![Status](https://img.shields.io/badge/available-green)</sub> |
 | `max_tokens` | No | An upper bound for the number of tokens that can be generated for a completion. Mutually exclusive with `max_completion_tokens`. This value is now deprecated by OpenAI in favor of `max_completion_tokens`. | <sub>![Status](https://img.shields.io/badge/available-green)</sub> |
 | `max_completion_tokens` | No | An upper bound for the number of tokens that can be generated for a completion. Mutually exclusive with `max_tokens`. | <sub>![Status](https://img.shields.io/badge/available-green)</sub> |
 
@@ -207,6 +209,86 @@ The following format is used for both streaming and non-streaming responses:
 }
 ```
 
+
+### `POST /api/v0/responses` <sub>![Status](https://img.shields.io/badge/status-partially_available-green)</sub>
+
+Responses API. You provide an input and receive a response. This API will also load the model if it is not already loaded.
+
+#### Parameters
+
+| Parameter | Required | Description | Status |
+|-----------|----------|-------------|--------|
+| `input` | Yes | A list of dictionaries or a string input for the model to respond to. | <sub>![Status](https://img.shields.io/badge/available-green)</sub> |
+| `model` | Yes | The model to use for the response. | <sub>![Status](https://img.shields.io/badge/available-green)</sub> |
+| `max_output_tokens` | No | The maximum number of output tokens to generate. | <sub>![Status](https://img.shields.io/badge/available-green)</sub> |
+| `temperature` | No | What sampling temperature to use. | <sub>![Status](https://img.shields.io/badge/available-green)</sub> |
+| `stream` | No | If true, tokens will be sent as they are generated. If false, the response will be sent as a single message once complete. Defaults to false. | <sub>![Status](https://img.shields.io/badge/available-green)</sub> |
+
+> Note: The value for `model` is either a [Lemonade Server model name](https://github.com/onnx/turnkeyml/blob/main/docs/lemonade/server_models.md), or a checkpoint that has been pre-loaded using the [load endpoint](#get-apiv0load-status).
+
+#### Streaming Events
+
+The Responses API uses semantic events for streaming. Each event is typed with a predefined schema, so you can listen for the events you care about. Our initial implementation supports only:
+- `response.created`
+- `response.output_text.delta`
+- `response.completed`
+
+For a full list of event types, see the [API reference for streaming](https://platform.openai.com/docs/api-reference/responses-streaming).
+
+#### Example request
+
+PowerShell:
+
+```powershell
+Invoke-WebRequest -Uri "http://localhost:8000/api/v0/responses" `
+  -Method POST `
+  -Headers @{ "Content-Type" = "application/json" } `
+  -Body '{
+    "model": "Llama-3.2-1B-Instruct-Hybrid",
+    "input": "What is the population of Paris?",
+    "stream": false
+  }'
+```
+
+Bash:
+
+```bash
+curl -X POST http://localhost:8000/api/v0/responses \
+  -H "Content-Type: application/json" \
+  -d '{
+    "model": "Llama-3.2-1B-Instruct-Hybrid",
+    "input": "What is the population of Paris?",
+    "stream": false
+  }'
+```
+
+#### Response format
+
+For non-streaming responses:
+
+```json
+{
+  "id": "0",
+  "created_at": 1746225832.0,
+  "model": "Llama-3.2-1B-Instruct-Hybrid",
+  "object": "response",
+  "output": [{
+    "id": "0",
+    "content": [{
+      "annotations": [],
+      "text": "Paris has a population of approximately 2.2 million people in the city proper."
+    }]
+  }]
+}
+```
+
+For streaming responses, the API returns a series of events. Refer to the [OpenAI streaming guide](https://platform.openai.com/docs/guides/streaming-responses?api-mode=responses) for details.
+
+
 ### `GET /api/v0/models` <sub>![Status](https://img.shields.io/badge/status-fully_available-green)</sub>
 
 Returns a list of key models available on the server in an OpenAI-compatible format. We also expanded each model object with the `checkpoint` and `recipe` fields, which may be used to load a model using the `load` endpoint.
````
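As a companion to the PowerShell and Bash examples in the spec above, here is a minimal Python sketch of a non-streaming call to the new `/api/v0/responses` endpoint. The payload fields and the `output[0].content[0].text` shape follow the documented request parameters and response format; the base URL, model name, and the `build_payload`/`extract_text` helper names are assumptions made for this example.

```python
import json
import urllib.request

# Assumed local server address, matching the curl example in the spec.
BASE_URL = "http://localhost:8000/api/v0"


def build_payload(model: str, user_input: str, stream: bool = False) -> dict:
    """Build a request body for POST /api/v0/responses (non-streaming by default)."""
    return {"model": model, "input": user_input, "stream": stream}


def extract_text(response: dict) -> str:
    """Pull the generated text out of a non-streaming response object,
    following the documented shape: output[0].content[0].text."""
    return response["output"][0]["content"][0]["text"]


if __name__ == "__main__":
    payload = build_payload("Llama-3.2-1B-Instruct-Hybrid",
                            "What is the population of Paris?")
    req = urllib.request.Request(
        f"{BASE_URL}/responses",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        print(extract_text(json.load(resp)))
```

This sketch requires a running Lemonade Server with the named model available; the two helpers are pure functions, so the request-building and response-parsing logic can be reused with any HTTP client.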

examples/lemonade/demos/chat/chat_hybrid.py

Lines changed: 1 addition & 1 deletion
```diff
@@ -1,7 +1,7 @@
 import sys
 from threading import Thread, Event
 from transformers import StoppingCriteriaList
-from lemonade.tools.serve import StopOnEvent
+from lemonade.tools.server.serve import StopOnEvent
 from lemonade.api import from_pretrained
 from lemonade.tools.ort_genai.oga import OrtGenaiStreamer
 
```
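The import change above (repeated across the demo files below) reflects the server module moving from `lemonade.tools.serve` to `lemonade.tools.server.serve`. For readers unfamiliar with `StopOnEvent`, the following is a rough, self-contained sketch of the pattern it implements, inferred from how the demos use it alongside `threading.Event` and `StoppingCriteriaList`; it is not the actual Lemonade implementation.

```python
from threading import Event


class StopOnEvent:
    """Illustrative stand-in for lemonade.tools.server.serve.StopOnEvent:
    a HuggingFace-style stopping criterion that halts generation once
    the supplied threading.Event is set (e.g. from another thread)."""

    def __init__(self, stop_event: Event):
        self.stop_event = stop_event

    def __call__(self, input_ids=None, scores=None, **kwargs) -> bool:
        # transformers calls a stopping criterion after each generated
        # token; returning True tells the generate loop to stop early.
        return self.stop_event.is_set()


# Typical wiring in the demos: wrap the criterion in a
# StoppingCriteriaList and set the event to interrupt generation.
stop = Event()
criterion = StopOnEvent(stop)
```

In the demos, another thread sets the event (for example when the user interrupts the chat), and the generation loop observes the criterion returning `True` on its next token.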

examples/lemonade/demos/chat/chat_start.py

Lines changed: 1 addition & 1 deletion
```diff
@@ -3,7 +3,7 @@
 from queue import Queue
 from time import sleep
 from transformers import StoppingCriteriaList
-from lemonade.tools.serve import StopOnEvent
+from lemonade.tools.server.serve import StopOnEvent
 
 
 class TextStreamer:
```

examples/lemonade/demos/search/search_hybrid.py

Lines changed: 1 addition & 1 deletion
```diff
@@ -3,7 +3,7 @@
 from transformers import StoppingCriteriaList
 from lemonade.api import from_pretrained
 from lemonade.tools.ort_genai.oga import OrtGenaiStreamer
-from lemonade.tools.serve import StopOnEvent
+from lemonade.tools.server.serve import StopOnEvent
 
 employee_handbook = """
 1. You will work very hard every day.\n
```

examples/lemonade/demos/search/search_start.py

Lines changed: 1 addition & 1 deletion
```diff
@@ -3,7 +3,7 @@
 from queue import Queue
 from time import sleep
 from transformers import StoppingCriteriaList
-from lemonade.tools.serve import StopOnEvent
+from lemonade.tools.server.serve import StopOnEvent
 
 
 employee_handbook = """
```

examples/lemonade/server/README.md

Lines changed: 11 additions & 8 deletions
```diff
@@ -8,14 +8,17 @@ This allows the same application to leverage local LLMs instead of relying on Op
 
 | App | Guide | Video |
 |---------------------|-----------------------------------------------------------------------------------------------------|-------------------------------------------------------------------------------------------|
-| [Open WebUI](https://github.com/open-webui/open-webui) | [How to chat with Lemonade LLMs in Open WebUI](https://ryzenai.docs.amd.com/en/latest/llm/server_interface.html#open-webui-demo) | [Watch Demo](https://www.youtube.com/watch?v=PXNTDZREJ_A) |
-| [Continue](https://www.continue.dev/) | [How to use Lemonade LLMs as a coding assistant in Continue](continue.md) | _coming soon_ |
-| [Microsoft AI Toolkit](https://learn.microsoft.com/en-us/windows/ai/toolkit/) | [Experimenting with Lemonade LLMs in VS Code using Microsoft's AI Toolkit](ai-toolkit.md) | _coming soon_ |
-| [CodeGPT](https://codegpt.co/) | [How to use Lemonade LLMs as a coding assistant in CodeGPT](codeGPT.md) | _coming soon_ |
-[MindCraft](mindcraft.md) | [How to use Lemonade LLMs as a Minecraft agent](mindcraft.md) | _coming soon_ |
-| [wut](https://github.com/shobrook/wut) | [Terminal assistant that uses Lemonade LLMs to explain errors](wut.md) | _coming soon_ |
-| [AnythingLLM](https://anythingllm.com/) | [Running agents locally with Lemonade and AnythingLLM](anythingLLM.md) | _coming soon_ |
-| [lm-eval](https://github.com/EleutherAI/lm-evaluation-harness) | [A unified framework to test generative language models on a large number of different evaluation tasks.](lm-eval.md) | _coming soon_
+| [Open WebUI](https://github.com/open-webui/open-webui) | [How to chat with Lemonade LLMs in Open WebUI](https://ryzenai.docs.amd.com/en/latest/llm/server_interface.html#open-webui-demo) | [Watch Demo](https://www.youtube.com/watch?v=PXNTDZREJ_A) |
+| [Continue](https://www.continue.dev/) | [How to use Lemonade LLMs as a coding assistant in Continue](continue.md) | [Watch Demo](https://youtu.be/bP_MZnDpbUc?si=hRhLbLEV6V_OGlUt) |
+| [Microsoft AI Toolkit](https://learn.microsoft.com/en-us/windows/ai/toolkit/) | [Experimenting with Lemonade LLMs in VS Code using Microsoft's AI Toolkit](ai-toolkit.md) | [Watch Demo](https://youtu.be/JecpotOZ6qo?si=WxWVQhUBCJQgE6vX) |
+| [GAIA](https://github.com/amd/gaia) | [An application for running LLMs locally, includes a ChatBot, YouTube Agent, and more](https://github.com/amd/gaia?tab=readme-ov-file#getting-started-guide) | [Watch Demo](https://youtu.be/_PORHv_-atI?si=EYQjmrRQ6Zy2H0ek) |
+| [CodeGPT](https://codegpt.co/) | [How to use Lemonade LLMs as a coding assistant in CodeGPT](codeGPT.md) | _coming soon_ |
+| [MindCraft](mindcraft.md) | [How to use Lemonade LLMs as a Minecraft agent](mindcraft.md) | _coming soon_ |
+| [wut](https://github.com/shobrook/wut) | [Terminal assistant that uses Lemonade LLMs to explain errors](wut.md) | _coming soon_ |
+| [AnythingLLM](https://anythingllm.com/) | [Running agents locally with Lemonade and AnythingLLM](anythingLLM.md) | _coming soon_ |
+| [lm-eval](https://github.com/EleutherAI/lm-evaluation-harness) | [A unified framework to test generative language models on a large number of different evaluation tasks.](lm-eval.md) | _coming soon_ |
+| [PEEL](https://github.com/lemonade-apps/peel) | [Using Local LLMs in Windows PowerShell](https://github.com/lemonade-apps/peel?tab=readme-ov-file#installation) | _coming soon_ |
+
 ## 📦 Looking for Installation Help?
 
 To set up Lemonade Server, check out the [Lemonade_Server_Installer.exe guide](lemonade_server_exe.md) for installation instructions and the [server spec](https://github.com/onnx/turnkeyml/blob/main/docs/lemonade/server_spec.md) to learn more about the functionality. For more information about 🍋 Lemonade SDK, see the [Lemonade SDK README](https://github.com/onnx/turnkeyml/tree/main/docs/lemonade/).
```

examples/lemonade/server/mindcraft.md

Lines changed: 1 addition & 1 deletion
```diff
@@ -342,4 +342,4 @@ The following are examples of requests made by the Mindcraft software to the Lem
 TRACE: ::1:56890 - ASGI [6] Send {'type': 'http.response.body', 'body': '<0 bytes>', 'more_body': False}
 TRACE: ::1:56890 - ASGI [6] Completed
 TRACE: ::1:56890 - HTTP connection lost
-```
+```
```

The removed and added lines have identical content; this hunk is a whitespace-only change that adds the previously missing newline at the end of the file.
