Merged
4 changes: 2 additions & 2 deletions .github/workflows/test_lemonade.yml
@@ -34,7 +34,7 @@ jobs:
shell: bash -el {0}
run: |
python -m pip install --upgrade pip
conda install pylint
pip install pylint
python -m pip check
pip install -e .[llm]
- name: Lint with Black
@@ -46,7 +46,7 @@ jobs:
shell: bash -el {0}
run: |
pylint src/lemonade --rcfile .pylintrc --disable E0401
pylint examples --rcfile .pylintrc --disable E0401,E0611 --jobs=1
pylint examples --rcfile .pylintrc --disable E0401,E0611,F0010 --jobs=1 -v
- name: Run lemonade tests
shell: bash -el {0}
run: |
11 changes: 0 additions & 11 deletions .pylintrc
@@ -76,7 +76,6 @@ enable =
expression-not-assigned,
confusing-with-statement,
unnecessary-lambda,
assign-to-new-keyword,
redeclared-assigned-name,
pointless-statement,
pointless-string-statement,
@@ -118,7 +117,6 @@ enable =
invalid-length-returned,
protected-access,
attribute-defined-outside-init,
no-init,
abstract-method,
invalid-overridden-method,
# arguments-differ,
@@ -160,9 +158,7 @@ enable =
### format
# Line length, indentation, whitespace:
bad-indentation,
mixed-indentation,
unnecessary-semicolon,
bad-whitespace,
missing-final-newline,
line-too-long,
mixed-line-endings,
@@ -182,7 +178,6 @@ enable =
import-self,
preferred-module,
reimported,
relative-import,
deprecated-module,
wildcard-import,
misplaced-future,
@@ -277,12 +272,6 @@ indent-string = '    '
# black doesn't always obey its own limit. See pyproject.toml.
max-line-length = 100

# List of optional constructs for which whitespace checking is disabled. `dict-
# separator` is used to allow tabulation in dicts, etc.: {1 : 1,\n222: 2}.
# `trailing-comma` allows a space between comma and closing bracket: (a, ).
# `empty-line` allows space-only lines.
no-space-check =

# Allow the body of a class to be on the same line as the declaration if body
# contains single statement.
single-line-class-stmt = no
4 changes: 2 additions & 2 deletions README.md
@@ -12,7 +12,7 @@ We are on a mission to make it easy to use the most important tools in the ONNX
The [Lemonade SDK](https://github.com/onnx/turnkeyml/blob/main/docs/lemonade/README.md) is designed to make it easy to serve, benchmark, and deploy large language models (LLMs) on a variety of hardware platforms, including CPU, GPU, and NPU.

<div align="center">
<img src="https://github.com/user-attachments/assets/83dd6563-f970-414c-bb8c-4f08a0bc4bfa" alt="Lemonade Demo" title="Lemonade in Action">
<img src="https://download.amd.com/images/lemonade_640x480_1.gif" alt="Lemonade Demo" title="Lemonade in Action">
</div>

The [Lemonade SDK](https://github.com/onnx/turnkeyml/blob/main/docs/lemonade/README.md) is composed of the following:
@@ -38,7 +38,7 @@ turnkey -h
```

<div align="center">
<img src="https://github.com/user-attachments/assets/a1461dc4-4dac-40ca-95da-9c62e47cec24" alt="Turnkey Demo" title="Turnkey CLI">
<img src="https://download.amd.com/images/tkml_640x480_1.gif" alt="Turnkey Demo" title="Turnkey CLI">
</div>

### [Click here to get started with `turnkey`.](https://github.com/onnx/turnkeyml/blob/main/docs/turnkey/README.md)
11 changes: 11 additions & 0 deletions docs/contribute.md
@@ -17,6 +17,17 @@ The guidelines document is organized as the following sections:
- [PyPI Release Process](#pypi-release-process)
- [Public APIs](#public-apis)

## 🍋 Contributing a Lemonade Server Demo

Lemonade Server demos should be reproducible in under 10 minutes, require no code changes to the app being integrated, and feature an app that supports the OpenAI API with a configurable base URL.

Please see [AI Toolkit ReadMe](https://github.com/onnx/turnkeyml/blob/main/examples/lemonade/server/ai-toolkit.md) for an example Markdown contribution.

- To submit your example, open a pull request in the TurnkeyML GitHub repo that:
  - Adds your `.md` file to the [Server Examples](https://github.com/onnx/turnkeyml/tree/main/examples/lemonade/server) folder.
  - Assigns the PR to the maintainers.

We’re excited to see what you build! If you’re unsure about your idea or need help unblocking an integration, feel free to reach out via GitHub Issues or [email](mailto:turnkeyml@amd.com).

## Contributing a model

55 changes: 43 additions & 12 deletions docs/lemonade/README.md
@@ -6,14 +6,46 @@ Lemonade SDK is built on top of [OnnxRuntime GenAI (OGA)](https://github.com/mic

The Lemonade SDK provides everything needed to get up and running quickly with LLMs on OGA:

- [Quick installation from PyPI](#install).
- [CLI with tools for prompting, benchmarking, and accuracy tests](#cli-commands).
- [REST API with OpenAI compatibility](#serving).
- [Python API based on `from_pretrained()` for easy integration with Python apps](#api).
| **Feature** | **Description** |
|------------------------------------------|-----------------------------------------------------------------------------------------------------|
| **🌐 Local LLM server with OpenAI API compatibility (Lemonade Server)** | Replace cloud-based LLMs with private and free LLMs that run locally on your own PC's NPU and GPU. |
| **🖥️ CLI with tools for prompting, benchmarking, and accuracy tests** | Enables convenient interoperability between models, frameworks, devices, accuracy tests, and deployment options. |
| **🐍 Python API based on `from_pretrained()`** | Provides easy integration with Python applications for loading and using LLMs. |

# Install

You can quickly get started with Lemonade by installing the `turnkeyml` [PyPI package](#installing-from-pypi) with the appropriate extras for your backend, [install from source](#installing-from-source) by cloning and installing this repository, or [with GUI installation for Lemonade Server](#installing-from-lemonade_server_installerexe).
## Table of Contents

- [Installation](#installation)
- [Installing Lemonade Server via Executable](#installing-from-lemonade_server_installerexe)
- [Installing Lemonade SDK From PyPI](#installing-from-pypi)
- [Installing Lemonade SDK From Source](#installing-from-source)
- [CLI Commands](#cli-commands)
- [Prompting](#prompting)
- [Accuracy](#accuracy)
- [Benchmarking](#benchmarking)
- [LLM Report](#llm-report)
- [Memory Usage](#memory-usage)
- [Serving](#serving)
- [API](#api)
- [High-Level APIs](#high-level-apis)
- [Low-Level API](#low-level-api)
- [Contributing](#contributing)


# Installation

There are three ways to install the Lemonade SDK:

1. Use the [Lemonade Server Installer](#installing-from-lemonade_server_installerexe). This provides a no-code way to run LLMs locally and integrate with OpenAI-compatible applications.
1. Use [PyPI installation](#installing-from-pypi) by installing the `turnkeyml` package with the appropriate extras for your backend. This will install the full set of Turnkey and Lemonade SDK tools, including Lemonade Server, API, and CLI commands.
1. Use [source installation](#installing-from-source) if you plan to contribute or customize the Lemonade SDK.


## Installing From Lemonade_Server_Installer.exe

The Lemonade Server is available as a standalone tool with a one-click Windows installer `.exe`. Check out the [Lemonade_Server_Installer.exe guide](lemonade_server_exe.md) for installation instructions and the [server spec](https://github.com/onnx/turnkeyml/blob/main/docs/lemonade/server_spec.md) to learn more about the functionality.

The Lemonade Server [examples folder](https://github.com/onnx/turnkeyml/tree/main/examples/lemonade/server) has guides for how to use Lemonade Server with a collection of applications that we have tested.

## Installing From PyPI

@@ -54,13 +86,10 @@ To install the Lemonade SDK from PyPI:

The Lemonade SDK can be installed from source code by cloning this repository and following the instructions [here](source_installation_inst.md).

## Installing From Lemonade_Server_Installer.exe

The Lemonade Server is available as a standalone tool with a one-click Windows installer `.exe`. Check out the [Lemonade_Server_Installer.exe guide](lemonade_server_exe.md) for installation instructions and the [server spec](https://github.com/onnx/turnkeyml/blob/main/docs/lemonade/server_spec.md) to learn more about the functionality.

# CLI Commands

The `lemonade` CLI uses a unique command syntax that enables convenient interoperability between models, frameworks, devices, accuracy tests, and deployment options.
The `lemonade` CLI uses a unique command syntax that enables convenient interoperability between models, frameworks, devices, accuracy tests, and deployment options.

Each unit of functionality (e.g., loading a model, running a test, deploying a server, etc.) is called a `Tool`, and a single call to `lemonade` can invoke any number of `Tools`. Each `Tool` will perform its functionality, then pass its state to the next `Tool` in the command.
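The state-passing model described above can be sketched conceptually. This is an illustrative sketch only, not the actual `lemonade` implementation; the class and field names below are hypothetical:

```python
# Conceptual sketch of lemonade's Tool-chaining model (hypothetical classes,
# not the real lemonade source): each Tool transforms a shared state dict
# and hands it to the next Tool invoked by the same command.

class Tool:
    def run(self, state: dict) -> dict:
        raise NotImplementedError

class LoadModel(Tool):
    def run(self, state):
        # A real load Tool would attach a model object to the state
        state["model"] = f"loaded:{state['checkpoint']}"
        return state

class Benchmark(Tool):
    def run(self, state):
        # A real benchmark Tool would measure the loaded model
        state["tokens_per_second"] = 42.0  # placeholder measurement
        return state

def run_pipeline(state: dict, tools: list) -> dict:
    # One CLI invocation = one pipeline of Tools sharing a single state
    for tool in tools:
        state = tool.run(state)
    return state

result = run_pipeline(
    {"checkpoint": "microsoft/Phi-3-mini-4k-instruct"},
    [LoadModel(), Benchmark()],
)
```

The key design point is that every Tool reads and extends the same state, so any Tool can follow any other as long as the state it needs is already present.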

Expand Down Expand Up @@ -174,13 +203,15 @@ You can launch an OpenAI-compatible server with:

Visit the [server spec](https://github.com/onnx/turnkeyml/blob/main/docs/lemonade/server_spec.md) to learn more about the endpoints provided as well as how to launch the server with more detailed informational messages enabled.

See the Lemonade Server [examples folder](https://github.com/onnx/turnkeyml/tree/main/examples/lemonade/server) for a collection of applications that we have tested with Lemonade Server.
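Because the server speaks the OpenAI API, any OpenAI-style client can talk to it. The sketch below builds a chat-completion request with the standard library; the base URL, route, and model name are assumptions for illustration, so check the server spec for the actual port and endpoints:

```python
import json
from urllib.request import Request, urlopen

# Sketch of calling an OpenAI-compatible chat endpoint on a local server.
# BASE_URL and the model name are assumptions, not confirmed defaults --
# consult the Lemonade Server spec for the real values.
BASE_URL = "http://localhost:8000/api/v0"

payload = {
    "model": "Llama-3.2-1B-Instruct-Hybrid",  # any installed model
    "messages": [
        {"role": "user", "content": "Say hello in one sentence."},
    ],
}

request = Request(
    f"{BASE_URL}/chat/completions",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)

# Uncomment once a server is running locally:
# with urlopen(request) as response:
#     reply = json.load(response)
#     print(reply["choices"][0]["message"]["content"])
```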

# API

Lemonade is also available via API.
Lemonade is also available via API.

## High-Level APIs

The high-level Lemonade API abstracts loading models from any supported framework (e.g., Hugging Face, OGA) and backend (e.g., CPU, iGPU, Hybrid) using the popular `from_pretrained()` function. This makes it easy to integrate Lemonade LLMs into Python applications.
The high-level Lemonade API abstracts loading models from any supported framework (e.g., Hugging Face, OGA) and backend (e.g., CPU, iGPU, Hybrid) using the popular `from_pretrained()` function. This makes it easy to integrate Lemonade LLMs into Python applications. For more information on recipes and compatibility, see the [Lemonade API ReadMe](https://github.com/onnx/turnkeyml/blob/main/docs/lemonade/lemonade_api.md).

OGA iGPU:
```python
123 changes: 123 additions & 0 deletions docs/lemonade/lemonade_api.md
@@ -0,0 +1,123 @@
# 🍋 Lemonade API: Model Compatibility and Recipes

Lemonade API (`lemonade.api`) provides a simple, high-level interface to load and run LLMs locally. This guide helps you understand which models work with which **recipes**, what to expect in terms of compatibility, and how to choose the right setup for your hardware.

## 🧠 What Is a Recipe?

A **recipe** defines how a model is run, including the backend (e.g., PyTorch, ONNX Runtime), quantization strategy, and device support. The `from_pretrained()` function in `lemonade.api` uses the recipe to configure everything automatically. For the list of recipes, see the [Recipe Compatibility Table](#-recipe-and-checkpoint-compatibility). The following is an example of using the Lemonade API `from_pretrained()` function:

```python
from lemonade.api import from_pretrained

model, tokenizer = from_pretrained("Qwen/Qwen2.5-0.5B-Instruct", recipe="hf-cpu")
```

Function arguments:

- `checkpoint`: The Hugging Face or OGA checkpoint that defines the LLM.
- `recipe`: Defines the implementation and hardware used for the LLM. Defaults to `"hf-cpu"`.
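When wiring recipes into an application, it can help to validate the recipe string against the documented set before calling `from_pretrained()`. The helper below is hypothetical, not part of `lemonade.api`; the recipe names come from the compatibility table in this guide:

```python
# Hypothetical convenience helper (not part of lemonade.api): validate a
# recipe string against the recipes documented in the compatibility table
# before handing it to from_pretrained().
KNOWN_RECIPES = {
    "hf-cpu",      # Hugging Face on any x86 CPU (default)
    "hf-dgpu",     # Hugging Face on a compatible discrete GPU
    "oga-cpu",     # OGA on any x86 CPU
    "oga-igpu",    # OGA on an AMD Ryzen AI integrated GPU
    "oga-hybrid",  # OGA hybrid on Ryzen AI 300 series
    "oga-npu",     # OGA on a Ryzen AI 300 series NPU
}

def check_recipe(recipe: str) -> str:
    """Return the recipe unchanged, or raise if it is not a known name."""
    if recipe not in KNOWN_RECIPES:
        raise ValueError(
            f"Unknown recipe {recipe!r}; expected one of {sorted(KNOWN_RECIPES)}"
        )
    return recipe

# Usage with the Lemonade API (requires the Lemonade SDK to be installed):
# from lemonade.api import from_pretrained
# model, tokenizer = from_pretrained(
#     "Qwen/Qwen2.5-0.5B-Instruct", recipe=check_recipe("hf-cpu")
# )
```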


## 📜 Supported Model Formats

Lemonade API currently supports:

- Hugging Face hosted **safetensors** checkpoints
- AMD **OGA** (ONNXRuntime-GenAI) ONNX checkpoints

## 🍴 Recipe and Checkpoint Compatibility

The following table explains what checkpoints work with each recipe, the hardware and OS requirements, and additional notes:

<table>
<tr>
<th>Recipe</th>
<th>Checkpoint Format</th>
<th>Hardware Needed</th>
<th>Operating System</th>
<th>Notes</th>
</tr>
<tr>
<td><code>hf-cpu</code></td>
<td>safetensors (Hugging Face)</td>
<td>Any x86 CPU</td>
<td>Windows, Linux</td>
<td>Compatible with x86 CPUs, offering broad accessibility.</td>
</tr>
<tr>
<td><code>hf-dgpu</code></td>
<td>safetensors (Hugging Face)</td>
<td>Compatible Discrete GPU</td>
<td>Windows, Linux</td>
<td>Requires PyTorch and a compatible GPU.<sup>[1]</sup></td>
</tr>
<tr>
<td rowspan="2"><code>oga-cpu</code></td>
<td>safetensors (Hugging Face)</td>
<td>Any x86 CPU</td>
<td>Windows</td>
<td>Converted from safetensors via `model_builder`. Accuracy loss due to RTN quantization.</td>
</tr>
<tr>
<td>OGA ONNX</td>
<td>Any x86 CPU</td>
<td>Windows</td>
<td>Use models from the <a href="https://huggingface.co/collections/amd/oga-cpu-llm-collection-6808280dc18d268d57353be8">CPU Collection.</a></td>
</tr>
<tr>
<td rowspan="2"><code>oga-igpu</code></td>
<td>safetensors (Hugging Face)</td>
<td>AMD Ryzen AI PC</td>
<td>Windows</td>
<td>Converted from safetensors via `model_builder`. Accuracy loss due to RTN quantization.</td>
</tr>
<tr>
<td>OGA ONNX</td>
<td>AMD Ryzen AI PC</td>
<td>Windows</td>
<td>Use models from the <a href="https://huggingface.co/collections/amd/ryzenai-oga-dml-models-67f940914eee51cbd794b95b">GPU Collection.</a></td>
</tr>
<tr>
<td><code>oga-hybrid</code></td>
<td>Pre-quantized OGA ONNX</td>
<td>AMD Ryzen AI 300 series PC</td>
<td>Windows</td>
<td>Use models from the <a href="https://huggingface.co/collections/amd/ryzenai-14-llm-hybrid-models-67da31231bba0f733750a99c">Hybrid Collection</a>. Optimized with AWQ to INT4.</td>
</tr>
<tr>
<td><code>oga-npu</code></td>
<td>Pre-quantized OGA ONNX</td>
<td>AMD Ryzen AI 300 series PC</td>
<td>Windows</td>
<td>Use models from the <a href="https://huggingface.co/collections/amd/ryzenai-14-llm-npu-models-67da3494ec327bd3aa3c83d7">NPU Collection</a>. Optimized with AWQ to INT4.</td>
</tr>
</table>

<sup>[1]</sup> Compatible GPUs are those that support PyTorch's `.to("cuda")` function. Ensure you have the appropriate version of PyTorch installed (e.g., CUDA or ROCm) for your specific GPU. **Note**: Lemonade does not install PyTorch with CUDA or ROCm for you. For installation instructions, see [PyTorch's Get Started Guide](https://pytorch.org/get-started/locally/).

## 🔄 Converting Models to OGA

The Lemonade API will do the conversion for you using OGA's `model_builder` if you pass a safetensors checkpoint.

- Takes ~1–5 minutes per model.
- Uses RTN quantization (int4).
- For better quality, use pre-quantized models (see below).


## 📦 Pre-Converted OGA Models

You can skip the conversion step by using pre-quantized models from AMD’s Hugging Face collection. These models are optimized using **Activation Aware Quantization (AWQ)**, which provides higher-accuracy int4 quantization compared to RTN.

| Recipe | Collection |
| ------------ | ----------------------------------------------------------------------------------------------------------------------------------------------- |
| `oga-hybrid` | [Hybrid Collection](https://huggingface.co/collections/amd/ryzenai-14-llm-hybrid-models-67da31231bba0f733750a99c) |
| `oga-npu` | [NPU Collection](https://huggingface.co/collections/amd/ryzenai-14-llm-npu-models-67da3494ec327bd3aa3c83d7) |
| `oga-cpu` | [CPU Collection](https://huggingface.co/collections/amd/oga-cpu-llm-collection-6808280dc18d268d57353be8) |
| `oga-dml` | [GPU Collection](https://huggingface.co/collections/amd/ryzenai-oga-dml-models-67f940914eee51cbd794b95b) |


## 📚 Additional Resources

- [Lemonade API Examples](https://github.com/onnx/turnkeyml/blob/main/examples/lemonade#api-examples)
- [lemonade.api source](https://github.com/onnx/turnkeyml/blob/main/src/lemonade/api.py)
- [Model Support Matrix (ONNX Runtime GenAI)](https://github.com/microsoft/onnxruntime-genai)

9 changes: 1 addition & 8 deletions docs/lemonade/server_integration.md
@@ -126,14 +126,7 @@ Only `Qwen2.5-0.5B-Instruct-CPU` is installed by default in silent mode. If you
Lemonade_Server_Installer.exe /S /Extras=hybrid /Models="Qwen2.5-0.5B-Instruct-CPU Llama-3.2-1B-Instruct-Hybrid"
```

The available modes are the following:
* `Qwen2.5-0.5B-Instruct-CPU`
* `Llama-3.2-1B-Instruct-Hybrid`
* `Llama-3.2-3B-Instruct-Hybrid`
* `Phi-3-Mini-Instruct-Hybrid`
* `Qwen-1.5-7B-Chat-Hybrid`
* `DeepSeek-R1-Distill-Llama-8B-Hybrid`
* `DeepSeek-R1-Distill-Qwen-7B-Hybrid`
The available models are documented [here](https://github.com/onnx/turnkeyml/blob/main/docs/lemonade/server_models.md).

Finally, if you don't want to create a desktop shortcut during installation, use the `/NoDesktopShortcut` parameter:
