Merged
4 changes: 2 additions & 2 deletions .github/workflows/test_lemonade.yml
@@ -34,7 +34,7 @@ jobs:
shell: bash -el {0}
run: |
python -m pip install --upgrade pip
conda install pylint
pip install pylint
python -m pip check
pip install -e .[llm]
- name: Lint with Black
@@ -46,7 +46,7 @@ jobs:
shell: bash -el {0}
run: |
pylint src/lemonade --rcfile .pylintrc --disable E0401
pylint examples --rcfile .pylintrc --disable E0401,E0611 --jobs=1
pylint examples --rcfile .pylintrc --disable E0401,E0611,F0010 --jobs=1 -v
- name: Run lemonade tests
shell: bash -el {0}
run: |
11 changes: 0 additions & 11 deletions .pylintrc
@@ -76,7 +76,6 @@ enable =
expression-not-assigned,
confusing-with-statement,
unnecessary-lambda,
assign-to-new-keyword,
redeclared-assigned-name,
pointless-statement,
pointless-string-statement,
@@ -118,7 +117,6 @@ enable =
invalid-length-returned,
protected-access,
attribute-defined-outside-init,
no-init,
abstract-method,
invalid-overridden-method,
# arguments-differ,
@@ -160,9 +158,7 @@ enable =
### format
# Line length, indentation, whitespace:
bad-indentation,
mixed-indentation,
unnecessary-semicolon,
bad-whitespace,
missing-final-newline,
line-too-long,
mixed-line-endings,
@@ -182,7 +178,6 @@ enable =
import-self,
preferred-module,
reimported,
relative-import,
deprecated-module,
wildcard-import,
misplaced-future,
@@ -277,12 +272,6 @@ indent-string = '    '
# black doesn't always obey its own limit. See pyproject.toml.
max-line-length = 100

# List of optional constructs for which whitespace checking is disabled. `dict-
# separator` is used to allow tabulation in dicts, etc.: {1 : 1,\n222: 2}.
# `trailing-comma` allows a space between comma and closing bracket: (a, ).
# `empty-line` allows space-only lines.
no-space-check =

# Allow the body of a class to be on the same line as the declaration if body
# contains single statement.
single-line-class-stmt = no
4 changes: 2 additions & 2 deletions README.md
@@ -12,7 +12,7 @@ We are on a mission to make it easy to use the most important tools in the ONNX
The [Lemonade SDK](https://github.com/onnx/turnkeyml/blob/main/docs/lemonade/README.md) is designed to make it easy to serve, benchmark, and deploy large language models (LLMs) on a variety of hardware platforms, including CPU, GPU, and NPU.

<div align="center">
<img src="https://github.com/user-attachments/assets/83dd6563-f970-414c-bb8c-4f08a0bc4bfa" alt="Lemonade Demo" title="Lemonade in Action">
<img src="https://download.amd.com/images/lemonade_640x480_1.gif" alt="Lemonade Demo" title="Lemonade in Action">
</div>

The [Lemonade SDK](https://github.com/onnx/turnkeyml/blob/main/docs/lemonade/README.md) is composed of the following:
@@ -38,7 +38,7 @@ turnkey -h
```

<div align="center">
<img src="https://github.com/user-attachments/assets/a1461dc4-4dac-40ca-95da-9c62e47cec24" alt="Turnkey Demo" title="Turnkey CLI">
<img src="https://download.amd.com/images/tkml_640x480_1.gif" alt="Turnkey Demo" title="Turnkey CLI">
</div>

### [Click here to get started with `turnkey`.](https://github.com/onnx/turnkeyml/blob/main/docs/turnkey/README.md)
11 changes: 11 additions & 0 deletions docs/contribute.md
@@ -17,6 +17,17 @@ The guidelines document is organized as the following sections:
- [PyPI Release Process](#pypi-release-process)
- [Public APIs](#public-apis)

## 🍋 Contributing a Lemonade Server Demo

Lemonade Server demos should be reproducible in under 10 minutes, require no code changes to the app being integrated, and feature an app that supports the OpenAI API with a configurable base URL.

Please see [AI Toolkit ReadMe](https://github.com/onnx/turnkeyml/blob/main/examples/lemonade/server/ai-toolkit.md) for an example Markdown contribution.

- To submit your example, open a pull request in the TurnkeyML GitHub repo that:
  - Adds your `.md` file to the [Server Examples](https://github.com/onnx/turnkeyml/tree/main/examples/lemonade/server) folder.
  - Assigns the PR to the maintainers.

We’re excited to see what you build! If you’re unsure about your idea or need help unblocking an integration, feel free to reach out via GitHub Issues or [email](mailto:turnkeyml@amd.com).

## Contributing a model

55 changes: 43 additions & 12 deletions docs/lemonade/README.md
@@ -6,14 +6,46 @@ Lemonade SDK is built on top of [OnnxRuntime GenAI (OGA)](https://github.com/mic

The Lemonade SDK provides everything needed to get up and running quickly with LLMs on OGA:

- [Quick installation from PyPI](#install).
- [CLI with tools for prompting, benchmarking, and accuracy tests](#cli-commands).
- [REST API with OpenAI compatibility](#serving).
- [Python API based on `from_pretrained()` for easy integration with Python apps](#api).
| **Feature** | **Description** |
|------------------------------------------|-----------------------------------------------------------------------------------------------------|
| **🌐 Local LLM server with OpenAI API compatibility (Lemonade Server)** | Replace cloud-based LLMs with private and free LLMs that run locally on your own PC's NPU and GPU. |
| **🖥️ CLI with tools for prompting, benchmarking, and accuracy tests** | Enables convenient interoperability between models, frameworks, devices, accuracy tests, and deployment options. |
| **🐍 Python API based on `from_pretrained()`** | Provides easy integration with Python applications for loading and using LLMs. |

# Install

You can quickly get started with Lemonade by installing the `turnkeyml` [PyPI package](#installing-from-pypi) with the appropriate extras for your backend, [install from source](#installing-from-source) by cloning and installing this repository, or [with GUI installation for Lemonade Server](#installing-from-lemonade_server_installerexe).
## Table of Contents

- [Installation](#installation)
- [Installing Lemonade Server via Executable](#installing-from-lemonade_server_installerexe)
- [Installing Lemonade SDK From PyPI](#installing-from-pypi)
- [Installing Lemonade SDK From Source](#installing-from-source)
- [CLI Commands](#cli-commands)
- [Prompting](#prompting)
- [Accuracy](#accuracy)
- [Benchmarking](#benchmarking)
- [LLM Report](#llm-report)
- [Memory Usage](#memory-usage)
- [Serving](#serving)
- [API](#api)
- [High-Level APIs](#high-level-apis)
- [Low-Level API](#low-level-api)
- [Contributing](#contributing)


# Installation

There are three ways to install the Lemonade SDK:

1. Use the [Lemonade Server Installer](#installing-from-lemonade_server_installerexe). This provides a no-code way to run LLMs locally and integrate with OpenAI-compatible applications.
1. Use [PyPI installation](#installing-from-pypi) by installing the `turnkeyml` package with the appropriate extras for your backend. This will install the full set of Turnkey and Lemonade SDK tools, including Lemonade Server, API, and CLI commands.
1. Use [source installation](#installing-from-source) if you plan to contribute or customize the Lemonade SDK.


## Installing From Lemonade_Server_Installer.exe

The Lemonade Server is available as a standalone tool with a one-click Windows installer `.exe`. Check out the [Lemonade_Server_Installer.exe guide](lemonade_server_exe.md) for installation instructions and the [server spec](https://github.com/onnx/turnkeyml/blob/main/docs/lemonade/server_spec.md) to learn more about the functionality.

The Lemonade Server [examples folder](https://github.com/onnx/turnkeyml/tree/main/examples/lemonade/server) has guides for how to use Lemonade Server with a collection of applications that we have tested.

## Installing From PyPI

@@ -54,13 +86,10 @@ To install the Lemonade SDK from PyPI:

The Lemonade SDK can be installed from source code by cloning this repository and following the instructions [here](source_installation_inst.md).

## Installing From Lemonade_Server_Installer.exe

The Lemonade Server is available as a standalone tool with a one-click Windows installer `.exe`. Check out the [Lemonade_Server_Installer.exe guide](lemonade_server_exe.md) for installation instructions and the [server spec](https://github.com/onnx/turnkeyml/blob/main/docs/lemonade/server_spec.md) to learn more about the functionality.

# CLI Commands

The `lemonade` CLI uses a unique command syntax that enables convenient interoperability between models, frameworks, devices, accuracy tests, and deployment options.
The `lemonade` CLI uses a unique command syntax that enables convenient interoperability between models, frameworks, devices, accuracy tests, and deployment options.

Each unit of functionality (e.g., loading a model, running a test, deploying a server, etc.) is called a `Tool`, and a single call to `lemonade` can invoke any number of `Tools`. Each `Tool` will perform its functionality, then pass its state to the next `Tool` in the command.
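The state-passing model described above can be sketched conceptually. This is an illustrative sketch only, not the actual `lemonade` implementation; the class and field names below are hypothetical:

```python
# Conceptual sketch of lemonade's Tool-chaining model (hypothetical classes,
# not the real lemonade source): each Tool transforms a shared state dict
# and hands it to the next Tool invoked by the same command.

class Tool:
    def run(self, state: dict) -> dict:
        raise NotImplementedError

class LoadModel(Tool):
    def run(self, state):
        # A real load Tool would attach a model object to the state
        state["model"] = f"loaded:{state['checkpoint']}"
        return state

class Benchmark(Tool):
    def run(self, state):
        # A real benchmark Tool would measure the loaded model
        state["tokens_per_second"] = 42.0  # placeholder measurement
        return state

def run_pipeline(state: dict, tools: list) -> dict:
    # One CLI invocation = one pipeline of Tools sharing a single state
    for tool in tools:
        state = tool.run(state)
    return state

result = run_pipeline(
    {"checkpoint": "microsoft/Phi-3-mini-4k-instruct"},
    [LoadModel(), Benchmark()],
)
```

The key design point is that every Tool reads and extends the same state, so any Tool can follow any other as long as the state it needs is already present.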

Expand Down Expand Up @@ -174,13 +203,15 @@ You can launch an OpenAI-compatible server with:

Visit the [server spec](https://github.com/onnx/turnkeyml/blob/main/docs/lemonade/server_spec.md) to learn more about the endpoints provided as well as how to launch the server with more detailed informational messages enabled.

See the Lemonade Server [examples folder](https://github.com/onnx/turnkeyml/tree/main/examples/lemonade/server) for a collection of applications that we have tested with Lemonade Server.
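Because the server speaks the OpenAI API, any OpenAI-style client can talk to it. The sketch below builds a chat-completion request with the standard library; the base URL, route, and model name are assumptions for illustration, so check the server spec for the actual port and endpoints:

```python
import json
from urllib.request import Request, urlopen

# Sketch of calling an OpenAI-compatible chat endpoint on a local server.
# BASE_URL and the model name are assumptions, not confirmed defaults --
# consult the Lemonade Server spec for the real values.
BASE_URL = "http://localhost:8000/api/v0"

payload = {
    "model": "Llama-3.2-1B-Instruct-Hybrid",  # any installed model
    "messages": [
        {"role": "user", "content": "Say hello in one sentence."},
    ],
}

request = Request(
    f"{BASE_URL}/chat/completions",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)

# Uncomment once a server is running locally:
# with urlopen(request) as response:
#     reply = json.load(response)
#     print(reply["choices"][0]["message"]["content"])
```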

# API

Lemonade is also available via API.
Lemonade is also available via API.

## High-Level APIs

The high-level Lemonade API abstracts loading models from any supported framework (e.g., Hugging Face, OGA) and backend (e.g., CPU, iGPU, Hybrid) using the popular `from_pretrained()` function. This makes it easy to integrate Lemonade LLMs into Python applications.
The high-level Lemonade API abstracts loading models from any supported framework (e.g., Hugging Face, OGA) and backend (e.g., CPU, iGPU, Hybrid) using the popular `from_pretrained()` function. This makes it easy to integrate Lemonade LLMs into Python applications. For more information on recipes and compatibility, see the [Lemonade API ReadMe](https://github.com/onnx/turnkeyml/blob/main/docs/lemonade/lemonade_api.md).

OGA iGPU:
```python
123 changes: 123 additions & 0 deletions docs/lemonade/lemonade_api.md
@@ -0,0 +1,123 @@
# 🍋 Lemonade API: Model Compatibility and Recipes

Lemonade API (`lemonade.api`) provides a simple, high-level interface to load and run LLMs locally. This guide helps you understand which models work with which **recipes**, what to expect in terms of compatibility, and how to choose the right setup for your hardware.

## 🧠 What Is a Recipe?

A **recipe** defines how a model is run, including the backend (e.g., PyTorch, ONNX Runtime), quantization strategy, and device support. The `from_pretrained()` function in `lemonade.api` uses the recipe to configure everything automatically. For the list of recipes, see the [Recipe Compatibility Table](#-recipe-and-checkpoint-compatibility). The following is an example of using the Lemonade API `from_pretrained()` function:

```python
from lemonade.api import from_pretrained

model, tokenizer = from_pretrained("Qwen/Qwen2.5-0.5B-Instruct", recipe="hf-cpu")
```

Function arguments:

- `checkpoint`: The Hugging Face or OGA checkpoint that defines the LLM.
- `recipe`: Defines the implementation and hardware used for the LLM. Defaults to `"hf-cpu"`.
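When wiring recipes into an application, it can help to validate the recipe string against the documented set before calling `from_pretrained()`. The helper below is hypothetical, not part of `lemonade.api`; the recipe names come from the compatibility table in this guide:

```python
# Hypothetical convenience helper (not part of lemonade.api): validate a
# recipe string against the recipes documented in the compatibility table
# before handing it to from_pretrained().
KNOWN_RECIPES = {
    "hf-cpu",      # Hugging Face on any x86 CPU (default)
    "hf-dgpu",     # Hugging Face on a compatible discrete GPU
    "oga-cpu",     # OGA on any x86 CPU
    "oga-igpu",    # OGA on an AMD Ryzen AI integrated GPU
    "oga-hybrid",  # OGA hybrid on Ryzen AI 300 series
    "oga-npu",     # OGA on a Ryzen AI 300 series NPU
}

def check_recipe(recipe: str) -> str:
    """Return the recipe unchanged, or raise if it is not a known name."""
    if recipe not in KNOWN_RECIPES:
        raise ValueError(
            f"Unknown recipe {recipe!r}; expected one of {sorted(KNOWN_RECIPES)}"
        )
    return recipe

# Usage with the Lemonade API (requires the Lemonade SDK to be installed):
# from lemonade.api import from_pretrained
# model, tokenizer = from_pretrained(
#     "Qwen/Qwen2.5-0.5B-Instruct", recipe=check_recipe("hf-cpu")
# )
```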


## 📜 Supported Model Formats

Lemonade API currently supports:

- Hugging Face hosted **safetensors** checkpoints
- AMD **OGA** (ONNXRuntime-GenAI) ONNX checkpoints

## 🍴 Recipe and Checkpoint Compatibility

The following table explains what checkpoints work with each recipe, the hardware and OS requirements, and additional notes:

<table>
<tr>
<th>Recipe</th>
<th>Checkpoint Format</th>
<th>Hardware Needed</th>
<th>Operating System</th>
<th>Notes</th>
</tr>
<tr>
<td><code>hf-cpu</code></td>
<td>safetensors (Hugging Face)</td>
<td>Any x86 CPU</td>
<td>Windows, Linux</td>
<td>Compatible with x86 CPUs, offering broad accessibility.</td>
</tr>
<tr>
<td><code>hf-dgpu</code></td>
<td>safetensors (Hugging Face)</td>
<td>Compatible Discrete GPU</td>
<td>Windows, Linux</td>
<td>Requires PyTorch and a compatible GPU.<sup>[1]</sup></td>
</tr>
<tr>
<td rowspan="2"><code>oga-cpu</code></td>
<td>safetensors (Hugging Face)</td>
<td>Any x86 CPU</td>
<td>Windows</td>
<td>Converted from safetensors via `model_builder`. Accuracy loss due to RTN quantization.</td>
</tr>
<tr>
<td>OGA ONNX</td>
<td>Any x86 CPU</td>
<td>Windows</td>
<td>Use models from the <a href="https://huggingface.co/collections/amd/oga-cpu-llm-collection-6808280dc18d268d57353be8">CPU Collection.</a></td>
</tr>
<tr>
<td rowspan="2"><code>oga-igpu</code></td>
<td>safetensors (Hugging Face)</td>
<td>AMD Ryzen AI PC</td>
<td>Windows</td>
<td>Converted from safetensors via `model_builder`. Accuracy loss due to RTN quantization.</td>
</tr>
<tr>
<td>OGA ONNX</td>
<td>AMD Ryzen AI PC</td>
<td>Windows</td>
<td>Use models from the <a href="https://huggingface.co/collections/amd/ryzenai-oga-dml-models-67f940914eee51cbd794b95b">GPU Collection.</a></td>
</tr>
<tr>
<td><code>oga-hybrid</code></td>
<td>Pre-quantized OGA ONNX</td>
<td>AMD Ryzen AI 300 series PC</td>
<td>Windows</td>
<td>Use models from the <a href="https://huggingface.co/collections/amd/ryzenai-14-llm-hybrid-models-67da31231bba0f733750a99c">Hybrid Collection</a>. Optimized with AWQ to INT4.</td>
</tr>
<tr>
<td><code>oga-npu</code></td>
<td>Pre-quantized OGA ONNX</td>
<td>AMD Ryzen AI 300 series PC</td>
<td>Windows</td>
<td>Use models from the <a href="https://huggingface.co/collections/amd/ryzenai-14-llm-npu-models-67da3494ec327bd3aa3c83d7">NPU Collection</a>. Optimized with AWQ to INT4.</td>
</tr>
</table>

<sup>[1]</sup> Compatible GPUs are those that support PyTorch's `.to("cuda")` function. Ensure you have the appropriate version of PyTorch installed (e.g., CUDA or ROCm) for your specific GPU. **Note**: Lemonade does not install PyTorch with CUDA or ROCm for you. For installation instructions, see [PyTorch's Get Started Guide](https://pytorch.org/get-started/locally/).

## 🔄 Converting Models to OGA

The Lemonade API will do the conversion for you using OGA's `model_builder` if you pass a safetensors checkpoint.

- Takes ~1–5 minutes per model.
- Uses RTN quantization (int4).
- For better quality, use pre-quantized models (see below).


## 📦 Pre-Converted OGA Models

You can skip the conversion step by using pre-quantized models from AMD’s Hugging Face collection. These models are optimized using **Activation Aware Quantization (AWQ)**, which provides higher-accuracy int4 quantization compared to RTN.

| Recipe | Collection |
| ------------ | ----------------------------------------------------------------------------------------------------------------------------------------------- |
| `oga-hybrid` | [Hybrid Collection](https://huggingface.co/collections/amd/ryzenai-14-llm-hybrid-models-67da31231bba0f733750a99c) |
| `oga-npu` | [NPU Collection](https://huggingface.co/collections/amd/ryzenai-14-llm-npu-models-67da3494ec327bd3aa3c83d7) |
| `oga-cpu` | [CPU Collection](https://huggingface.co/collections/amd/oga-cpu-llm-collection-6808280dc18d268d57353be8) |
| `oga-dml` | [GPU Collection](https://huggingface.co/collections/amd/ryzenai-oga-dml-models-67f940914eee51cbd794b95b) |


## 📚 Additional Resources

- [Lemonade API Examples](https://github.com/onnx/turnkeyml/blob/main/examples/lemonade#api-examples)
- [lemonade.api source](https://github.com/onnx/turnkeyml/blob/main/src/lemonade/api.py)
- [Model Support Matrix (ONNX Runtime GenAI)](https://github.com/microsoft/onnxruntime-genai)

9 changes: 1 addition & 8 deletions docs/lemonade/server_integration.md
@@ -126,14 +126,7 @@ Only `Qwen2.5-0.5B-Instruct-CPU` is installed by default in silent mode. If you
Lemonade_Server_Installer.exe /S /Extras=hybrid /Models="Qwen2.5-0.5B-Instruct-CPU Llama-3.2-1B-Instruct-Hybrid"
```

The available modes are the following:
* `Qwen2.5-0.5B-Instruct-CPU`
* `Llama-3.2-1B-Instruct-Hybrid`
* `Llama-3.2-3B-Instruct-Hybrid`
* `Phi-3-Mini-Instruct-Hybrid`
* `Qwen-1.5-7B-Chat-Hybrid`
* `DeepSeek-R1-Distill-Llama-8B-Hybrid`
* `DeepSeek-R1-Distill-Qwen-7B-Hybrid`
The available models are documented [here](https://github.com/onnx/turnkeyml/blob/main/docs/lemonade/server_models.md).

Finally, if you don't want to create a desktop shortcut during installation, use the `/NoDesktopShortcut` parameter:
