Merged
1 change: 1 addition & 0 deletions .github/workflows/test_server.yml
@@ -36,6 +36,7 @@ jobs:
python -m pip install --upgrade pip
python -m pip check
pip install -e .[llm]
lemonade-install --model Qwen2.5-0.5B-Instruct-CPU
- name: Run server tests
shell: bash -el {0}
run: |
45 changes: 26 additions & 19 deletions README.md
@@ -1,40 +1,47 @@
# Welcome to ONNX TurnkeyML

[![Turnkey tests](https://github.com/onnx/turnkeyml/actions/workflows/test_turnkey.yml/badge.svg)](https://github.com/onnx/turnkeyml/tree/main/test "Check out our tests")
[![Lemonade tests](https://github.com/onnx/turnkeyml/actions/workflows/test_lemonade.yml/badge.svg)](https://github.com/onnx/turnkeyml/tree/main/test "Check out our tests")
[![Turnkey tests](https://github.com/onnx/turnkeyml/actions/workflows/test_turnkey.yml/badge.svg)](https://github.com/onnx/turnkeyml/tree/main/test "Check out our tests")
[![OS - Windows | Linux](https://img.shields.io/badge/OS-windows%20%7C%20linux-blue)](https://github.com/onnx/turnkeyml/blob/main/docs/install.md "Check out our instructions")
[![Made with Python](https://img.shields.io/badge/Python-3.8,3.10-blue?logo=python&logoColor=white)](https://github.com/onnx/turnkeyml/blob/main/docs/install.md "Check out our instructions")

We are on a mission to make it easy to use the most important tools in the ONNX ecosystem. TurnkeyML accomplishes this by providing no-code CLIs and low-code APIs for both general ONNX workflows with `turnkey` as well as LLMs with `lemonade`.
We are on a mission to make it easy to use the most important tools in the ONNX ecosystem. TurnkeyML accomplishes this by providing a full SDK for LLMs with the Lemonade SDK, as well as a no-code CLI for general ONNX workflows with `turnkey`.

| [**Lemonade SDK**](https://github.com/onnx/turnkeyml/blob/main/docs/lemonade/getting_started.md) | [**Turnkey**](https://github.com/onnx/turnkeyml/blob/main/docs/turnkey/getting_started.md) |
|:----------------------------------------------: |:-----------------------------------------------------------------: |
| Serve and benchmark LLMs on CPU, GPU, and NPU. <br/> [Click here to get started with `lemonade`.](https://github.com/onnx/turnkeyml/blob/main/docs/lemonade/getting_started.md) | Export and optimize ONNX models for CNNs and Transformers. <br/> [Click here to get started with `turnkey`.](https://github.com/onnx/turnkeyml/blob/main/docs/turnkey/getting_started.md) |
| <img src="https://github.com/onnx/turnkeyml/blob/main/img/llm_demo.png?raw=true"/> | <img src="https://github.com/onnx/turnkeyml/blob/main/img/classic_demo.png?raw=true"/> |
## 🍋 Lemonade SDK: Quickly serve, benchmark and deploy LLMs

The [Lemonade SDK](https://github.com/onnx/turnkeyml/blob/main/docs/lemonade/README.md) is designed to make it easy to serve, benchmark, and deploy large language models (LLMs) on a variety of hardware platforms, including CPU, GPU, and NPU.

## How It Works
<div align="center">
<img src="https://github.com/user-attachments/assets/83dd6563-f970-414c-bb8c-4f08a0bc4bfa" alt="Lemonade Demo" title="Lemonade in Action">
</div>

The `turnkey` (CNNs and transformers) and `lemonade` (LLMs) CLIs provide a set of `Tools` that users can invoke in a `Sequence`. The first `Tool` takes the input (`-i`), performs some action, and passes its state to the next `Tool` in the `Sequence`.
The [Lemonade SDK](https://github.com/onnx/turnkeyml/blob/main/docs/lemonade/README.md) comprises the following:

You can read the `Sequence` out like a sentence. For example, the demo command above was:
- 🌐 **Lemonade Server**: A server interface that uses the standard OpenAI API, allowing applications to integrate with local LLMs.
- 🐍 **Lemonade Python API**: Offers a high-level API for easy integration of Lemonade LLMs into Python applications, and a low-level API for custom experiments.
- 🖥️ **Lemonade CLI**: The `lemonade` CLI lets you mix-and-match LLMs, frameworks (PyTorch, ONNX, GGUF), and measurement tools to run experiments. The available tools are:
- Prompting an LLM.
- Measuring the accuracy of an LLM using a variety of tests.
- Benchmarking an LLM to get the time-to-first-token and tokens per second.
- Profiling the memory usage of an LLM.

```
> turnkey -i bert.py discover export-pytorch optimize-ort convert-fp16
```
### [Click here to get started with Lemonade.](https://github.com/onnx/turnkeyml/blob/main/docs/lemonade/README.md)

Which you can read like:
## 🔑 Turnkey: A Complementary Tool for ONNX Workflows

> Use `turnkey` on `bert.py` to `discover` the model, `export` the `pytorch` to ONNX, `optimize` the ONNX with `ort`, and `convert` the ONNX to `fp16`.
While Lemonade focuses on LLMs, [Turnkey](https://github.com/onnx/turnkeyml/blob/main/docs/turnkey/README.md) is a no-code CLI designed for general ONNX workflows, such as exporting and optimizing CNNs and Transformers.

You can configure each `Tool` by passing it arguments. For example, `export-pytorch --opset 18` would set the opset of the resulting ONNX model to 18.
To see the list of supported tools, use the following command:

A full command with an argument looks like:

```
> turnkey -i bert.py discover export-pytorch --opset 18 optimize-ort convert-fp16
```
```bash
turnkey -h
```

<div align="center">
<img src="https://github.com/user-attachments/assets/a1461dc4-4dac-40ca-95da-9c62e47cec24" alt="Turnkey Demo" title="Turnkey CLI">
</div>

### [Click here to get started with `turnkey`.](https://github.com/onnx/turnkeyml/blob/main/docs/turnkey/README.md)

## Contributing

66 changes: 31 additions & 35 deletions docs/lemonade/getting_started.md → docs/lemonade/README.md
@@ -1,6 +1,10 @@
# Lemonade SDK
# 🍋 Lemonade SDK

The `lemonade` SDK provides everything needed to get up and running quickly with LLMs on OnnxRuntime GenAI (OGA).
*The long-term objective of the Lemonade SDK is to provide the ONNX ecosystem with the same kind of tools available in the GGUF ecosystem.*

Lemonade SDK is built on top of [OnnxRuntime GenAI (OGA)](https://github.com/microsoft/onnxruntime-genai), an ONNX LLM inference engine developed by Microsoft to improve the LLM experience on AI PCs, especially those with accelerator hardware such as Neural Processing Units (NPUs).

The Lemonade SDK provides everything needed to get up and running quickly with LLMs on OGA:

- [Quick installation from PyPI](#install).
- [CLI with tools for prompting, benchmarking, and accuracy tests](#cli-commands).
Expand All @@ -9,56 +13,48 @@ The `lemonade` SDK provides everything needed to get up and running quickly with

# Install

You can quickly get started with `lemonade` by installing the `turnkeyml` [PyPI package](#from-pypi) with the appropriate extras for your backend, or you can [install from source](#from-source-code) by cloning and installing this repository.
You can quickly get started with Lemonade by installing the `turnkeyml` [PyPI package](#installing-from-pypi) with the appropriate extras for your backend, by [installing from source](#installing-from-source) after cloning this repository, or [via the GUI installer for Lemonade Server](#installing-from-lemonade_server_installerexe).

## From PyPI
## Installing From PyPI

To install `lemonade` from PyPI:
To install the Lemonade SDK from PyPI:

1. Create and activate a [miniconda](https://repo.anaconda.com/miniconda/Miniconda3-latest-Windows-x86_64.exe) environment.
```bash
conda create -n lemon python=3.10
```

```bash
conda activate lemon
```

3. Install lemonade for you backend of choice:
3. Install Lemonade for your backend of choice:
- [OnnxRuntime GenAI with CPU backend](https://github.com/onnx/turnkeyml/blob/main/docs/lemonade/ort_genai_igpu.md):
```bash
pip install -e turnkeyml[llm-oga-cpu]
pip install turnkeyml[llm-oga-cpu]
```
- [OnnxRuntime GenAI with Integrated GPU (iGPU, DirectML) backend](https://github.com/onnx/turnkeyml/blob/main/docs/lemonade/ort_genai_igpu.md):
> Note: Requires Windows and a DirectML-compatible iGPU.
```bash
pip install -e turnkeyml[llm-oga-igpu]
pip install turnkeyml[llm-oga-igpu]
```
- OnnxRuntime GenAI with Ryzen AI Hybrid (NPU + iGPU) backend:
> Note: Ryzen AI Hybrid requires a Windows 11 PC with a AMD Ryzen™ AI 9 HX375, Ryzen AI 9 HX370, or Ryzen AI 9 365 processor.
> - Install the [Ryzen AI driver >= 32.0.203.237](https://ryzenai.docs.amd.com/en/latest/inst.html#install-npu-drivers) (you can check your driver version under Device Manager > Neural Processors).
> - Visit the [AMD Hugging Face page](https://huggingface.co/collections/amd/quark-awq-g128-int4-asym-fp16-onnx-hybrid-13-674b307d2ffa21dd68fa41d5) for supported checkpoints.
```bash
pip install -e turnkeyml[llm-oga-hybrid]
lemonade-install --ryzenai hybrid
```
> Note: Ryzen AI Hybrid requires a Windows 11 PC with an AMD Ryzen™ AI 300-series processor.

- Follow the environment setup instructions [here](https://ryzenai.docs.amd.com/en/latest/llm/high_level_python.html)
- Hugging Face (PyTorch) LLMs for CPU backend:
```bash
pip install -e turnkeyml[llm]
pip install turnkeyml[llm]
```
- llama.cpp: see [instructions](https://github.com/onnx/turnkeyml/blob/main/docs/lemonade/llamacpp.md).

4. Use `lemonade -h` to explore the LLM tools, and see the [command](#cli-commands) and [API](#api) examples below.

## Installing From Source

## From Source Code

To install `lemonade` from source code:

1. Clone: `git clone https://github.com/onnx/turnkeyml.git`
1. `cd turnkeyml` (where `turnkeyml` is the repo root of your clone)
- Note: be sure to run these installation instructions from the repo root.
1. Follow the same instructions as in the [PyPI installation](#from-pypi), except replace the `turnkeyml` with a `.`.
- For example: `pip install -e .[llm-oga-igpu]`
The Lemonade SDK can be installed from source code by cloning this repository and following the instructions [here](source_installation_inst.md).

## From Lemonade_Server_Installer.exe
## Installing From Lemonade_Server_Installer.exe

The Lemonade Server is available as a standalone tool with a one-click Windows installer `.exe`. Check out the [Lemonade_Server_Installer.exe guide](lemonade_server_exe.md) for installation instructions and the [server spec](https://github.com/onnx/turnkeyml/blob/main/docs/lemonade/server_spec.md) to learn more about the functionality.

@@ -76,14 +72,14 @@ lemonade -i microsoft/Phi-3-mini-4k-instruct oga-load --device igpu --dtype int4

Can be read like this:

> Run `lemonade` on the input `(-i)` checkpoint `microsoft/Phi-3-mini-4k-instruct`. First, load it in the OnnxRuntime GenAI framework (`oga-load`), on to the integrated GPU device (`--device igpu`) in the int4 data type (`--dtype int4`). Then, pass the OGA model to the prompting tool (`llm-prompt`) with the prompt (`-p`) "Hello, my thoughts are" and print the response.
> Run `lemonade` on the input `(-i)` checkpoint `microsoft/Phi-3-mini-4k-instruct`. First, load it in the OnnxRuntime GenAI framework (`oga-load`), onto the integrated GPU device (`--device igpu`) in the int4 data type (`--dtype int4`). Then, pass the OGA model to the prompting tool (`llm-prompt`) with the prompt (`-p`) "Hello, my thoughts are" and print the response.

The `lemonade -h` command will show you which options and Tools are available, and `lemonade TOOL -h` will tell you more about that specific Tool.


## Prompting

To prompt your LLM try:
To prompt your LLM, try one of the following:

OGA iGPU:
```bash
@@ -101,11 +97,11 @@ You can also replace the `facebook/opt-125m` with any Hugging Face checkpoint yo

You can also set the `--device` argument in `oga-load` and `huggingface-load` to load your LLM on a different device.

Run `lemonade huggingface-load -h` and `lemonade llm-prompt -h` to learn more about those tools.
Run `lemonade huggingface-load -h` and `lemonade llm-prompt -h` to learn more about these tools.

## Accuracy

To measure the accuracy of an LLM using MMLU, try this:
To measure the accuracy of an LLM using MMLU (Measuring Massive Multitask Language Understanding), try the following:

OGA iGPU:
```bash
@@ -117,13 +113,13 @@ Hugging Face:
lemonade -i facebook/opt-125m huggingface-load accuracy-mmlu --tests management
```

That command will run just the management test from MMLU on your LLM and save the score to the lemonade cache at `~/.cache/lemonade`.
This command will run just the management test from MMLU on your LLM and save the score to the lemonade cache at `~/.cache/lemonade`. You can also run other subject tests by replacing `management` with the new subject name. For the full list of supported subjects, see the [MMLU Accuracy Read Me](mmlu_accuracy.md).

You can run the full suite of MMLU subjects by omitting the `--test` argument. You can learn more about this with `lemonade accuracy-mmlu -h`.
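Under the hood, an MMLU-style score for a subject is simply the fraction of its multiple-choice questions answered correctly. A minimal sketch with hypothetical data (lemonade's actual scoring code may differ in detail):

```python
# Sketch of MMLU-style scoring: each question has four choices (A-D),
# and a subject's score is the fraction of questions answered correctly.
def mmlu_subject_score(predictions, answers):
    """Return accuracy over one subject's multiple-choice questions."""
    if len(predictions) != len(answers):
        raise ValueError("predictions and answers must align")
    correct = sum(p == a for p, a in zip(predictions, answers))
    return correct / len(answers)

# Hypothetical 'management' results: model picked A, C, B, D, A.
score = mmlu_subject_score(["A", "C", "B", "D", "A"],
                           ["A", "B", "B", "D", "C"])
print(f"management: {score:.2f}")  # 3 of 5 correct -> 0.60
```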

## Benchmarking

To measure the time-to-first-token and tokens/second of an LLM, try this:
To measure the time-to-first-token and tokens/second of an LLM, try the following:

OGA iGPU:
```bash
@@ -135,7 +131,7 @@ Hugging Face:
lemonade -i facebook/opt-125m huggingface-load huggingface-bench
```

That command will run a few warmup iterations, then a few generation iterations where performance data is collected.
This command will run a few warm-up iterations, then a few generation iterations where performance data is collected.

The prompt size, number of output tokens, and number of iterations are all parameters. Learn more by running `lemonade oga-bench -h` or `lemonade huggingface-bench -h`.
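For reference, the two headline metrics can be computed from per-token timestamps. A hedged sketch with hypothetical timings (not lemonade's internal benchmarking code):

```python
def benchmark_metrics(token_times, start_time):
    """Compute time-to-first-token (TTFT) and tokens/second from
    per-token completion timestamps, all in seconds."""
    ttft = token_times[0] - start_time
    # Throughput is measured over the tokens after the first, so the
    # prefill/TTFT phase does not skew the generation rate.
    generated = len(token_times) - 1
    tokens_per_second = generated / (token_times[-1] - token_times[0])
    return ttft, tokens_per_second

# Hypothetical run: generation starts at t=0.0 s; 5 tokens complete
# at the timestamps below.
ttft, tps = benchmark_metrics([0.5, 0.6, 0.7, 0.8, 0.9], start_time=0.0)
print(ttft, tps)  # roughly 0.5 s TTFT and ~10 tokens/s
```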

@@ -173,15 +169,15 @@ You can launch an OpenAI-compatible server with:
lemonade serve
```

Visit the [server spec](https://github.com/onnx/turnkeyml/blob/main/docs/lemonade/server_spec.md) to learn more about the endpoints provided.
Visit the [server spec](https://github.com/onnx/turnkeyml/blob/main/docs/lemonade/server_spec.md) to learn more about the endpoints provided as well as how to launch the server with more detailed informational messages enabled.
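Because the server follows the OpenAI API shape, any OpenAI-compatible client can talk to it. Below is a hedged sketch of building a chat-completion request payload; the base URL, port, endpoint path, and model name are assumptions for illustration, so consult the server spec for the real values:

```python
import json

# Hypothetical local endpoint; check the server spec for the actual path/port.
BASE_URL = "http://localhost:8000/api/v0"

# Standard OpenAI-style chat-completion request body.
payload = {
    "model": "Qwen-1.5-7B-Chat-Hybrid",  # any model the server has loaded
    "messages": [
        {"role": "user", "content": "Hello, my thoughts are"},
    ],
    "stream": False,
}

body = json.dumps(payload)
print(body)
# An OpenAI-compatible client (e.g. the `openai` Python package) could POST
# this body to f"{BASE_URL}/chat/completions".
```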

# API

Lemonade is also available via API.

## High-Level APIs

The high-level lemonade API abstracts loading models from any supported framework (e.g., Hugging Face, OGA) and backend (e.g., CPU, iGPU, Hybrid) using the popular `from_pretrained()` function. This makes it easy to integrate lemonade LLMs into Python applications.
The high-level Lemonade API abstracts loading models from any supported framework (e.g., Hugging Face, OGA) and backend (e.g., CPU, iGPU, Hybrid) using the popular `from_pretrained()` function. This makes it easy to integrate Lemonade LLMs into Python applications.

OGA iGPU:
```python
20 changes: 10 additions & 10 deletions docs/lemonade/ort_genai_igpu.md
@@ -4,20 +4,20 @@

## Installation

See [lemonade installation](https://github.com/onnx/turnkeyml/blob/main/docs/lemonade/getting_started.md#install) for the OGA iGPU backend.
See [Lemonade Installation](https://github.com/onnx/turnkeyml/blob/main/docs/lemonade/README.md#install) for the OGA iGPU backend.

## Get models

- The oga-load tool can download models from Hugging Face and build ONNX files using OGA's `model_builder`, which can quantized and optimize models for both igpu and cpu.
- The oga-load tool can download models from Hugging Face and build ONNX files using OGA's `model_builder`, which can quantize and optimize models for both iGPU and CPU.
- Download and build ONNX model files:
- `lemonade -i microsoft/Phi-3-mini-4k-instruct oga-load --device igpu --dtype int4`
- `lemonade -i microsoft/Phi-3-mini-4k-instruct oga-load --device cpu --dtype int4`
- The ONNX model files will be stored in the respective subfolder of the lemonade cache folder and will be reused in future oga-load calls:
- `oga_models\microsoft_phi-3-mini-4k-instruct\dml-int4`
- `oga_models\microsoft_phi-3-mini-4k-instruct\cpu-int4`
- The ONNX model build process can be forced to run again, overwriting the above cache, by using the --force flag:
- The ONNX model build process can be forced to run again, overwriting the above cache, by using the `--force` flag:
- `lemonade -i microsoft/Phi-3-mini-4k-instruct oga-load --device igpu --dtype int4 --force`
- Transformer model architectures supported by the model_builder tool include many popular state-of-the-art models:
- Transformer model architectures supported by the model_builder tool include many popular state-of-the-art models, such as:
- Gemma
- LLaMa
- Mistral
@@ -26,16 +26,16 @@ See [lemonade installation](https://github.com/onnx/turnkeyml/blob/main/docs/lem
- Nemotron
- For the full list of supported models, please see the [model_builder documentation](https://github.com/microsoft/onnxruntime-genai/blob/main/src/python/py/models/README.md).
- The following quantizations are supported for automatically building ONNXRuntime GenAI model files from the Hugging Face repository:
- cpu: fp32, int4
- igpu: fp16, int4
- `cpu`: `fp32`, `int4`
- `igpu`: `fp16`, `int4`

## Directory structure:
- The model_builder tool caches Hugging Face files and temporary ONNX external data files in `<LEMONADE_CACHE>\model_builder`
- The output from model_builder is stored in `<LEMONADE_CACHE>\oga_models\<MODELNAME>\<SUBFOLDER>`
- `MODELNAME` is the Hugging Face checkpoint name where any `/` is mapped to an `_` and everything is lowercase.
- `SUBFOLDER` is `<EP>-<DTYPE>`, where `EP` is the execution provider (`dml` for igpu, `cpu` for cpu, and `npu` for npu) and `DTYPE` is the datatype.
- If the --int4-block-size flag is used then `SUBFOLDER` is` <EP>-<DTYPE>-block-<SIZE>` where `SIZE` is the specified block size.
- Other ONNX models in the format required by onnxruntime-genai can be loaded in lemonade if placed in the `<LEMONADE_CACHE>\oga_models` folder.
- Use the -i and --subfolder flags to specify the folder and subfolder:
- `SUBFOLDER` is `<EP>-<DTYPE>`, where `EP` is the execution provider (`dml` for `igpu`, `cpu` for `cpu`, and `npu` for `npu`) and `DTYPE` is the datatype.
- If the `--int4-block-size` flag is used then `SUBFOLDER` is `<EP>-<DTYPE>-block-<SIZE>` where `SIZE` is the specified block size.
- Other ONNX models in the format required by onnxruntime-genai can be loaded by Lemonade if placed in the `<LEMONADE_CACHE>\oga_models` folder.
- Use the `-i` and `--subfolder` flags to specify the folder and subfolder, for example:
- `lemonade -i my_model_name --subfolder my_subfolder --device igpu --dtype int4 oga-load`
- Lemonade will expect the ONNX model files to be located in `<LEMONADE_CACHE>\oga_models\my_model_name\my_subfolder`
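The cache layout above can be sketched as a small path-mapping helper. This is illustrative only; lemonade's own naming logic is authoritative, and the helper uses portable path separators rather than the Windows-style backslashes shown above:

```python
from pathlib import Path

def oga_model_dir(cache_dir, checkpoint, ep, dtype, block_size=None):
    """Map a Hugging Face checkpoint to its expected OGA cache folder:
    MODELNAME lowercases the checkpoint and replaces '/' with '_';
    SUBFOLDER is '<EP>-<DTYPE>', plus '-block-<SIZE>' when a block size
    is given."""
    model_name = checkpoint.lower().replace("/", "_")
    subfolder = f"{ep}-{dtype}"
    if block_size is not None:
        subfolder += f"-block-{block_size}"
    return Path(cache_dir) / "oga_models" / model_name / subfolder

print(oga_model_dir("~/.cache/lemonade", "microsoft/Phi-3-mini-4k-instruct",
                    "dml", "int4"))
# -> .../oga_models/microsoft_phi-3-mini-4k-instruct/dml-int4
```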
2 changes: 2 additions & 0 deletions docs/lemonade/server_integration.md
@@ -95,6 +95,8 @@ The available modes are the following:
* `Llama-3.2-3B-Instruct-Hybrid`
* `Phi-3-Mini-Instruct-Hybrid`
* `Qwen-1.5-7B-Chat-Hybrid`
* `DeepSeek-R1-Distill-Llama-8B-Hybrid`
* `DeepSeek-R1-Distill-Qwen-7B-Hybrid`

### Command Line Invocation
