Merged
1 change: 1 addition & 0 deletions .github/workflows/test_server.yml
@@ -36,6 +36,7 @@ jobs:
python -m pip install --upgrade pip
python -m pip check
pip install -e .[llm]
lemonade-install --model Qwen2.5-0.5B-Instruct-CPU
- name: Run server tests
shell: bash -el {0}
run: |
45 changes: 26 additions & 19 deletions README.md
@@ -1,40 +1,47 @@
# Welcome to ONNX TurnkeyML

[![Turnkey tests](https://github.com/onnx/turnkeyml/actions/workflows/test_turnkey.yml/badge.svg)](https://github.com/onnx/turnkeyml/tree/main/test "Check out our tests")
[![Lemonade tests](https://github.com/onnx/turnkeyml/actions/workflows/test_lemonade.yml/badge.svg)](https://github.com/onnx/turnkeyml/tree/main/test "Check out our tests")
[![Turnkey tests](https://github.com/onnx/turnkeyml/actions/workflows/test_turnkey.yml/badge.svg)](https://github.com/onnx/turnkeyml/tree/main/test "Check out our tests")
[![OS - Windows | Linux](https://img.shields.io/badge/OS-windows%20%7C%20linux-blue)](https://github.com/onnx/turnkeyml/blob/main/docs/install.md "Check out our instructions")
[![Made with Python](https://img.shields.io/badge/Python-3.8,3.10-blue?logo=python&logoColor=white)](https://github.com/onnx/turnkeyml/blob/main/docs/install.md "Check out our instructions")

We are on a mission to make it easy to use the most important tools in the ONNX ecosystem. TurnkeyML accomplishes this by providing no-code CLIs and low-code APIs for both general ONNX workflows with `turnkey` as well as LLMs with `lemonade`.
We are on a mission to make it easy to use the most important tools in the ONNX ecosystem. TurnkeyML accomplishes this by providing a full SDK for LLMs with the Lemonade SDK, as well as a no-code CLI for general ONNX workflows with `turnkey`.

| [**Lemonade SDK**](https://github.com/onnx/turnkeyml/blob/main/docs/lemonade/getting_started.md) | [**Turnkey**](https://github.com/onnx/turnkeyml/blob/main/docs/turnkey/getting_started.md) |
|:----------------------------------------------: |:-----------------------------------------------------------------: |
| Serve and benchmark LLMs on CPU, GPU, and NPU. <br/> [Click here to get started with `lemonade`.](https://github.com/onnx/turnkeyml/blob/main/docs/lemonade/getting_started.md) | Export and optimize ONNX models for CNNs and Transformers. <br/> [Click here to get started with `turnkey`.](https://github.com/onnx/turnkeyml/blob/main/docs/turnkey/getting_started.md) |
| <img src="https://github.com/onnx/turnkeyml/blob/main/img/llm_demo.png?raw=true"/> | <img src="https://github.com/onnx/turnkeyml/blob/main/img/classic_demo.png?raw=true"/> |
## 🍋 Lemonade SDK: Quickly serve, benchmark and deploy LLMs

The [Lemonade SDK](https://github.com/onnx/turnkeyml/blob/main/docs/lemonade/README.md) is designed to make it easy to serve, benchmark, and deploy large language models (LLMs) on a variety of hardware platforms, including CPU, GPU, and NPU.

## How It Works
<div align="center">
<img src="https://github.com/user-attachments/assets/83dd6563-f970-414c-bb8c-4f08a0bc4bfa" alt="Lemonade Demo" title="Lemonade in Action">
</div>

The `turnkey` (CNNs and transformers) and `lemonade` (LLMs) CLIs provide a set of `Tools` that users can invoke in a `Sequence`. The first `Tool` takes the input (`-i`), performs some action, and passes its state to the next `Tool` in the `Sequence`.
The [Lemonade SDK](https://github.com/onnx/turnkeyml/blob/main/docs/lemonade/README.md) comprises the following:

You can read the `Sequence` out like a sentence. For example, the demo command above was:
- 🌐 **Lemonade Server**: A server interface that uses the standard OpenAI API, allowing applications to integrate with local LLMs.
- 🐍 **Lemonade Python API**: Offers a high-level API for easy integration of Lemonade LLMs into Python applications, and a low-level API for custom experiments.
- 🖥️ **Lemonade CLI**: The `lemonade` CLI lets you mix-and-match LLMs, frameworks (PyTorch, ONNX, GGUF), and measurement tools to run experiments. The available tools are:
- Prompting an LLM.
- Measuring the accuracy of an LLM using a variety of tests.
- Benchmarking an LLM to get the time-to-first-token and tokens per second.
- Profiling the memory usage of an LLM.

```
> turnkey -i bert.py discover export-pytorch optimize-ort convert-fp16
```
### [Click here to get started with Lemonade.](https://github.com/onnx/turnkeyml/blob/main/docs/lemonade/README.md)

Which you can read like:
## 🔑 Turnkey: A Complementary Tool for ONNX Workflows

> Use `turnkey` on `bert.py` to `discover` the model, `export` the `pytorch` to ONNX, `optimize` the ONNX with `ort`, and `convert` the ONNX to `fp16`.
While Lemonade focuses on LLMs, [Turnkey](https://github.com/onnx/turnkeyml/blob/main/docs/turnkey/README.md) is a no-code CLI designed for general ONNX workflows, such as exporting and optimizing CNNs and Transformers.

You can configure each `Tool` by passing it arguments. For example, `export-pytorch --opset 18` would set the opset of the resulting ONNX model to 18.
To see the list of supported tools, use the following command:

A full command with an argument looks like:

```
> turnkey -i bert.py discover export-pytorch --opset 18 optimize-ort convert-fp16
```
```bash
turnkey -h
```

<div align="center">
<img src="https://github.com/user-attachments/assets/a1461dc4-4dac-40ca-95da-9c62e47cec24" alt="Turnkey Demo" title="Turnkey CLI">
</div>

### [Click here to get started with `turnkey`.](https://github.com/onnx/turnkeyml/blob/main/docs/turnkey/README.md)

## Contributing

66 changes: 31 additions & 35 deletions docs/lemonade/getting_started.md → docs/lemonade/README.md
@@ -1,6 +1,10 @@
# Lemonade SDK
# 🍋 Lemonade SDK

The `lemonade` SDK provides everything needed to get up and running quickly with LLMs on OnnxRuntime GenAI (OGA).
*The long-term objective of the Lemonade SDK is to provide the ONNX ecosystem with the same kind of tools available in the GGUF ecosystem.*

Lemonade SDK is built on top of [OnnxRuntime GenAI (OGA)](https://github.com/microsoft/onnxruntime-genai), an ONNX LLM inference engine developed by Microsoft to improve the LLM experience on AI PCs, especially those with accelerator hardware such as Neural Processing Units (NPUs).

The Lemonade SDK provides everything needed to get up and running quickly with LLMs on OGA:

- [Quick installation from PyPI](#install).
- [CLI with tools for prompting, benchmarking, and accuracy tests](#cli-commands).
Expand All @@ -9,56 +13,48 @@ The `lemonade` SDK provides everything needed to get up and running quickly with

# Install

You can quickly get started with `lemonade` by installing the `turnkeyml` [PyPI package](#from-pypi) with the appropriate extras for your backend, or you can [install from source](#from-source-code) by cloning and installing this repository.
You can quickly get started with Lemonade by installing the `turnkeyml` [PyPI package](#installing-from-pypi) with the appropriate extras for your backend, by [installing from source](#installing-from-source) after cloning this repository, or [via the GUI installer for Lemonade Server](#installing-from-lemonade_server_installerexe).

## From PyPI
## Installing From PyPI

To install `lemonade` from PyPI:
To install the Lemonade SDK from PyPI:

1. Create and activate a [miniconda](https://repo.anaconda.com/miniconda/Miniconda3-latest-Windows-x86_64.exe) environment.
```bash
conda create -n lemon python=3.10
```

```bash
conda activate lemon
```

3. Install lemonade for you backend of choice:
3. Install Lemonade for your backend of choice:
- [OnnxRuntime GenAI with CPU backend](https://github.com/onnx/turnkeyml/blob/main/docs/lemonade/ort_genai_igpu.md):
```bash
pip install -e turnkeyml[llm-oga-cpu]
pip install turnkeyml[llm-oga-cpu]
```
- [OnnxRuntime GenAI with Integrated GPU (iGPU, DirectML) backend](https://github.com/onnx/turnkeyml/blob/main/docs/lemonade/ort_genai_igpu.md):
> Note: Requires Windows and a DirectML-compatible iGPU.
```bash
pip install -e turnkeyml[llm-oga-igpu]
pip install turnkeyml[llm-oga-igpu]
```
- OnnxRuntime GenAI with Ryzen AI Hybrid (NPU + iGPU) backend:
> Note: Ryzen AI Hybrid requires a Windows 11 PC with a AMD Ryzen™ AI 9 HX375, Ryzen AI 9 HX370, or Ryzen AI 9 365 processor.
> - Install the [Ryzen AI driver >= 32.0.203.237](https://ryzenai.docs.amd.com/en/latest/inst.html#install-npu-drivers) (you can check your driver version under Device Manager > Neural Processors).
> - Visit the [AMD Hugging Face page](https://huggingface.co/collections/amd/quark-awq-g128-int4-asym-fp16-onnx-hybrid-13-674b307d2ffa21dd68fa41d5) for supported checkpoints.
```bash
pip install -e turnkeyml[llm-oga-hybrid]
lemonade-install --ryzenai hybrid
```
> Note: Ryzen AI Hybrid requires a Windows 11 PC with an AMD Ryzen™ AI 300-series processor.

- Follow the environment setup instructions [here](https://ryzenai.docs.amd.com/en/latest/llm/high_level_python.html)
- Hugging Face (PyTorch) LLMs for CPU backend:
```bash
pip install -e turnkeyml[llm]
pip install turnkeyml[llm]
```
- llama.cpp: see [instructions](https://github.com/onnx/turnkeyml/blob/main/docs/lemonade/llamacpp.md).

4. Use `lemonade -h` to explore the LLM tools, and see the [command](#cli-commands) and [API](#api) examples below.

## Installing From Source

## From Source Code

To install `lemonade` from source code:

1. Clone: `git clone https://github.com/onnx/turnkeyml.git`
1. `cd turnkeyml` (where `turnkeyml` is the repo root of your clone)
- Note: be sure to run these installation instructions from the repo root.
1. Follow the same instructions as in the [PyPI installation](#from-pypi), except replace the `turnkeyml` with a `.`.
- For example: `pip install -e .[llm-oga-igpu]`
The Lemonade SDK can be installed from source code by cloning this repository and following the instructions [here](source_installation_inst.md).

## From Lemonade_Server_Installer.exe
## Installing From Lemonade_Server_Installer.exe

The Lemonade Server is available as a standalone tool with a one-click Windows installer `.exe`. Check out the [Lemonade_Server_Installer.exe guide](lemonade_server_exe.md) for installation instructions and the [server spec](https://github.com/onnx/turnkeyml/blob/main/docs/lemonade/server_spec.md) to learn more about the functionality.

@@ -76,14 +72,14 @@ lemonade -i microsoft/Phi-3-mini-4k-instruct oga-load --device igpu --dtype int4

Can be read like this:

> Run `lemonade` on the input `(-i)` checkpoint `microsoft/Phi-3-mini-4k-instruct`. First, load it in the OnnxRuntime GenAI framework (`oga-load`), on to the integrated GPU device (`--device igpu`) in the int4 data type (`--dtype int4`). Then, pass the OGA model to the prompting tool (`llm-prompt`) with the prompt (`-p`) "Hello, my thoughts are" and print the response.
> Run `lemonade` on the input `(-i)` checkpoint `microsoft/Phi-3-mini-4k-instruct`. First, load it in the OnnxRuntime GenAI framework (`oga-load`), onto the integrated GPU device (`--device igpu`) in the int4 data type (`--dtype int4`). Then, pass the OGA model to the prompting tool (`llm-prompt`) with the prompt (`-p`) "Hello, my thoughts are" and print the response.

The `lemonade -h` command will show you which options and Tools are available, and `lemonade TOOL -h` will tell you more about that specific Tool.


## Prompting

To prompt your LLM try:
To prompt your LLM, try one of the following:

OGA iGPU:
```bash
@@ -101,11 +97,11 @@ You can also replace the `facebook/opt-125m` with any Hugging Face checkpoint yo

You can also set the `--device` argument in `oga-load` and `huggingface-load` to load your LLM on a different device.

Run `lemonade huggingface-load -h` and `lemonade llm-prompt -h` to learn more about those tools.
Run `lemonade huggingface-load -h` and `lemonade llm-prompt -h` to learn more about these tools.

## Accuracy

To measure the accuracy of an LLM using MMLU, try this:
To measure the accuracy of an LLM using MMLU (Measuring Massive Multitask Language Understanding), try the following:

OGA iGPU:
```bash
@@ -117,13 +113,13 @@ Hugging Face:
lemonade -i facebook/opt-125m huggingface-load accuracy-mmlu --tests management
```

That command will run just the management test from MMLU on your LLM and save the score to the lemonade cache at `~/.cache/lemonade`.
This command will run just the management test from MMLU on your LLM and save the score to the lemonade cache at `~/.cache/lemonade`. You can also run other subject tests by replacing `management` with the new subject name. For the full list of supported subjects, see the [MMLU Accuracy Read Me](mmlu_accuracy.md).

You can run the full suite of MMLU subjects by omitting the `--test` argument. You can learn more about this with `lemonade accuracy-mmlu -h`.
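Under the hood, an MMLU-style score for a subject is simply the fraction of its multiple-choice questions answered correctly. A minimal sketch with hypothetical data (lemonade's actual scoring code may differ in detail):

```python
# Sketch of MMLU-style scoring: each question has four choices (A-D),
# and a subject's score is the fraction of questions answered correctly.
def mmlu_subject_score(predictions, answers):
    """Return accuracy over one subject's multiple-choice questions."""
    if len(predictions) != len(answers):
        raise ValueError("predictions and answers must align")
    correct = sum(p == a for p, a in zip(predictions, answers))
    return correct / len(answers)

# Hypothetical 'management' results: model picked A, C, B, D, A.
score = mmlu_subject_score(["A", "C", "B", "D", "A"],
                           ["A", "B", "B", "D", "C"])
print(f"management: {score:.2f}")  # 3 of 5 correct -> 0.60
```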

## Benchmarking

To measure the time-to-first-token and tokens/second of an LLM, try this:
To measure the time-to-first-token and tokens/second of an LLM, try the following:

OGA iGPU:
```bash
@@ -135,7 +131,7 @@ Hugging Face:
lemonade -i facebook/opt-125m huggingface-load huggingface-bench
```

That command will run a few warmup iterations, then a few generation iterations where performance data is collected.
This command will run a few warm-up iterations, then a few generation iterations where performance data is collected.

The prompt size, number of output tokens, and number of iterations are all parameters. Learn more by running `lemonade oga-bench -h` or `lemonade huggingface-bench -h`.
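For reference, the two headline metrics can be computed from per-token timestamps. A hedged sketch with hypothetical timings (not lemonade's internal benchmarking code):

```python
def benchmark_metrics(token_times, start_time):
    """Compute time-to-first-token (TTFT) and tokens/second from
    per-token completion timestamps, all in seconds."""
    ttft = token_times[0] - start_time
    # Throughput is measured over the tokens after the first, so the
    # prefill/TTFT phase does not skew the generation rate.
    generated = len(token_times) - 1
    tokens_per_second = generated / (token_times[-1] - token_times[0])
    return ttft, tokens_per_second

# Hypothetical run: generation starts at t=0.0 s; 5 tokens complete
# at the timestamps below.
ttft, tps = benchmark_metrics([0.5, 0.6, 0.7, 0.8, 0.9], start_time=0.0)
print(ttft, tps)  # roughly 0.5 s TTFT and ~10 tokens/s
```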

@@ -173,15 +169,15 @@ You can launch an OpenAI-compatible server with:
lemonade serve
```

Visit the [server spec](https://github.com/onnx/turnkeyml/blob/main/docs/lemonade/server_spec.md) to learn more about the endpoints provided.
Visit the [server spec](https://github.com/onnx/turnkeyml/blob/main/docs/lemonade/server_spec.md) to learn more about the endpoints provided as well as how to launch the server with more detailed informational messages enabled.
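Because the server follows the OpenAI API shape, any OpenAI-compatible client can talk to it. Below is a hedged sketch of building a chat-completion request payload; the base URL, port, endpoint path, and model name are assumptions for illustration, so consult the server spec for the real values:

```python
import json

# Hypothetical local endpoint; check the server spec for the actual path/port.
BASE_URL = "http://localhost:8000/api/v0"

# Standard OpenAI-style chat-completion request body.
payload = {
    "model": "Qwen-1.5-7B-Chat-Hybrid",  # any model the server has loaded
    "messages": [
        {"role": "user", "content": "Hello, my thoughts are"},
    ],
    "stream": False,
}

body = json.dumps(payload)
print(body)
# An OpenAI-compatible client (e.g. the `openai` Python package) could POST
# this body to f"{BASE_URL}/chat/completions".
```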

# API

Lemonade is also available via API.

## High-Level APIs

The high-level lemonade API abstracts loading models from any supported framework (e.g., Hugging Face, OGA) and backend (e.g., CPU, iGPU, Hybrid) using the popular `from_pretrained()` function. This makes it easy to integrate lemonade LLMs into Python applications.
The high-level Lemonade API abstracts loading models from any supported framework (e.g., Hugging Face, OGA) and backend (e.g., CPU, iGPU, Hybrid) using the popular `from_pretrained()` function. This makes it easy to integrate Lemonade LLMs into Python applications.

OGA iGPU:
```python
20 changes: 10 additions & 10 deletions docs/lemonade/ort_genai_igpu.md
@@ -4,20 +4,20 @@

## Installation

See [lemonade installation](https://github.com/onnx/turnkeyml/blob/main/docs/lemonade/getting_started.md#install) for the OGA iGPU backend.
See [Lemonade Installation](https://github.com/onnx/turnkeyml/blob/main/docs/lemonade/README.md#install) for the OGA iGPU backend.

## Get models

- The oga-load tool can download models from Hugging Face and build ONNX files using OGA's `model_builder`, which can quantized and optimize models for both igpu and cpu.
- The oga-load tool can download models from Hugging Face and build ONNX files using OGA's `model_builder`, which can quantize and optimize models for both iGPU and CPU.
- Download and build ONNX model files:
- `lemonade -i microsoft/Phi-3-mini-4k-instruct oga-load --device igpu --dtype int4`
- `lemonade -i microsoft/Phi-3-mini-4k-instruct oga-load --device cpu --dtype int4`
- The ONNX model files will be stored in the respective subfolder of the lemonade cache folder and will be reused in future oga-load calls:
- `oga_models\microsoft_phi-3-mini-4k-instruct\dml-int4`
- `oga_models\microsoft_phi-3-mini-4k-instruct\cpu-int4`
- The ONNX model build process can be forced to run again, overwriting the above cache, by using the --force flag:
- The ONNX model build process can be forced to run again, overwriting the above cache, by using the `--force` flag:
- `lemonade -i microsoft/Phi-3-mini-4k-instruct oga-load --device igpu --dtype int4 --force`
- Transformer model architectures supported by the model_builder tool include many popular state-of-the-art models:
- Transformer model architectures supported by the model_builder tool include many popular state-of-the-art models, such as:
- Gemma
- LLaMa
- Mistral
@@ -26,16 +26,16 @@ See [lemonade installation](https://github.com/onnx/turnkeyml/blob/main/docs/lem
- Nemotron
- For the full list of supported models, please see the [model_builder documentation](https://github.com/microsoft/onnxruntime-genai/blob/main/src/python/py/models/README.md).
- The following quantizations are supported for automatically building ONNXRuntime GenAI model files from the Hugging Face repository:
- cpu: fp32, int4
- igpu: fp16, int4
- `cpu`: `fp32`, `int4`
- `igpu`: `fp16`, `int4`

## Directory structure:
- The model_builder tool caches Hugging Face files and temporary ONNX external data files in `<LEMONADE_CACHE>\model_builder`
- The output from model_builder is stored in `<LEMONADE_CACHE>\oga_models\<MODELNAME>\<SUBFOLDER>`
- `MODELNAME` is the Hugging Face checkpoint name where any `/` is mapped to an `_` and everything is lowercase.
- `SUBFOLDER` is `<EP>-<DTYPE>`, where `EP` is the execution provider (`dml` for igpu, `cpu` for cpu, and `npu` for npu) and `DTYPE` is the datatype.
- If the --int4-block-size flag is used then `SUBFOLDER` is` <EP>-<DTYPE>-block-<SIZE>` where `SIZE` is the specified block size.
- Other ONNX models in the format required by onnxruntime-genai can be loaded in lemonade if placed in the `<LEMONADE_CACHE>\oga_models` folder.
- Use the -i and --subfolder flags to specify the folder and subfolder:
- `SUBFOLDER` is `<EP>-<DTYPE>`, where `EP` is the execution provider (`dml` for `igpu`, `cpu` for `cpu`, and `npu` for `npu`) and `DTYPE` is the datatype.
- If the `--int4-block-size` flag is used then `SUBFOLDER` is `<EP>-<DTYPE>-block-<SIZE>` where `SIZE` is the specified block size.
- Other ONNX models in the format required by onnxruntime-genai can be loaded by Lemonade if placed in the `<LEMONADE_CACHE>\oga_models` folder.
- Use the `-i` and `--subfolder` flags to specify the folder and subfolder, for example:
- `lemonade -i my_model_name --subfolder my_subfolder --device igpu --dtype int4 oga-load`
- Lemonade will expect the ONNX model files to be located in `<LEMONADE_CACHE>\oga_models\my_model_name\my_subfolder`
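The cache layout above can be sketched as a small path-mapping helper. This is illustrative only; lemonade's own naming logic is authoritative, and the helper uses portable path separators rather than the Windows-style backslashes shown above:

```python
from pathlib import Path

def oga_model_dir(cache_dir, checkpoint, ep, dtype, block_size=None):
    """Map a Hugging Face checkpoint to its expected OGA cache folder:
    MODELNAME lowercases the checkpoint and replaces '/' with '_';
    SUBFOLDER is '<EP>-<DTYPE>', plus '-block-<SIZE>' when a block size
    is given."""
    model_name = checkpoint.lower().replace("/", "_")
    subfolder = f"{ep}-{dtype}"
    if block_size is not None:
        subfolder += f"-block-{block_size}"
    return Path(cache_dir) / "oga_models" / model_name / subfolder

print(oga_model_dir("~/.cache/lemonade", "microsoft/Phi-3-mini-4k-instruct",
                    "dml", "int4"))
# -> .../oga_models/microsoft_phi-3-mini-4k-instruct/dml-int4
```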
2 changes: 2 additions & 0 deletions docs/lemonade/server_integration.md
@@ -95,6 +95,8 @@ The available modes are the following:
* `Llama-3.2-3B-Instruct-Hybrid`
* `Phi-3-Mini-Instruct-Hybrid`
* `Qwen-1.5-7B-Chat-Hybrid`
* `DeepSeek-R1-Distill-Llama-8B-Hybrid`
* `DeepSeek-R1-Distill-Qwen-7B-Hybrid`

### Command Line Invocation
