4 changes: 2 additions & 2 deletions docs/build/web.md
@@ -5,7 +5,7 @@ description: Learn how to build ONNX Runtime from source to deploy on the web
nav_order: 4
redirect_from: /docs/how-to/build/web
---

# Build ONNX Runtime for Web
{: .no_toc }

@@ -168,7 +168,7 @@ This is the last stage in the build process, please follow the sections in a seq

- Download artifacts from pipeline manually.

you can download prebuilt WebAssembly artifacts from [Windows WebAssembly CI Pipeline](https://dev.azure.com/onnxruntime/onnxruntime/_build?definitionId=161&_a=summary). Select a build, download artifact "Release_wasm" and unzip. See instructions below to put files into destination folders.
You can download prebuilt WebAssembly artifacts from [Windows WebAssembly CI Pipeline](https://github.com/microsoft/onnxruntime/actions/workflows/web.yml). Select a build, download artifact "Release_wasm" and unzip. See instructions below to put files into destination folders.

- Build WebAssembly artifacts.

2 changes: 1 addition & 1 deletion docs/execution-providers/TensorRTRTX-ExecutionProvider.md
@@ -29,7 +29,7 @@ Currently TensorRT RTX supports RTX GPUs from Ampere or later architectures. Sup
Please select the Nvidia TensorRT RTX version of Onnx Runtime: https://onnxruntime.ai/docs/install. (TODO!)

## Build from source
See [Build instructions](../build/eps.md#TensorRT-RTX).
See [Build instructions](../build/eps.md#tensorrt-rtx).

## Requirements

5 changes: 2 additions & 3 deletions docs/extensions/index.md
@@ -4,11 +4,10 @@ has_children: true
nav_order: 7
---

# ONNXRuntime-Extensions
# ONNX Runtime Extensions

[![Build Status](https://dev.azure.com/onnxruntime/onnxruntime/_apis/build/status%2Fmicrosoft.onnxruntime-extensions?branchName=main)](https://dev.azure.com/onnxruntime/onnxruntime/_build/latest?definitionId=209&branchName=main)

ONNXRuntime-Extensions is a library that extends the capability of the ONNX models and inference with ONNX Runtime, via the ONNX Runtime custom operator interface. It includes a set of Custom Operators to support common model pre and post-processing for audio, vision, text, and language models. As with ONNX Runtime, Extensions also supports multiple languages and platforms (Python on Windows/Linux/macOS, Android and iOS mobile platforms and Web assembly for web).
ONNX Runtime Extensions is a library that extends the capability of the ONNX models and inference with ONNX Runtime, via the ONNX Runtime custom operator interface. It includes a set of Custom Operators to support common model pre and post-processing for audio, vision, text, and language models. As with ONNX Runtime, Extensions also supports multiple languages and platforms (Python on Windows/Linux/macOS, Android and iOS mobile platforms and Web assembly for web).

The basic workflow is to add the custom operators to an ONNX model and then to perform inference on the enhanced model with ONNX Runtime and ONNXRuntime-Extensions packages.
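
A minimal sketch of that workflow in Python, assuming the `onnxruntime` and `onnxruntime-extensions` packages are installed; the model file name and input name below are illustrative placeholders rather than values from this page:

```python
import numpy as np
import onnxruntime as ort
from onnxruntime_extensions import get_library_path

# Register the Extensions custom operator library with the session,
# then run the enhanced model as usual.
session_options = ort.SessionOptions()
session_options.register_custom_ops_library(get_library_path())

session = ort.InferenceSession("model_with_custom_ops.onnx", sess_options=session_options)

# Input names and types depend on the model; a string input is shown here as an example.
outputs = session.run(None, {"input_text": np.array(["hello world"])})
```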

154 changes: 1 addition & 153 deletions docs/genai/howto/build-model.md
@@ -8,158 +8,6 @@ nav_order: 3
---

# Generate models using Model Builder
{: .no_toc }

* TOC placeholder
{:toc}
Refer to [model builder guide](https://github.com/microsoft/onnxruntime-genai/blob/main/src/python/py/models/README.md) for the latest documentation.

The model builder greatly accelerates creating optimized and quantized ONNX models that run with the ONNX Runtime generate() API.

## Current Support
The tool currently supports the following model architectures.

- Gemma
- LLaMA
- Mistral
- Phi

## Installation

Model builder is available as an [Olive](https://github.com/microsoft/olive) pass. It is also shipped as part of the onnxruntime-genai Python package. You can also download and run it standalone.

In any case, you need to have the following packages installed.

```bash
pip install torch transformers onnx onnxruntime
```

### Install from package

```bash
pip install --pre onnxruntime-genai
```

#### Direct download

```bash
curl https://raw.githubusercontent.com/microsoft/onnxruntime-genai/main/src/python/py/models/builder.py -o builder.py
```

### Usage

For all available options, please use the `-h/--help` flag.

```bash
# From wheel:
python3 -m onnxruntime_genai.models.builder --help

# From source:
python3 builder.py --help
```

### Original PyTorch Model from HuggingFace

This scenario is where your PyTorch model is not downloaded locally (either in the default Hugging Face cache directory or in a local folder on disk).

```bash

# From wheel:
python3 -m onnxruntime_genai.models.builder -m model_name -o path_to_output_folder -p precision -e execution_provider -c cache_dir_to_save_hf_files

# From source:
python3 builder.py -m model_name -o path_to_output_folder -p precision -e execution_provider -c cache_dir_to_save_hf_files
```

### Original PyTorch Model from Disk

This scenario is where your PyTorch model is already downloaded locally (either in the default Hugging Face cache directory or in a local folder on disk).
```
# From wheel:
python3 -m onnxruntime_genai.models.builder -m model_name -o path_to_output_folder -p precision -e execution_provider -c cache_dir_where_hf_files_are_saved

# From source:
python3 builder.py -m model_name -o path_to_output_folder -p precision -e execution_provider -c cache_dir_where_hf_files_are_saved
```

### Customized or Finetuned PyTorch Model
This scenario is where your PyTorch model has been customized or finetuned for one of the currently supported model architectures and your model can be loaded in Hugging Face.
```
# From wheel:
python3 -m onnxruntime_genai.models.builder -i path_to_local_folder_on_disk -o path_to_output_folder -p precision -e execution_provider

# From source:
python3 builder.py -i path_to_local_folder_on_disk -o path_to_output_folder -p precision -e execution_provider
```

### GGUF Model
This scenario is where your float16/float32 GGUF model is already on disk.
```
# From wheel:
python3 -m onnxruntime_genai.models.builder -m model_name -i path_to_gguf_file -o path_to_output_folder -p precision -e execution_provider -c cache_dir_for_hf_files

# From source:
python3 builder.py -m model_name -i path_to_gguf_file -o path_to_output_folder -p precision -e execution_provider -c cache_dir_for_hf_files
```

### Extra Options
This scenario is for when you want to have control over some specific settings. The below example shows how you can pass key-value arguments to `--extra_options`.
```
# From wheel:
python3 -m onnxruntime_genai.models.builder -m model_name -o path_to_output_folder -p precision -e execution_provider -c cache_dir_for_hf_files --extra_options filename=decoder.onnx

# From source:
python3 builder.py -m model_name -o path_to_output_folder -p precision -e execution_provider -c cache_dir_for_hf_files --extra_options filename=decoder.onnx
```
To see all available options through `--extra_options`, please use the `help` commands in the `Full Usage` section above.

### Config Only
This scenario is for when you already have your optimized and/or quantized ONNX model and you need to create the config files to run with ONNX Runtime generate() API.
```
# From wheel:
python3 -m onnxruntime_genai.models.builder -m model_name -o path_to_output_folder -p precision -e execution_provider -c cache_dir_for_hf_files --extra_options config_only=true

# From source:
python3 builder.py -m model_name -o path_to_output_folder -p precision -e execution_provider -c cache_dir_for_hf_files --extra_options config_only=true
```

Afterwards, please open the `genai_config.json` file in the output folder and modify the fields as needed for your model. You should store your ONNX model in the output folder as well.

### Unit Testing Models
This scenario is where your PyTorch model is already downloaded locally (either in the default Hugging Face cache directory or in a local folder on disk). If it is not already downloaded locally, here is an example of how you can download it.

```
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "your_model_name"
cache_dir = "cache_dir_to_save_hf_files"

model = AutoModelForCausalLM.from_pretrained(model_name, cache_dir=cache_dir)
model.save_pretrained(cache_dir)

tokenizer = AutoTokenizer.from_pretrained(model_name, cache_dir=cache_dir)
tokenizer.save_pretrained(cache_dir)
```

#### Option 1: Use the model builder tool directly
This option is the simplest but it will download another copy of the PyTorch model onto disk to accommodate the change in the number of hidden layers.
```
# From wheel:
python3 -m onnxruntime_genai.models.builder -m model_name -o path_to_output_folder -p precision -e execution_provider --extra_options num_hidden_layers=4

# From source:
python3 builder.py -m model_name -o path_to_output_folder -p precision -e execution_provider --extra_options num_hidden_layers=4
```

#### Option 2: Edit the config.json file on disk and then run the model builder tool

1. Navigate to where the PyTorch model and its associated files are saved on disk.
2. Modify `num_hidden_layers` in `config.json` to your desired target (e.g. 4 layers).
3. Run the below command for the model builder tool.

```
# From wheel:
python3 -m onnxruntime_genai.models.builder -m model_name -o path_to_output_folder -p precision -e execution_provider -c cache_dir_where_hf_files_are_saved

# From source:
python3 builder.py -m model_name -o path_to_output_folder -p precision -e execution_provider -c cache_dir_where_hf_files_are_saved
```
40 changes: 40 additions & 0 deletions docs/genai/howto/past-present-share-buffer.md
@@ -0,0 +1,40 @@
---
title: Past present share buffer
description: How to configure the past present share buffer using the ONNX Runtime generate() API
has_children: false
parent: How to
grand_parent: Generate API (Preview)
nav_order: 6
---

# How to configure the past present share buffer

The past present share buffer is an optimization that can be used to save memory and processing time.

When the share buffer is used, the past and present KV cache buffers point to the same memory block.

When the share buffer is not used, the present KV cache buffers are re-allocated before every forward pass of the model, and copied to the past KV cache buffers.

This is represented in the following diagram:

![Past present share buffer diagram](../../../images/past-present-share-buffer.png)

The size of the KV cache differs depending on the value of `past_present_share_buffer`.


## When past_present_share_buffer is true

Past KV caches = Present KV caches = batch_size * num_key_value_heads * max_length * head_size

Note that the size of the cache is largely determined by the value of the `max_length` parameter.


## When past_present_share_buffer is false

Past KV caches = batch_size * num_key_value_heads * past_sequence_length * head_size

Present KV caches = batch_size * num_key_value_heads * (past_sequence_length + 1) * head_size
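
As an illustration only, the short Python sketch below plugs hypothetical dimensions into these formulas and shows (commented out) how the option might be set through the generate() API's search options; the numbers and the `set_search_options` call are assumptions, not values from this page.

```python
# Hypothetical model dimensions, chosen only to illustrate the formulas above.
batch_size = 1
num_key_value_heads = 8
head_size = 128
max_length = 2048
past_sequence_length = 512

# past_present_share_buffer = true: one buffer per K or V cache, sized for max_length.
shared_elements = batch_size * num_key_value_heads * max_length * head_size
print(f"shared K or V cache per layer: {shared_elements} elements")

# past_present_share_buffer = false: separate past and present buffers at each step.
past_elements = batch_size * num_key_value_heads * past_sequence_length * head_size
present_elements = batch_size * num_key_value_heads * (past_sequence_length + 1) * head_size
print(f"past: {past_elements} elements, present: {present_elements} elements")

# Setting the option at runtime (assumed API shape; it can also be set in genai_config.json):
# import onnxruntime_genai as og
# model = og.Model("path/to/model_folder")
# params = og.GeneratorParams(model)
# params.set_search_options(max_length=max_length, past_present_share_buffer=True)
```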




4 changes: 3 additions & 1 deletion docs/genai/index.md
@@ -13,9 +13,11 @@ Run generative AI models with ONNX Runtime.

See the source code here: [https://github.com/microsoft/onnxruntime-genai](https://github.com/microsoft/onnxruntime-genai)

This library provides the generative AI loop for ONNX models, including inference with ONNX Runtime, logits processing, search and sampling, and KV cache management.
This library provides the generative AI loop for ONNX models, including tokenization and other pre-processing, inference with ONNX Runtime, logits processing, search and sampling, and KV cache management.

Users can call a high level `generate()` method, or run each iteration of the model in a loop, generating one token at a time, and optionally updating generation parameters inside the loop.

It has support for greedy/beam search and TopP, TopK sampling to generate token sequences and built-in logits processing like repetition penalties. You can also easily add custom scoring.

Other supported features include applying chat templates and structured output (for tool calling).
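
As an illustration of the loop described above, here is a minimal Python sketch assuming the onnxruntime-genai package; method names such as `append_tokens` and `generate_next_token` have changed between releases, so treat this as a sketch rather than the definitive API.

```python
import onnxruntime_genai as og

# "path/to/model_folder" is a placeholder for a folder containing the ONNX model and genai_config.json.
model = og.Model("path/to/model_folder")
tokenizer = og.Tokenizer(model)
tokenizer_stream = tokenizer.create_stream()

params = og.GeneratorParams(model)
params.set_search_options(max_length=256)  # search/sampling options such as top_k, top_p, temperature

generator = og.Generator(model, params)
generator.append_tokens(tokenizer.encode("What is ONNX Runtime?"))

# Generate one token per iteration; generation parameters can be inspected or adjusted inside the loop.
while not generator.is_done():
    generator.generate_next_token()
    new_token = generator.get_next_tokens()[0]
    print(tokenizer_stream.decode(new_token), end="", flush=True)
```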
