
Commit d47aea1

Update config spec (#24913)
1 parent 67491a3 commit d47aea1

File tree

11 files changed, +521 -235 lines changed


docs/build/web.md

Lines changed: 2 additions & 2 deletions
@@ -5,7 +5,7 @@ description: Learn how to build ONNX Runtime from source to deploy on the web
 nav_order: 4
 redirect_from: /docs/how-to/build/web
 ---
-
+
 # Build ONNX Runtime for Web
 {: .no_toc }

@@ -168,7 +168,7 @@ This is the last stage in the build process, please follow the sections in a seq
 - Download artifacts from pipeline manually.

-you can download prebuilt WebAssembly artifacts from [Windows WebAssembly CI Pipeline](https://dev.azure.com/onnxruntime/onnxruntime/_build?definitionId=161&_a=summary). Select a build, download artifact "Release_wasm" and unzip. See instructions below to put files into destination folders.
+you can download prebuilt WebAssembly artifacts from [Windows WebAssembly CI Pipeline](https://github.com/microsoft/onnxruntime/actions/workflows/web.yml). Select a build, download artifact "Release_wasm" and unzip. See instructions below to put files into destination folders.

 - Build WebAssembly artifacts.

docs/execution-providers/TensorRTRTX-ExecutionProvider.md

Lines changed: 1 addition & 1 deletion
@@ -29,7 +29,7 @@ Currently TensorRT RTX supports RTX GPUs from Ampere or later architectures. Sup
 Please select the Nvidia TensorRT RTX version of Onnx Runtime: https://onnxruntime.ai/docs/install. (TODO!)

 ## Build from source
-See [Build instructions](../build/eps.md#TensorRT-RTX).
+See [Build instructions](../build/eps.md#tensorrt-rtx).

 ## Requirements

docs/extensions/index.md

Lines changed: 2 additions & 3 deletions
@@ -4,11 +4,10 @@ has_children: true
 nav_order: 7
 ---

-# ONNXRuntime-Extensions
+# ONNX Runtime Extensions

-[![Build Status](https://dev.azure.com/onnxruntime/onnxruntime/_apis/build/status%2Fmicrosoft.onnxruntime-extensions?branchName=main)](https://dev.azure.com/onnxruntime/onnxruntime/_build/latest?definitionId=209&branchName=main)

-ONNXRuntime-Extensions is a library that extends the capability of the ONNX models and inference with ONNX Runtime, via the ONNX Runtime custom operator interface. It includes a set of Custom Operators to support common model pre and post-processing for audio, vision, text, and language models. As with ONNX Runtime, Extensions also supports multiple languages and platforms (Python on Windows/Linux/macOS, Android and iOS mobile platforms and Web assembly for web).
+ONNX Runtime Extensions is a library that extends the capability of the ONNX models and inference with ONNX Runtime, via the ONNX Runtime custom operator interface. It includes a set of Custom Operators to support common model pre and post-processing for audio, vision, text, and language models. As with ONNX Runtime, Extensions also supports multiple languages and platforms (Python on Windows/Linux/macOS, Android and iOS mobile platforms and Web assembly for web).

 The basic workflow is to add the custom operators to an ONNX model and then to perform inference on the enhanced model with ONNX Runtime and ONNXRuntime-Extensions packages.
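
As a small illustration of that workflow, the sketch below registers the extensions custom-operator library with an ONNX Runtime session before loading a model that uses those operators. The model path is a placeholder, and the snippet assumes the onnxruntime and onnxruntime-extensions Python packages are installed.

```python
import onnxruntime as ort
from onnxruntime_extensions import get_library_path

# Register the ONNX Runtime Extensions custom-op library with the session options.
sess_options = ort.SessionOptions()
sess_options.register_custom_ops_library(get_library_path())

# "model_with_custom_ops.onnx" is a placeholder for a model enhanced with custom operators.
session = ort.InferenceSession("model_with_custom_ops.onnx", sess_options)

# Input names and shapes depend on the model; inspect them before running.
print([(i.name, i.shape) for i in session.get_inputs()])
```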

docs/genai/howto/build-model.md

Lines changed: 1 addition & 153 deletions
@@ -8,158 +8,6 @@ nav_order: 3
 ---

 # Generate models using Model Builder
-{: .no_toc }

-* TOC placeholder
-{:toc}
+Refer to [model builder guide](https://github.com/microsoft/onnxruntime-genai/blob/main/src/python/py/models/README.md) for the latest documentation.

-The model builder greatly accelerates creating optimized and quantized ONNX models that run with the ONNX Runtime generate() API.
-
-## Current Support
-The tool currently supports the following model architectures.
-
-- Gemma
-- LLaMA
-- Mistral
-- Phi
-
-## Installation
-
-Model builder is available as an [Olive](https://github.com/microsoft/olive) pass. It is also shipped as part of the onnxruntime-genai Python package. You can also download and run it standalone.
-
-In any case, you need to have the following packages installed.
-
-```bash
-pip install torch transformers onnx onnxruntime
-```
-
-### Install from package
-
-```bash
-pip install --pre onnxruntime-genai
-```
-
-#### Direct download
-
-```bash
-curl https://raw.githubusercontent.com/microsoft/onnxruntime-genai/main/src/python/py/models/builder.py -o builder.py
-```
-
-### Usage
-
-For all available options, please use the `-h/--help` flag.
-
-```bash
-# From wheel:
-python3 -m onnxruntime_genai.models.builder --help
-
-# From source:
-python3 builder.py --help
-```
-
-### Original PyTorch Model from HuggingFace
-
-This scenario is where your PyTorch model is not downloaded locally (either in the default Hugging Face cache directory or in a local folder on disk).
-
-```bash
-
-# From wheel:
-python3 -m onnxruntime_genai.models.builder -m model_name -o path_to_output_folder -p precision -e execution_provider -c cache_dir_to_save_hf_files
-
-# From source:
-python3 builder.py -m model_name -o path_to_output_folder -p precision -e execution_provider -c cache_dir_to_save_hf_files
-```
-
-### Original PyTorch Model from Disk
-
-This scenario is where your PyTorch model is already downloaded locally (either in the default Hugging Face cache directory or in a local folder on disk).
-```
-# From wheel:
-python3 -m onnxruntime_genai.models.builder -m model_name -o path_to_output_folder -p precision -e execution_provider -c cache_dir_where_hf_files_are_saved
-
-# From source:
-python3 builder.py -m model_name -o path_to_output_folder -p precision -e execution_provider -c cache_dir_where_hf_files_are_saved
-```
-
-### Customized or Finetuned PyTorch Model
-This scenario is where your PyTorch model has been customized or finetuned for one of the currently supported model architectures and your model can be loaded in Hugging Face.
-```
-# From wheel:
-python3 -m onnxruntime_genai.models.builder -i path_to_local_folder_on_disk -o path_to_output_folder -p precision -e execution_provider
-
-# From source:
-python3 builder.py -i path_to_local_folder_on_disk -o path_to_output_folder -p precision -e execution_provider
-```
-
-### GGUF Model
-This scenario is where your float16/float32 GGUF model is already on disk.
-```
-# From wheel:
-python3 -m onnxruntime_genai.models.builder -m model_name -i path_to_gguf_file -o path_to_output_folder -p precision -e execution_provider -c cache_dir_for_hf_files
-
-# From source:
-python3 builder.py -m model_name -i path_to_gguf_file -o path_to_output_folder -p precision -e execution_provider -c cache_dir_for_hf_files
-```
-
-### Extra Options
-This scenario is for when you want to have control over some specific settings. The below example shows how you can pass key-value arguments to `--extra_options`.
-```
-# From wheel:
-python3 -m onnxruntime_genai.models.builder -m model_name -o path_to_output_folder -p precision -e execution_provider -c cache_dir_for_hf_files --extra_options filename=decoder.onnx
-
-# From source:
-python3 builder.py -m model_name -o path_to_output_folder -p precision -e execution_provider -c cache_dir_for_hf_files --extra_options filename=decoder.onnx
-```
-To see all available options through `--extra_options`, please use the `help` commands in the `Full Usage` section above.
-
-### Config Only
-This scenario is for when you already have your optimized and/or quantized ONNX model and you need to create the config files to run with ONNX Runtime generate() API.
-```
-# From wheel:
-python3 -m onnxruntime_genai.models.builder -m model_name -o path_to_output_folder -p precision -e execution_provider -c cache_dir_for_hf_files --extra_options config_only=true
-
-# From source:
-python3 builder.py -m model_name -o path_to_output_folder -p precision -e execution_provider -c cache_dir_for_hf_files --extra_options config_only=true
-```
-
-Afterwards, please open the `genai_config.json` file in the output folder and modify the fields as needed for your model. You should store your ONNX model in the output folder as well.
-
-### Unit Testing Models
-This scenario is where your PyTorch model is already downloaded locally (either in the default Hugging Face cache directory or in a local folder on disk). If it is not already downloaded locally, here is an example of how you can download it.
-
-```
-from transformers import AutoModelForCausalLM, AutoTokenizer
-
-model_name = "your_model_name"
-cache_dir = "cache_dir_to_save_hf_files"
-
-model = AutoModelForCausalLM.from_pretrained(model_name, cache_dir=cache_dir)
-model.save_pretrained(cache_dir)
-
-tokenizer = AutoTokenizer.from_pretrained(model_name, cache_dir=cache_dir)
-tokenizer.save_pretrained(cache_dir)
-```
-
-#### Option 1: Use the model builder tool directly
-This option is the simplest but it will download another copy of the PyTorch model onto disk to accommodate the change in the number of hidden layers.
-```
-# From wheel:
-python3 -m onnxruntime_genai.models.builder -m model_name -o path_to_output_folder -p precision -e execution_provider --extra_options num_hidden_layers=4
-
-# From source:
-python3 builder.py -m model_name -o path_to_output_folder -p precision -e execution_provider --extra_options num_hidden_layers=4
-```
-
-#### Option 2: Edit the config.json file on disk and then run the model builder tool
-
-1. Navigate to where the PyTorch model and its associated files are saved on disk.
-2. Modify `num_hidden_layers` in `config.json` to your desired target (e.g. 4 layers).
-3. Run the below command for the model builder tool.
-
-```
-# From wheel:
-python3 -m onnxruntime_genai.models.builder -m model_name -o path_to_output_folder -p precision -e execution_provider -c cache_dir_where_hf_files_are_saved
-
-# From source:
-python3 builder.py -m model_name -o path_to_output_folder -p precision -e execution_provider -c cache_dir_where_hf_files_are_saved
-```
Lines changed: 50 additions & 0 deletions
@@ -0,0 +1,50 @@
+---
+title: Past present share buffer
+description: How to configure the past present share buffer using the ONNX Runtime generate() API
+has_children: false
+parent: How to
+grand_parent: Generate API (Preview)
+nav_order: 6
+---
+
+# How to configure the past present share buffer
+
+The past present share buffer is an optimization that can be used to save memory and processing time.
+
+When buffer sharing is used, the past and present KV cache buffers point to the same memory block.
+
+When buffer sharing is not used, the present KV cache buffers are re-allocated before every forward pass of the model, and their contents are copied into the past KV cache buffers.
+
+This is represented in the following diagram.
+
+![Diagram of the past and present KV cache buffers with and without buffer sharing](../../../images/past-present-share-buffer.png)
+
+The size of the KV cache depends on whether buffer sharing is enabled or disabled.
+
+## Size of KV caches
+
+### When past_present_share_buffer is true
+
+Size of past KV caches = size of present KV caches (elements per layer, for each of the key and value caches)
+
+$batch\_size * num\_key\_value\_heads * max\_length * head\_size$
+
+For example, for the [4-bit quantized Phi-4-mini-instruct](https://huggingface.co/microsoft/Phi-4-mini-instruct-onnx) model, with a batch size of 1 and a max length of 4k, the size of the cache is $1 * 8 * 4096 * 128$ = 4,194,304 (about 4M) elements. The memory consumed also scales with the number of layers and the number of bytes per element.
+
+Note that the size of the cache is largely determined by the value of the max_length parameter.
+
+### When past_present_share_buffer is false
+
+Size of past KV caches (elements per layer) = $batch\_size * num\_key\_value\_heads * past\_sequence\_length * head\_size$
+
+Size of present KV caches (elements per layer) = $batch\_size * num\_key\_value\_heads * (past\_sequence\_length + 1) * head\_size$
+
+For example, for the [4-bit quantized DeepSeek R1 Qwen 1.5B](https://huggingface.co/onnxruntime/DeepSeek-R1-Distill-ONNX) model, with a batch size of 1 and a past sequence length of 1k, the size of the past cache is $1 * 2 * 1024 * 128$ = 262,144 (256K) elements and the size of the present cache is $1 * 2 * 1025 * 128$ = 262,400 (about 256K) elements.
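
For readers who want to plug in their own model parameters, the following short Python sketch reproduces the calculations above; the numbers are the ones from the examples and should be replaced with the values from your model's configuration.

```python
# Sketch: per-layer KV cache element counts from the formulas above.
# Replace the example numbers with the values from your model's configuration.

def kv_cache_elements(batch_size: int, num_key_value_heads: int, length: int, head_size: int) -> int:
    """Element count of one KV cache tensor (key or value) for a single layer."""
    return batch_size * num_key_value_heads * length * head_size

# past_present_share_buffer = true: one buffer sized up front by max_length.
shared = kv_cache_elements(batch_size=1, num_key_value_heads=8, length=4096, head_size=128)
print(f"shared buffer: {shared:,} elements")  # 4,194,304

# past_present_share_buffer = false: separate past and present buffers.
past = kv_cache_elements(batch_size=1, num_key_value_heads=2, length=1024, head_size=128)
present = kv_cache_elements(batch_size=1, num_key_value_heads=2, length=1024 + 1, head_size=128)
print(f"past: {past:,} elements, present: {present:,} elements")  # 262,144 and 262,400

# Total memory = elements * number of layers * 2 (key and value) * bytes per element.
```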

docs/genai/index.md

Lines changed: 3 additions & 1 deletion
@@ -13,9 +13,11 @@ Run generative AI models with ONNX Runtime.
 See the source code here: [https://github.com/microsoft/onnxruntime-genai](https://github.com/microsoft/onnxruntime-genai)

-This library provides the generative AI loop for ONNX models, including inference with ONNX Runtime, logits processing, search and sampling, and KV cache management.
+This library provides the generative AI loop for ONNX models, including tokenization and other pre-processing, inference with ONNX Runtime, logits processing, search and sampling, and KV cache management.

 Users can call a high level `generate()` method, or run each iteration of the model in a loop, generating one token at a time, and optionally updating generation parameters inside the loop.

 It has support for greedy/beam search and TopP, TopK sampling to generate token sequences and built-in logits processing like repetition penalties. You can also easily add custom scoring.

+Other supported features include applying chat templates and structured output (for tool calling).
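
The token-by-token loop described above can be sketched with the onnxruntime-genai Python package roughly as follows. Exact class and method names have shifted between releases (older versions set input ids on the generator parameters instead of calling append_tokens), and the model path and prompt are placeholders, so treat this as an illustration rather than a definitive reference.

```python
import onnxruntime_genai as og

# "path/to/model_folder" is a placeholder for a folder containing the ONNX model
# and its genai_config.json, e.g. one produced by the model builder.
model = og.Model("path/to/model_folder")
tokenizer = og.Tokenizer(model)
tokenizer_stream = tokenizer.create_stream()

params = og.GeneratorParams(model)
params.set_search_options(max_length=256)

generator = og.Generator(model, params)
generator.append_tokens(tokenizer.encode("What is ONNX Runtime?"))

# Run one iteration of the model per generated token until the search finishes.
while not generator.is_done():
    generator.generate_next_token()
    new_token = generator.get_next_tokens()[0]
    print(tokenizer_stream.decode(new_token), end="", flush=True)
```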
