Commit d72c507

Address another round of review comments
1 parent a7c9ccf commit d72c507

File tree

3 files changed: 11 additions & 158 deletions


docs/genai/howto/build-model.md

Lines changed: 1 addition & 153 deletions
@@ -8,158 +8,6 @@ nav_order: 3
 ---
 
 # Generate models using Model Builder
-{: .no_toc }
 
-* TOC placeholder
-{:toc}
+Refer to [model builder guide](https://github.com/microsoft/onnxruntime-genai/blob/main/src/python/py/models/README.md) for the latest documentation.
 
-The model builder greatly accelerates creating optimized and quantized ONNX models that run with the ONNX Runtime generate() API.
-
-## Current Support
-The tool currently supports the following model architectures.
-
-- Gemma
-- LLaMA
-- Mistral
-- Phi
-
-## Installation
-
-Model builder is available as an [Olive](https://github.com/microsoft/olive) pass. It is also shipped as part of the onnxruntime-genai Python package. You can also download and run it standalone.
-
-In any case, you need to have the following packages installed.
-
-```bash
-pip install torch transformers onnx onnxruntime
-```
-
-### Install from package
-
-```bash
-pip install --pre onnxruntime-genai
-```
-
-#### Direct download
-
-```bash
-curl https://raw.githubusercontent.com/microsoft/onnxruntime-genai/main/src/python/py/models/builder.py -o builder.py
-```
-
-### Usage
-
-For all available options, please use the `-h/--help` flag.
-
-```bash
-# From wheel:
-python3 -m onnxruntime_genai.models.builder --help
-
-# From source:
-python3 builder.py --help
-```
-
-### Original PyTorch Model from HuggingFace
-
-This scenario is where your PyTorch model is not downloaded locally (either in the default Hugging Face cache directory or in a local folder on disk).
-
-```bash
-
-# From wheel:
-python3 -m onnxruntime_genai.models.builder -m model_name -o path_to_output_folder -p precision -e execution_provider -c cache_dir_to_save_hf_files
-
-# From source:
-python3 builder.py -m model_name -o path_to_output_folder -p precision -e execution_provider -c cache_dir_to_save_hf_files
-```
-
-### Original PyTorch Model from Disk
-
-This scenario is where your PyTorch model is already downloaded locally (either in the default Hugging Face cache directory or in a local folder on disk).
-```
-# From wheel:
-python3 -m onnxruntime_genai.models.builder -m model_name -o path_to_output_folder -p precision -e execution_provider -c cache_dir_where_hf_files_are_saved
-
-# From source:
-python3 builder.py -m model_name -o path_to_output_folder -p precision -e execution_provider -c cache_dir_where_hf_files_are_saved
-```
-
-### Customized or Finetuned PyTorch Model
-This scenario is where your PyTorch model has been customized or finetuned for one of the currently supported model architectures and your model can be loaded in Hugging Face.
-```
-# From wheel:
-python3 -m onnxruntime_genai.models.builder -i path_to_local_folder_on_disk -o path_to_output_folder -p precision -e execution_provider
-
-# From source:
-python3 builder.py -i path_to_local_folder_on_disk -o path_to_output_folder -p precision -e execution_provider
-```
-
-### GGUF Model
-This scenario is where your float16/float32 GGUF model is already on disk.
-```
-# From wheel:
-python3 -m onnxruntime_genai.models.builder -m model_name -i path_to_gguf_file -o path_to_output_folder -p precision -e execution_provider -c cache_dir_for_hf_files
-
-# From source:
-python3 builder.py -m model_name -i path_to_gguf_file -o path_to_output_folder -p precision -e execution_provider -c cache_dir_for_hf_files
-```
-
-### Extra Options
-This scenario is for when you want to have control over some specific settings. The below example shows how you can pass key-value arguments to `--extra_options`.
-```
-# From wheel:
-python3 -m onnxruntime_genai.models.builder -m model_name -o path_to_output_folder -p precision -e execution_provider -c cache_dir_for_hf_files --extra_options filename=decoder.onnx
-
-# From source:
-python3 builder.py -m model_name -o path_to_output_folder -p precision -e execution_provider -c cache_dir_for_hf_files --extra_options filename=decoder.onnx
-```
-To see all available options through `--extra_options`, please use the `help` commands in the `Full Usage` section above.
-
-### Config Only
-This scenario is for when you already have your optimized and/or quantized ONNX model and you need to create the config files to run with ONNX Runtime generate() API.
-```
-# From wheel:
-python3 -m onnxruntime_genai.models.builder -m model_name -o path_to_output_folder -p precision -e execution_provider -c cache_dir_for_hf_files --extra_options config_only=true
-
-# From source:
-python3 builder.py -m model_name -o path_to_output_folder -p precision -e execution_provider -c cache_dir_for_hf_files --extra_options config_only=true
-```
-
-Afterwards, please open the `genai_config.json` file in the output folder and modify the fields as needed for your model. You should store your ONNX model in the output folder as well.
-
-### Unit Testing Models
-This scenario is where your PyTorch model is already downloaded locally (either in the default Hugging Face cache directory or in a local folder on disk). If it is not already downloaded locally, here is an example of how you can download it.
-
-```
-from transformers import AutoModelForCausalLM, AutoTokenizer
-
-model_name = "your_model_name"
-cache_dir = "cache_dir_to_save_hf_files"
-
-model = AutoModelForCausalLM.from_pretrained(model_name, cache_dir=cache_dir)
-model.save_pretrained(cache_dir)
-
-tokenizer = AutoTokenizer.from_pretrained(model_name, cache_dir=cache_dir)
-tokenizer.save_pretrained(cache_dir)
-```
-
-#### Option 1: Use the model builder tool directly
-This option is the simplest but it will download another copy of the PyTorch model onto disk to accommodate the change in the number of hidden layers.
-```
-# From wheel:
-python3 -m onnxruntime_genai.models.builder -m model_name -o path_to_output_folder -p precision -e execution_provider --extra_options num_hidden_layers=4
-
-# From source:
-python3 builder.py -m model_name -o path_to_output_folder -p precision -e execution_provider --extra_options num_hidden_layers=4
-```
-
-#### Option 2: Edit the config.json file on disk and then run the model builder tool
-
-1. Navigate to where the PyTorch model and its associated files are saved on disk.
-2. Modify `num_hidden_layers` in `config.json` to your desired target (e.g. 4 layers).
-3. Run the below command for the model builder tool.
-
-```
-# From wheel:
-python3 -m onnxruntime_genai.models.builder -m model_name -o path_to_output_folder -p precision -e execution_provider -c cache_dir_where_hf_files_are_saved
-
-# From source:
-python3 builder.py -m model_name -o path_to_output_folder -p precision -e execution_provider -c cache_dir_where_hf_files_are_saved
-```
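
The builder's command-line interface is unchanged by this commit; only its documentation moved. As a minimal sketch, the "From wheel" invocation above could be driven from Python like this. The model name, paths, precision, and execution provider below are placeholders, not defaults:

```python
import subprocess
import sys

# Sketch of the "From wheel" command documented above. The flags follow the
# deleted docs; the model name, paths, and precision/provider values are
# placeholders to substitute with your own.
subprocess.run(
    [
        sys.executable, "-m", "onnxruntime_genai.models.builder",
        "-m", "microsoft/phi-2",  # Hugging Face model name (placeholder)
        "-o", "./phi2-onnx",      # output folder for the ONNX model
        "-p", "int4",             # precision
        "-e", "cpu",              # execution provider
        "-c", "./hf_cache",       # cache dir to save Hugging Face files
    ],
    check=True,
)
```

Running the module through `sys.executable` keeps the builder pinned to the same Python environment the onnxruntime-genai wheel was installed into.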

docs/genai/index.md

Lines changed: 3 additions & 1 deletion
@@ -13,9 +13,11 @@ Run generative AI models with ONNX Runtime.
 
 See the source code here: [https://github.com/microsoft/onnxruntime-genai](https://github.com/microsoft/onnxruntime-genai)
 
-This library provides the generative AI loop for ONNX models, including inference with ONNX Runtime, logits processing, search and sampling, and KV cache management.
+This library provides the generative AI loop for ONNX models, including tokenization and other pre-processing, inference with ONNX Runtime, logits processing, search and sampling, and KV cache management.
 
 Users can call a high level `generate()` method, or run each iteration of the model in a loop, generating one token at a time, and optionally updating generation parameters inside the loop.
 
 It has support for greedy/beam search and TopP, TopK sampling to generate token sequences and built-in logits processing like repetition penalties. You can also easily add custom scoring.
 
+Other supported features include applying chat templates and structured output (for tool calling).
+
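
As a sketch, the token-by-token loop described in this page looks roughly like the following with the onnxruntime-genai Python bindings. The model folder is a placeholder (an output of the model builder), and exact API names can vary between package versions:

```python
import onnxruntime_genai as og

# Load a model folder produced by the model builder (placeholder path).
model = og.Model("./phi2-onnx")
tokenizer = og.Tokenizer(model)
stream = tokenizer.create_stream()

params = og.GeneratorParams(model)
params.set_search_options(max_length=256, do_sample=True, top_p=0.9)

generator = og.Generator(model, params)
generator.append_tokens(tokenizer.encode("What is ONNX Runtime?"))

# One token per iteration; this is the loop that the high level
# generate() method wraps, and where search parameters could be updated.
while not generator.is_done():
    generator.generate_next_token()
    print(stream.decode(generator.get_next_tokens()[0]), end="", flush=True)
```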

docs/genai/reference/config.md

Lines changed: 7 additions & 4 deletions
@@ -125,14 +125,16 @@ Describes the model architecture, files, and tokenization.
 - qwen2
 - qwen3
 
-For decoder only LLMS that are split into a pipeline of models, use "decoder-pipeline".
+For decoder-only LLMs that are split into a pipeline of models, use "decoder-pipeline".
 
-Other model types:
+For encoder-decoder models:
 - whisper
+- marian-ssru
+
+For multi-modal model types:
 - phi3v
 - phi4mm
 - gemma3
-- marian-ssru
 
 - **pad_token_id**: *(int)*
   The id of the padding token.
@@ -550,12 +552,13 @@ Describes the generation/search parameters.
 1. **Beam search**
    - `num_beams > 1`
    - `do_sample = false`
+   - `past_present_share_buffer = false`
 
 2. **Greedy search**
    - `num_beams = 1`
    - `do_sample = false`
 
-3. **Top P / Top K**
+3. **Random sampling with Top P / Top K**
    - `do_sample = true`
 
 ---
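
These three modes map onto search options that can also be set at runtime. A sketch with the Python bindings, assuming `set_search_options` accepts the same names as the config's search section (the model folder is a placeholder):

```python
import onnxruntime_genai as og

model = og.Model("./phi2-onnx")  # placeholder model folder

# 1. Beam search: num_beams > 1, do_sample = false,
#    past_present_share_buffer = false
beam = og.GeneratorParams(model)
beam.set_search_options(num_beams=4, do_sample=False,
                        past_present_share_buffer=False)

# 2. Greedy search: num_beams = 1, do_sample = false
greedy = og.GeneratorParams(model)
greedy.set_search_options(num_beams=1, do_sample=False)

# 3. Random sampling with Top P / Top K: do_sample = true
sampling = og.GeneratorParams(model)
sampling.set_search_options(do_sample=True, top_p=0.9, top_k=50)
```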
