---
title: Migrate
description: Learn how to migrate from one version of the ONNX Runtime generate() API to another when there are breaking API changes
has_children: false
parent: How to
grand_parent: Generate API (Preview)
nav_order: 5
---

# Migrate ONNX Runtime generate() API from 0.5.2 to 0.6.0

Learn how to migrate from ONNX Runtime generate() API version 0.5.2 to version 0.6.0.

Version 0.6.0 adds support for "chat mode", also known as _continuation_, _continuous decoding_, and _interactive decoding_. With the introduction of chat mode, a breaking API change was made.

In summary, the new API adds an `AppendTokens` method to the `Generator`, which allows for multi-turn conversations. Previously, input was set in `GeneratorParams` prior to the creation of the `Generator`.

Calling `AppendTokens` outside of the conversation loop can be used to implement system prompt caching.

Note: chat mode and system prompt caching are only supported for batch size 1. Furthermore, they are currently supported on CPU, on NVIDIA GPUs with the CUDA EP, and on all GPUs with the WebGPU native EP. They are not supported on NPUs or on GPUs running with the DirectML EP. For question & answer (Q&A) mode, the migrations described below *are* still required.

## Python

### Migrate Python question and answer (single turn) code to 0.6.0

1. Replace calls to `params.input_ids = input_tokens` with `generator.append_tokens(input_tokens)` after the generator object has been created (see the before/after sketch below).
2. Remove calls to `generator.compute_logits()`.
3. If the application has a Q&A loop, delete the `generator` between `append_tokens` calls to reset the state of the model.
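
Taken together, the change looks like the following sketch. This is illustrative only; it assumes `og` is the imported `onnxruntime_genai` module and that `model` and `input_tokens` were created as in the examples below.

Before (0.5.2):

```python
# Input tokens were set on GeneratorParams before the Generator existed
params = og.GeneratorParams(model)
params.input_ids = input_tokens
generator = og.Generator(model, params)

while not generator.is_done():
    generator.compute_logits()
    generator.generate_next_token()
```

After (0.6.0):

```python
params = og.GeneratorParams(model)
generator = og.Generator(model, params)

# Input is now appended after the generator is created
generator.append_tokens(input_tokens)

while not generator.is_done():
    # compute_logits() is no longer called
    generator.generate_next_token()
```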

### Add system prompt caching to Python applications

1. Create and tokenize the system prompt and call `generator.append_tokens(system_tokens)`. This call can be done before the user is asked for their prompt.

```python
system_tokens = tokenizer.encode(system_prompt)
generator.append_tokens(system_tokens)
```

### Add chat mode to Python applications

1. Create a loop in your application, and call `generator.append_tokens(input_tokens)` every time the user provides new input:

```python
while True:
    user_input = input("Input: ")
    input_tokens = tokenizer.encode(user_input)
    generator.append_tokens(input_tokens)

    # Allow the user to interrupt generation with Ctrl+C
    try:
        while not generator.is_done():
            generator.generate_next_token()

            new_token = generator.get_next_tokens()[0]
            print(tokenizer_stream.decode(new_token), end='', flush=True)
    except KeyboardInterrupt:
        print()
```

## C++

### Migrate C++ question and answer (single turn) code to 0.6.0

1. Replace calls to `params->SetInputSequences(*sequences)` with `generator->AppendTokenSequences(*sequences)` (see the before/after sketch below).
2. Remove calls to `generator->ComputeLogits()`.
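
Taken together, the change looks like the following sketch. This is illustrative only; it assumes `model` and `sequences` were created as in the examples below.

Before (0.5.2):

```c++
// Input sequences were set on the params before the generator existed
auto params = OgaGeneratorParams::Create(*model);
params->SetInputSequences(*sequences);
auto generator = OgaGenerator::Create(*model, *params);

while (!generator->IsDone()) {
  generator->ComputeLogits();
  generator->GenerateNextToken();
}
```

After (0.6.0):

```c++
auto params = OgaGeneratorParams::Create(*model);
auto generator = OgaGenerator::Create(*model, *params);

// Input is now appended after the generator is created
generator->AppendTokenSequences(*sequences);

while (!generator->IsDone()) {
  // ComputeLogits is no longer called
  generator->GenerateNextToken();
}
```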

### Add system prompt caching to C++ applications

1. Create and tokenize the system prompt and call `generator->AppendTokenSequences(*sequences)`. This call can be done before the user is asked for their prompt.

```c++
auto sequences = OgaSequences::Create();
tokenizer->Encode(system_prompt.c_str(), *sequences);
generator->AppendTokenSequences(*sequences);
```

### Add chat mode to your C++ application

1. Add a chat loop to your application

```c++
std::cout << "Generating response..." << std::endl;
auto params = OgaGeneratorParams::Create(*model);
params->SetSearchOption("max_length", 1024);

auto generator = OgaGenerator::Create(*model, *params);

while (true) {
  std::string prompt;
  std::cout << "Prompt: " << std::endl;
  std::getline(std::cin, prompt);

  auto sequences = OgaSequences::Create();
  tokenizer->Encode(prompt.c_str(), *sequences);

  generator->AppendTokenSequences(*sequences);

  while (!generator->IsDone()) {
    generator->GenerateNextToken();

    const auto num_tokens = generator->GetSequenceCount(0);
    const auto new_token = generator->GetSequenceData(0)[num_tokens - 1];
    std::cout << tokenizer_stream->Decode(new_token) << std::flush;
  }
}
```

## C#

### Migrate C# question and answer (single turn) code to 0.6.0

1. Replace calls to `generatorParams.SetInputSequences(sequences)` with `generator.AppendTokenSequences(sequences)` (see the before/after sketch below).
2. Remove calls to `generator.ComputeLogits()`.
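
Taken together, the change looks like the following sketch. This is illustrative only; it assumes `model` and `sequences` were created as in the examples below.

Before (0.5.2):

```csharp
// Input sequences were set on GeneratorParams before the Generator existed
using var generatorParams = new GeneratorParams(model);
generatorParams.SetInputSequences(sequences);
using var generator = new Generator(model, generatorParams);

while (!generator.IsDone())
{
    generator.ComputeLogits();
    generator.GenerateNextToken();
}
```

After (0.6.0):

```csharp
using var generatorParams = new GeneratorParams(model);
using var generator = new Generator(model, generatorParams);

// Input is now appended after the generator is created
generator.AppendTokenSequences(sequences);

while (!generator.IsDone())
{
    // ComputeLogits is no longer called
    generator.GenerateNextToken();
}
```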

### Add system prompt caching to your C# application

1. Create and tokenize the system prompt and call `generator.AppendTokenSequences()`. This call can be done before the user is asked for their prompt.

```csharp
var systemPrompt = "...";
var sequences = tokenizer.Encode(systemPrompt);
generator.AppendTokenSequences(sequences);
```

### Add chat mode to your C# application

1. Add a chat loop to your application

```csharp
using var tokenizerStream = tokenizer.CreateStream();
using var generator = new Generator(model, generatorParams);
Console.WriteLine("Prompt:");
var prompt = Console.ReadLine();

// Example Phi-3 template
var sequences = tokenizer.Encode($"<|user|>{prompt}<|end|><|assistant|>");

do
{
    generator.AppendTokenSequences(sequences);
    var watch = System.Diagnostics.Stopwatch.StartNew();
    while (!generator.IsDone())
    {
        generator.GenerateNextToken();
        Console.Write(tokenizerStream.Decode(generator.GetSequence(0)[^1]));
    }
    Console.WriteLine();
    watch.Stop();
    var runTimeInSeconds = watch.Elapsed.TotalSeconds;
    var outputSequence = generator.GetSequence(0);
    var totalTokens = outputSequence.Length;
    Console.WriteLine($"Streaming Tokens: {totalTokens} Time: {runTimeInSeconds:0.00} Tokens per second: {totalTokens / runTimeInSeconds:0.00}");
    Console.WriteLine("Next prompt:");
    prompt = Console.ReadLine();
    sequences = tokenizer.Encode($"<|user|>{prompt}<|end|><|assistant|>");
} while (prompt != null);
```

## Java

_Coming soon_
