
Commit 375243e

Add migration guide
1 parent c688eff commit 375243e

File tree

1 file changed: +76 −0 lines changed

docs/genai/howto/migrate.md

Lines changed: 76 additions & 0 deletions
@@ -0,0 +1,76 @@
---
title: Migrate
description: Learn how to migrate from one version of the ONNX Runtime generate() API to another when there are breaking API changes
has_children: false
parent: How to
grand_parent: Generate API (Preview)
nav_order: 5
---

# Migrate the ONNX Runtime generate() API from 0.5.2 to 0.6.0

Learn how to migrate from ONNX Runtime generate() version 0.5.2 to version 0.6.0.

Version 0.6.0 adds support for "chat mode", also known as _continuation_, _continuous decoding_, and _interactive decoding_. The introduction of chat mode necessitated a breaking change to the API.

In summary, the new API adds `AppendTokens`, which allows turn-taking in a conversation. Previously, inputs were provided once through `SetInputs`.

Calling `AppendTokens` outside of the generation loop also enables system prompt caching.

Note: chat mode and system prompt caching are only supported when running on CPU, on NVIDIA GPUs with the CUDA EP, and on all GPUs with the WebGPU native EP. They are not supported on NPUs or on GPUs running with the DirectML EP. For Q&A (single turn) mode, however, the migrations described below *are* required on all platforms.

## Python

### Migrate Python question and answer (single turn) code to 0.6.0

1. Replace assignments of the form `params.input_ids = input_tokens` with a call to `generator.append_tokens(input_tokens)` after the generator object has been created, as shown in the sketch after this list.
2. Remove calls to `generator.compute_logits()`.
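
A minimal before/after sketch of the migration. The model path and the `prompt` variable are placeholders carried over from typical 0.5.2 question and answer code:

```python
import onnxruntime_genai as og

model = og.Model("path/to/model")        # placeholder model folder
tokenizer = og.Tokenizer(model)
input_tokens = tokenizer.encode(prompt)  # `prompt` supplied by your application

params = og.GeneratorParams(model)
# 0.5.2: params.input_ids = input_tokens        <- remove this assignment

generator = og.Generator(model, params)
generator.append_tokens(input_tokens)    # 0.6.0: append tokens after creating the generator

while not generator.is_done():
    # 0.5.2: generator.compute_logits()         <- remove this call
    generator.generate_next_token()
```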

### Add system prompt caching

1. Create and tokenize the system prompt, and call `generator.append_tokens(system_tokens)`. This call can be done before the user is asked for their prompt.

```python
system_tokens = tokenizer.encode(system_prompt)
generator.append_tokens(system_tokens)
```

### Add chat mode

1. Create a loop in your application, and call `generator.append_tokens(input_tokens)` every time the user provides new input:

```python
while True:
    user_input = input("Input: ")
    input_tokens = tokenizer.encode(user_input)
    generator.append_tokens(input_tokens)

    try:
        while not generator.is_done():
            generator.generate_next_token()

            new_token = generator.get_next_tokens()[0]
            print(tokenizer_stream.decode(new_token), end='', flush=True)
    except KeyboardInterrupt:
        # Allow Ctrl+C to stop generation for this turn and prompt again
        print()
```

## C/C++

### Migrate C/C++ question and answer (single turn) code to 0.6.0

1. Replace calls to `params->SetInputSequences(*sequences)` with `generator->AppendTokenSequences(*sequences)`, as shown in the sketch after this list.
2. Remove calls to `generator->ComputeLogits()`.
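
A minimal before/after sketch, assuming the `OgaGenerator` C++ wrapper and the `model`, `params`, and `sequences` objects from typical 0.5.2 code; only the calls named above are the migration itself:

```cpp
// 0.5.2: params->SetInputSequences(*sequences);   <- remove this call

auto generator = OgaGenerator::Create(*model, *params);
generator->AppendTokenSequences(*sequences);  // 0.6.0: append tokens on the generator

while (!generator->IsDone()) {
  // 0.5.2: generator->ComputeLogits();            <- remove this call
  generator->GenerateNextToken();
}
```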

## C#

### Migrate C# question and answer (single turn) code to 0.6.0

1. Replace calls to `generatorParams.SetInputSequences(sequences)` with `generator.AppendTokenSequences(sequences)`, as shown in the sketch after this list.
2. Remove calls to `generator.ComputeLogits()`.
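
A minimal before/after sketch, assuming the `Generator` class and the `model`, `generatorParams`, and `sequences` objects from typical 0.5.2 code:

```csharp
// 0.5.2: generatorParams.SetInputSequences(sequences);   <- remove this call

using var generator = new Generator(model, generatorParams);
generator.AppendTokenSequences(sequences);  // 0.6.0: append tokens on the generator

while (!generator.IsDone())
{
    // 0.5.2: generator.ComputeLogits();                  <- remove this call
    generator.GenerateNextToken();
}
```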

## Java

### Migrate Java question and answer (single turn) code to 0.6.0

1. Replace calls to `GeneratorParams::setInput(sequences)` with `Generator::AppendTokenSequences(sequences)`, as shown in the sketch after this list.
2. Remove calls to `Generator::ComputeLogits()`.
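
A minimal before/after sketch, assuming the `Generator` class and the `model`, `generatorParams`, and `sequences` objects from typical 0.5.2 code (the camelCase method names follow Java convention and are assumptions):

```java
// 0.5.2: generatorParams.setInput(sequences);   <- remove this call

try (Generator generator = new Generator(model, generatorParams)) {
    generator.appendTokenSequences(sequences);  // 0.6.0: append tokens on the generator

    while (!generator.isDone()) {
        // 0.5.2: generator.computeLogits();     <- remove this call
        generator.generateNextToken();
    }
}
```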
