---
grand_parent: Generate API (Preview)
nav_order: 5
---

# Migrate ONNX Runtime generate() API from 0.5.2 to 0.6.0

Learn how to migrate from ONNX Runtime generate() version 0.5.2 to version 0.6.0.

Version 0.6.0 adds support for "chat mode", also known as _continuation_, _continuous decoding_, and _interactive decoding_. The introduction of chat mode required a breaking change to the API.

In summary, the new API adds an `AppendTokens` method to the `Generator`, which allows for multi-turn conversations. Previously, input was set in `GeneratorParams` prior to the creation of the `Generator`.

`AppendTokens` can also be called outside of the conversation loop to implement system prompt caching.

Note: chat mode and system prompt caching are only supported for batch size 1. Furthermore, they are currently supported on CPU, NVIDIA GPUs with the CUDA EP, and all GPUs with the WebGPU native EP. They are not supported on NPUs or on GPUs running with the DirectML EP. For question and answer (Q&A) mode, the migrations described below *are* still required.

## Python

### Migrate Python question and answer (single turn) code to 0.6.0

1. Replace calls to `params.input_ids = input_tokens` with `generator.append_tokens(input_tokens)` after the generator object has been created.
2. Remove calls to `generator.compute_logits()`
3. If the application has a Q&A loop, delete the `generator` between `append_tokens` calls to reset the state of the model, as in the sketch after this list.
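
A minimal sketch of these steps, assuming the `onnxruntime_genai` package imported as `og`; the model path and prompt are illustrative:

```python
import onnxruntime_genai as og

model = og.Model("path/to/model")  # illustrative model path
tokenizer = og.Tokenizer(model)
params = og.GeneratorParams(model)

# 0.5.2 set the input here, before creating the generator:
#   params.input_ids = input_tokens
generator = og.Generator(model, params)

# 0.6.0 appends the input after the generator has been created.
input_tokens = tokenizer.encode("What is the capital of France?")
generator.append_tokens(input_tokens)

# No generator.compute_logits() call is needed in 0.6.0.
while not generator.is_done():
    generator.generate_next_token()

print(tokenizer.decode(generator.get_sequence(0)))

# In a Q&A loop, delete the generator between turns to reset model state.
del generator
```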

### Add system prompt caching to Python applications

1. Create and tokenize the system prompt and call `generator.append_tokens(system_tokens)`. This call can be done before the user is asked for their prompt.

```python
generator.append_tokens(system_tokens)
```
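
In context, the full step might look like the following sketch; the system prompt text is illustrative:

```python
# Tokenize and append the system prompt once, before any user input arrives,
# so it is processed a single time and reused across subsequent turns.
system_prompt = "You are a helpful assistant."  # illustrative
system_tokens = tokenizer.encode(system_prompt)
generator.append_tokens(system_tokens)
```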

### Add chat mode to Python applications

1. Create a loop in your application, and call `generator.append_tokens(prompt)` every time the user provides new input, as in the sketch below.
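
A minimal sketch of such a loop, assuming the 0.6.0 Python API (`append_tokens`, `is_done`, `generate_next_token`, `get_next_tokens`) and a streaming tokenizer created with `tokenizer.create_stream()`:

```python
tokenizer_stream = tokenizer.create_stream()

while True:
    text = input("Prompt (or 'quit' to exit): ")
    if text == "quit":
        break

    # Append the new user input to the generator's existing state.
    generator.append_tokens(tokenizer.encode(text))

    # Generate and stream the response token by token.
    while not generator.is_done():
        generator.generate_next_token()
        new_token = generator.get_next_tokens()[0]
        print(tokenizer_stream.decode(new_token), end="", flush=True)
    print()
```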

## C++

### Migrate C++ question and answer (single turn) code to 0.6.0

1. Replace calls to `params->SetInputSequences(*sequences)` with `generator->AppendTokenSequences(*sequences)`, as shown in the sketch after this list
2. Remove calls to `generator->ComputeLogits()`
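
A sketch of the migrated flow, assuming the C++ wrapper types from `ort_genai.h` (`OgaModel`, `OgaTokenizer`, `OgaSequences`, `OgaGeneratorParams`, `OgaGenerator`); the model path and prompt are illustrative:

```cpp
#include "ort_genai.h"

int main() {
  auto model = OgaModel::Create("path/to/model");  // illustrative path
  auto tokenizer = OgaTokenizer::Create(*model);

  auto sequences = OgaSequences::Create();
  tokenizer->Encode("What is the capital of France?", *sequences);

  auto params = OgaGeneratorParams::Create(*model);
  // 0.5.2 called params->SetInputSequences(*sequences) here.
  auto generator = OgaGenerator::Create(*model, *params);

  // 0.6.0 appends the input after the generator has been created;
  // the generator->ComputeLogits() call is removed.
  generator->AppendTokenSequences(*sequences);

  while (!generator->IsDone()) {
    generator->GenerateNextToken();
  }
  return 0;
}
```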

### Add system prompt caching to C++ applications

1. Create and tokenize the system prompt and call `generator->AppendTokenSequences(*sequences)`. This call can be done before the user is asked for their prompt, as in the sketch below.
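
For example, a minimal sketch (the system prompt text is illustrative):

```cpp
// Tokenize and append the system prompt once, before the conversation starts,
// so it is processed a single time and reused across subsequent turns.
auto system_sequences = OgaSequences::Create();
tokenizer->Encode("You are a helpful assistant.", *system_sequences);  // illustrative
generator->AppendTokenSequences(*system_sequences);
```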

## C#

### Migrate C# question and answer (single turn) code to 0.6.0

1. Replace calls to `generatorParams.SetInputSequences(sequences)` with `generator.AppendTokenSequences(sequences)`, as shown in the sketch after this list
2. Remove calls to `generator.ComputeLogits()`
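
A sketch of the migrated flow, assuming the `Microsoft.ML.OnnxRuntimeGenAI` types (`Model`, `Tokenizer`, `GeneratorParams`, `Generator`); the model path and prompt are illustrative:

```csharp
using Microsoft.ML.OnnxRuntimeGenAI;

using var model = new Model("path/to/model");  // illustrative path
using var tokenizer = new Tokenizer(model);
using var generatorParams = new GeneratorParams(model);

// 0.5.2 called generatorParams.SetInputSequences(sequences) here.
using var generator = new Generator(model, generatorParams);

// 0.6.0 appends the input after the generator has been created;
// the generator.ComputeLogits() call is removed.
using var sequences = tokenizer.Encode("What is the capital of France?");
generator.AppendTokenSequences(sequences);

while (!generator.IsDone())
{
    generator.GenerateNextToken();
}
```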

### Add system prompt caching to C# applications

1. Create and tokenize the system prompt and call `generator.AppendTokenSequences(sequences)`. This call can be done before the user is asked for their prompt, as in the sketch below.
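
For example, a minimal sketch (the system prompt text is illustrative):

```csharp
// Tokenize and append the system prompt once, before asking the user for
// their prompt, so it is processed a single time and reused across turns.
using var systemSequences = tokenizer.Encode("You are a helpful assistant.");  // illustrative
generator.AppendTokenSequences(systemSequences);
```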