
Commit ea331ac

Start sections for C++ and C#
1 parent a7d2f1e commit ea331ac

docs/genai/howto/migrate.md

Lines changed: 61 additions & 9 deletions

grand_parent: Generate API (Preview)
nav_order: 5
---

# Migrate ONNX Runtime generate() API from 0.5.2 to 0.6.0

Learn how to migrate from ONNX Runtime generate() version 0.5.2 to version 0.6.0.

Version 0.6.0 adds support for "chat mode", also known as _continuation_, _continuous decoding_, and _interactive decoding_. With the introduction of chat mode, a breaking API change was made.

In summary, the new API adds an `AppendTokens` method to the `Generator`, which allows for multi-turn conversations. Previously, input was set in `GeneratorParams` prior to the creation of the `Generator`.

Calling `AppendTokens` outside of the conversation loop can be used to implement system prompt caching.
1919

20-
Note: chat mode and system prompt caching are only supported for batch size 1. Furthermore, they are currently supported on CPU, NVIDIA GPUs with the CUDA EP, and all GPUs with the Web GPU native EP. They are not supported on NPU or GPUs running with the DirecML EP. For Q&A mode, the migrations described below *are* required.
20+
Note: chat mode and system prompt caching are only supported for batch size 1. Furthermore, they are currently supported on CPU, NVIDIA GPUs with the CUDA EP, and all GPUs with the Web GPU native EP. They are not supported on NPU or GPUs running with the DirecML EP. For question & answer (Q&A) mode, the migrations described below *are* still required.

## Python

### Migrate Python question and answer (single turn) code to 0.6.0

1. Replace calls to `params.input_ids = input_tokens` with `generator.append_tokens(input_tokens)` after the generator object has been created.
2. Remove calls to `generator.compute_logits()`.
3. If the application has a Q&A loop, delete the `generator` between `append_tokens` calls to reset the state of the model (see the sketch after this list).
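
For orientation, here is a minimal before/after sketch of the single-turn flow. It assumes the `onnxruntime_genai` package; `model_path` and `prompt` are illustrative placeholders:

```python
import onnxruntime_genai as og

# Illustrative setup: model_path and prompt are placeholders
model = og.Model(model_path)
tokenizer = og.Tokenizer(model)
input_tokens = tokenizer.encode(prompt)

params = og.GeneratorParams(model)
# 0.5.2 (old): input was set on the params before the generator existed
# params.input_ids = input_tokens

generator = og.Generator(model, params)
# 0.6.0 (new): append tokens after the generator is created; no compute_logits()
generator.append_tokens(input_tokens)

while not generator.is_done():
    generator.generate_next_token()
```

In a Q&A loop, delete the `generator` and construct a new one between turns to reset the model state, as described in step 3.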

### Add system prompt caching to Python applications

1. Create and tokenize the system prompt and call `generator.append_tokens(system_tokens)`. This call can be done before the user is asked for their prompt.

```python
system_tokens = tokenizer.encode(system_prompt)
generator.append_tokens(system_tokens)
```

### Add chat mode to Python applications

1. Create a loop in your application, and call `generator.append_tokens(input_tokens)` with the tokenized prompt every time the user provides new input:

```python
while True:
    text = input("Prompt: ")
    input_tokens = tokenizer.encode(text)
    generator.append_tokens(input_tokens)

    while not generator.is_done():
        generator.generate_next_token()
        new_token = generator.get_next_tokens()[0]
        print(tokenizer_stream.decode(new_token), end="", flush=True)
    print()
```

## C++

### Migrate C++ question and answer (single turn) code to 0.6.0

1. Replace calls to `params->SetInputSequences(*sequences)` with `generator->AppendTokenSequences(*sequences)` (see the sketch after this list).
2. Remove calls to `generator->ComputeLogits()`.
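
For orientation, a minimal before/after sketch of the single-turn change, assuming `model`, `params`, `tokenizer`, and a `prompt` string already exist (names are illustrative):

```c++
// Tokenize the prompt
auto sequences = OgaSequences::Create();
tokenizer->Encode(prompt.c_str(), *sequences);

// 0.5.2 (old): input was set on the params before the generator existed
// params->SetInputSequences(*sequences);

auto generator = OgaGenerator::Create(*model, *params);

// 0.6.0 (new): append tokens after the generator is created; no ComputeLogits()
generator->AppendTokenSequences(*sequences);

while (!generator->IsDone()) {
  generator->GenerateNextToken();
}
```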

### Add system prompt caching to C++ applications

1. Create and tokenize the system prompt and call `generator->AppendTokenSequences(*sequences)`. This call can be done before the user is asked for their prompt.

```c++
auto sequences = OgaSequences::Create();
tokenizer->Encode(system_prompt.c_str(), *sequences);
generator->AppendTokenSequences(*sequences);
```

### Add chat mode to C++ applications

1. Add a chat loop to your application:
```c++
std::cout << "Generating response..." << std::endl;
auto params = OgaGeneratorParams::Create(*model);
params->SetSearchOption("max_length", 1024);

auto generator = OgaGenerator::Create(*model, *params);

while (true) {
  std::string prompt;
  std::cout << "Prompt: " << std::endl;
  std::getline(std::cin, prompt);

  auto sequences = OgaSequences::Create();
  tokenizer->Encode(prompt.c_str(), *sequences);

  generator->AppendTokenSequences(*sequences);

  while (!generator->IsDone()) {
    generator->GenerateNextToken();

    // Stream the newest token in the sequence as it is generated
    const auto num_tokens = generator->GetSequenceCount(0);
    const auto new_token = generator->GetSequenceData(0)[num_tokens - 1];
    std::cout << tokenizer_stream->Decode(new_token) << std::flush;
  }
  std::cout << std::endl;
}
```

## C#

### Migrate C# question and answer (single turn) code to 0.6.0

1. Replace calls to `generatorParams.SetInputSequences(sequences)` with `generator.AppendTokenSequences(sequences)` (see the sketch after this list).
2. Remove calls to `generator.ComputeLogits()`.
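
For orientation, a minimal before/after sketch, assuming the Microsoft.ML.OnnxRuntimeGenAI package with `model`, `generatorParams`, `tokenizer`, and a `prompt` string already created (names are illustrative):

```csharp
// Tokenize the prompt
var sequences = tokenizer.Encode(prompt);

// 0.5.2 (old): input was set on the params before the generator existed
// generatorParams.SetInputSequences(sequences);

using var generator = new Generator(model, generatorParams);

// 0.6.0 (new): append tokens after the generator is created; no ComputeLogits()
generator.AppendTokenSequences(sequences);

while (!generator.IsDone())
{
    generator.GenerateNextToken();
}
```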

### Add system prompt caching to C# applications

1. Create and tokenize the system prompt and call `generator.AppendTokenSequences()`. This call can be done before the user is asked for their prompt.

```csharp
// Tokenize the system prompt and append it before the first user prompt
var sequences = tokenizer.Encode(systemPrompt);
generator.AppendTokenSequences(sequences);
```

## Java

73125
### Migrate Java question and answer (single turn) code to 0.6.0
