| 1 | +--- |
| 2 | +title: Migrate |
| 3 | +description: Learn how to migrate from one version of the ONNX Runtime generate() API to another when there are breaking API changes |
| 4 | +has_children: false |
| 5 | +parent: How to |
| 6 | +grand_parent: Generate API (Preview) |
| 7 | +nav_order: 5 |
| 8 | +--- |
| 9 | + |
| 10 | +# Migrate ONNX Runtime generate() API from 0.5.2 to 0.6.0 |
| 11 | + |
| 12 | +Learn how to migrate from ONNX Runtime generate() version 0.5.2 to version 0.6.0. |
| 13 | + |
| 14 | +Version 0.6.0 adds support for "chat mode", also known as _continuation_, _continuous decoding_, and _interactive decoding_. Introducing chat mode required a breaking change to the API. |
| 15 | + |
| 16 | +In summary, the new API adds `AppendTokens`, which allows turn-taking in the conversation. Previously, there was only a simple `SetInputs` API. |
| 17 | + |
| 18 | +Calling `AppendTokens` outside of the conversation loop also adds support for system prompt caching. |
| 19 | + |
| 20 | +Note: chat mode and system prompt caching are only supported when running on CPU, NVIDIA GPUs with the CUDA EP, and all GPUs with the WebGPU native EP. They are not supported on NPU or on GPUs running with the DirectML EP. For Q&A mode, the migrations described below *are* required. |
| 21 | + |
| 22 | +## Python |
| 23 | + |
| 24 | +### Migrate Python question and answer (single turn) code to 0.6.0 |
| 25 | + |
| 26 | +1. Replace calls to `params.input_ids = input_tokens` with `generator.append_tokens(input_tokens)` after the generator object has been created. |
| 27 | +2. Remove calls to `generator.compute_logits()` |
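
The two steps above can be sketched as a single-turn helper. This is a minimal sketch rather than the library's reference code: `generate_single_turn` is a hypothetical name, and `generator`, `tokenizer`, and `tokenizer_stream` are assumed to be objects your application has already created, exposing the methods named in this guide.

```python
def generate_single_turn(generator, tokenizer, tokenizer_stream, prompt):
    """Run one question-and-answer turn with the 0.6.0-style calls.

    Hypothetical helper: the three objects are assumed to expose the
    methods named in this guide (encode, append_tokens, is_done,
    generate_next_token, get_next_tokens, decode).
    """
    input_tokens = tokenizer.encode(prompt)

    # 0.5.2 assigned the prompt on the params object:
    #     params.input_ids = input_tokens
    # 0.6.0 appends the prompt to the already-created generator instead.
    generator.append_tokens(input_tokens)

    pieces = []
    while not generator.is_done():
        # Note: no generator.compute_logits() call in the 0.6.0 loop.
        generator.generate_next_token()
        new_token = generator.get_next_tokens()[0]
        pieces.append(tokenizer_stream.decode(new_token))
    return "".join(pieces)
```

The helper returns the decoded response as one string. An application that streams output would print each decoded piece inside the loop instead of collecting them.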
| 28 | + |
| 29 | +### Add system prompt caching |
| 30 | + |
| 31 | +1. Create and tokenize the system prompt and call `generator.append_tokens(system_tokens)`. This call can be done before the user is asked for their prompt. |
| 32 | + |
| 33 | + ```python |
| 34 | + system_tokens = tokenizer.encode(system_prompt) |
| 35 | + generator.append_tokens(system_tokens) |
| 36 | + ``` |
| 37 | + |
| 38 | +### Add chat mode |
| 39 | + |
| 40 | +1. Create a loop in your application, and call `generator.append_tokens(prompt)` every time the user provides new input: |
| 41 | + |
| 42 | + ```python |
| 43 | + while True: |
| 44 | +     user_input = input("Input: ") |
| 45 | +     input_tokens = tokenizer.encode(user_input) |
| 46 | +     generator.append_tokens(input_tokens)  # add the new user turn |
| 47 | +     try:  # Ctrl+C stops the current response, not the chat |
| 48 | +         while not generator.is_done(): |
| 49 | +             generator.generate_next_token() |
| 50 | + |
| 51 | +             new_token = generator.get_next_tokens()[0] |
| 52 | +             print(tokenizer_stream.decode(new_token), end='', flush=True) |
| 53 | +     except KeyboardInterrupt: |
| 54 | +         print() |
| 55 | + ``` |
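
System prompt caching and chat mode can be combined by appending the system prompt once, before the turn loop starts. The sketch below illustrates that ordering under stated assumptions: `run_chat` is a hypothetical helper, the user inputs are passed in as a list rather than read interactively, and the three objects are assumed to expose the methods named in this guide.

```python
def run_chat(generator, tokenizer, tokenizer_stream, system_prompt, user_inputs):
    """Hypothetical helper combining system prompt caching and chat mode."""
    # Appended once, outside the loop, so the system prompt is not
    # re-submitted on every turn.
    generator.append_tokens(tokenizer.encode(system_prompt))

    replies = []
    for user_input in user_inputs:
        # Each new user turn is appended to the same generator.
        generator.append_tokens(tokenizer.encode(user_input))

        pieces = []
        while not generator.is_done():
            generator.generate_next_token()
            new_token = generator.get_next_tokens()[0]
            pieces.append(tokenizer_stream.decode(new_token))
        replies.append("".join(pieces))
    return replies
```

Taking the turns as a list keeps the sketch easy to exercise without a terminal. An interactive application would read each turn with `input()` as in the loop above.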
| 56 | + |
| 57 | +## C/C++ |
| 58 | + |
| 59 | +### Migrate C/C++ question and answer (single turn) code to 0.6.0 |
| 60 | + |
| 61 | +1. Replace calls to `params->SetInputSequences(*sequences)` with `generator->AppendTokenSequences(*sequences)` |
| 62 | +2. Remove calls to `generator->ComputeLogits()` |
| 63 | + |
| 64 | +## C# |
| 65 | + |
| 66 | +### Migrate C# question and answer (single turn) code to 0.6.0 |
| 67 | + |
| 68 | +1. Replace calls to `generatorParams.SetInputSequences(sequences)` with `generator.AppendTokenSequences(sequences)` |
| 69 | +2. Remove calls to `generator.ComputeLogits()` |
| 70 | + |
| 71 | +## Java |
| 72 | + |
| 73 | +### Migrate Java question and answer (single turn) code to 0.6.0 |
| 74 | + |
| 75 | +1. Replace calls to `GeneratorParams::setInput(sequences)` with `Generator::AppendTokenSequences` |
| 76 | +2. Remove calls to `Generator::ComputeLogits` |