
The NPU output length is set to 2048 by default. #966


Open · wants to merge 5 commits into master
20 changes: 10 additions & 10 deletions modules/ollama_openvino/README.md
@@ -584,17 +584,17 @@ Getting started with large language models and using the [GenAI](https://github.
We provide two ways to download the Ollama executable: from Google Drive or from Baidu Drive.
## Google Drive
### Windows
[Download exe](https://drive.google.com/file/d/1Sep1IdGn7mJaE8PCXKYxp_aj1ljiPvpN/view?usp=sharing) + [Download OpenVINO GenAI](https://storage.openvinotoolkit.org/repositories/openvino_genai/packages/nightly/2025.2.0.0.dev20250320/openvino_genai_windows_2025.2.0.0.dev20250320_x86_64.zip)
[Download exe](https://drive.google.com/file/d/1Xo3ohbfC852KtJy_4xtn_YrYaH4Y_507/view?usp=sharing) + [Download OpenVINO GenAI](https://storage.openvinotoolkit.org/repositories/openvino_genai/packages/nightly/2025.2.0.0.dev20250513/openvino_genai_windows_2025.2.0.0.dev20250513_x86_64.zip)

### Linux (Ubuntu 22.04)
[Download](https://drive.google.com/file/d/1DdBoEGp_eoyJPbpMGVbEivihYSKrCMGt/view?usp=sharing) + [Download OpenVINO GenAI](https://storage.openvinotoolkit.org/repositories/openvino_genai/packages/nightly/2025.2.0.0.dev20250320/openvino_genai_ubuntu22_2025.2.0.0.dev20250320_x86_64.tar.gz)
[Download](https://drive.google.com/file/d/1_P7CQqFUqeyx4q5y5bQ-xQsb10T9gzJD/view?usp=sharing) + [Download OpenVINO GenAI](https://storage.openvinotoolkit.org/repositories/openvino_genai/packages/nightly/2025.2.0.0.dev20250513/openvino_genai_ubuntu22_2025.2.0.0.dev20250513_x86_64.tar.gz)

## Baidu Drive
### Windows
[Download exe](https://pan.baidu.com/s/1TCH7rYSPr8jQDHLvCeXdLw?pwd=6bk9) + [Download OpenVINO GenAI](https://storage.openvinotoolkit.org/repositories/openvino_genai/packages/nightly/2025.2.0.0.dev20250320/openvino_genai_windows_2025.2.0.0.dev20250320_x86_64.zip)
[Download exe](https://pan.baidu.com/s/1uIUjji7Mxf594CJy1vbrVw?pwd=36mq) + [Download OpenVINO GenAI](https://storage.openvinotoolkit.org/repositories/openvino_genai/packages/nightly/2025.2.0.0.dev20250513/openvino_genai_windows_2025.2.0.0.dev20250513_x86_64.zip)

### Linux (Ubuntu 22.04)
[Download](https://pan.baidu.com/s/1UVO0ZK4DFTjTwfarQ8LUIw?pwd=pxkd) + [Download OpenVINO GenAI](https://storage.openvinotoolkit.org/repositories/openvino_genai/packages/nightly/2025.2.0.0.dev20250320/openvino_genai_ubuntu22_2025.2.0.0.dev20250320_x86_64.tar.gz)
[Download](https://pan.baidu.com/s/1OCq3aKJBiCrtjLKa7kXbMw?pwd=exhz) + [Download OpenVINO GenAI](https://storage.openvinotoolkit.org/repositories/openvino_genai/packages/nightly/2025.2.0.0.dev20250513/openvino_genai_ubuntu22_2025.2.0.0.dev20250513_x86_64.tar.gz)

## Docker
### Linux
@@ -725,7 +725,7 @@ Let's take [deepseek-ai/DeepSeek-R1-Distill-Qwen-7B](https://hf-mirror.com/deeps

4. Unzip the OpenVINO GenAI package and set up the environment
```shell
cd openvino_genai_windows_2025.2.0.0.dev20250320_x86_64
cd openvino_genai_windows_2025.2.0.0.dev20250513_x86_64
setupvars.bat
```

@@ -802,9 +802,9 @@ Then build and run Ollama from the root directory of the repository:

3. Initialize the GenAI environment

Download GenAI runtime from [GenAI](https://storage.openvinotoolkit.org/repositories/openvino_genai/packages/nightly/2025.2.0.0.dev20250320/openvino_genai_windows_2025.2.0.0.dev20250320_x86_64.zip), then extract it to a directory openvino_genai_windows_2025.2.0.0.dev20250320_x86_64.
Download GenAI runtime from [GenAI](https://storage.openvinotoolkit.org/repositories/openvino_genai/packages/nightly/2025.2.0.0.dev20250513/openvino_genai_windows_2025.2.0.0.dev20250513_x86_64.zip), then extract it to a directory openvino_genai_windows_2025.2.0.0.dev20250513_x86_64.
```shell
cd openvino_genai_windows_2025.2.0.0.dev20250320_x86_64
cd openvino_genai_windows_2025.2.0.0.dev20250513_x86_64
setupvars.bat
```

@@ -819,7 +819,7 @@ Then build and run Ollama from the root directory of the repository:
```shell
go build -o ollama.exe
```

6. If you don't want to recompile Ollama, you can use the precompiled executable [ollama](https://drive.google.com/file/d/1iizO9iLhSJGFUu6BgY3EwOchrCyzImUN/view?usp=drive_link) directly; initialize the GenAI environment as in `step 3`, then run Ollama.
6. If you don't want to recompile Ollama, you can use the precompiled executable directly; initialize the GenAI environment as in `step 3`, then run Ollama.

If you encounter an error when executing ollama.exe, it is recommended that you recompile from source.
@@ -840,9 +840,9 @@ Then build and run Ollama from the root directory of the repository:

3. Initialize the GenAI environment

Download GenAI runtime from [GenAI](https://storage.openvinotoolkit.org/repositories/openvino_genai/packages/nightly/2025.2.0.0.dev20250320/openvino_genai_ubuntu22_2025.2.0.0.dev20250320_x86_64.tar.gz), then extract it to a directory openvino_genai_ubuntu22_2025.2.0.0.dev20250320_x86_64.
Download GenAI runtime from [GenAI](https://storage.openvinotoolkit.org/repositories/openvino_genai/packages/nightly/2025.2.0.0.dev20250513/openvino_genai_ubuntu22_2025.2.0.0.dev20250513_x86_64.tar.gz), then extract it to a directory openvino_genai_ubuntu22_2025.2.0.0.dev20250513_x86_64.
```shell
cd openvino_genai_ubuntu22_2025.2.0.0.dev20250320_x86_64
cd openvino_genai_ubuntu22_2025.2.0.0.dev20250513_x86_64
source setupvars.sh
```

20 changes: 19 additions & 1 deletion modules/ollama_openvino/genai/genai.go
@@ -22,6 +22,19 @@ package genai
typedef int (*callback_function)(const char*, void*);

extern int goCallbackBridge(char* input, void* ptr);

static ov_status_e ov_genai_llm_pipeline_create_npu_output_2048(const char* models_path,
                                                                 const char* device,
                                                                 ov_genai_llm_pipeline** pipe) {
    return ov_genai_llm_pipeline_create(models_path, "NPU", 4, pipe, "MAX_PROMPT_LEN", "2048", "MIN_RESPONSE_LEN", "256");
}

**Reviewer:** This should not be hardcoded to 2048. It should be updated to take variables `max_prompt_len` and `min_response_len` so that a larger range of sizes can be supported.

**Author:** Hi @gblong1, to my knowledge the default output length for the NPU is 1024, and it is not dynamically adjustable. This can be referenced in the following link: genai npu default output len. The default was set to 2048 because, during testing, some responses were being truncated, which is not the intended behavior. This adjustment keeps responses complete and aligned with expectations. Thank you.

**Reviewer:** I agree that it should be made larger. However, it should not be hardcoded: it should be configurable, so that it doesn't have to be re-hardcoded to a bigger number later, and so that models that need even more than 2048 (like DeepSeek) will also work well.

**Author:** Yes, I agree that a dynamically adjustable output length would be ideal. However, the NPU behaves differently from the CPU/GPU. While the output length for the CPU/GPU can be set at generation time, the NPU requires the output length to be specified when the model is loaded; once loaded, it cannot be changed without reloading the model. At this stage, making the NPU's output length dynamically adjustable would require reloading the model and involve significant changes, so setting it to 2048 is a practical solution that should cover most common use cases.

static ov_status_e ov_genai_llm_pipeline_create_cgo(const char* models_path,
                                                     const char* device,
                                                     ov_genai_llm_pipeline** pipe) {
    return ov_genai_llm_pipeline_create(models_path, device, 0, pipe);
}

*/
import "C"

@@ -111,7 +124,12 @@ func CreatePipeline(modelsPath string, device string) *C.ov_genai_llm_pipeline {
	defer C.free(unsafe.Pointer(cModelsPath))
	defer C.free(unsafe.Pointer(cDevice))

	// C.ov_genai_llm_pipeline_create(cModelsPath, cDevice, &pipeline)
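	// The NPU takes a dedicated creation path because MAX_PROMPT_LEN and
	// MIN_RESPONSE_LEN must be fixed when the pipeline is loaded
	// (see the review discussion above).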
if device == "NPU" {
C.ov_genai_llm_pipeline_create_npu_output_2048(cModelsPath, cDevice, &pipeline)
} else {
C.ov_genai_llm_pipeline_create_cgo(cModelsPath, cDevice, &pipeline)
}
return pipeline
}
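For illustration, here is a minimal sketch of the configurable approach the reviewer suggests, reading the limits from the environment instead of hardcoding them. This is a hypothetical illustration, not part of this PR: the environment-variable names, the helper names, and the include line are assumptions, and the real module wires the OpenVINO GenAI C API through its own cgo preamble.

```go
// Hypothetical sketch only: env-var names, helper names, and the include
// below are assumptions, not part of this PR.
package genai

/*
#include <stdlib.h>
// Assumed include; the real module pulls in the OpenVINO GenAI C API
// through its existing cgo preamble.
#include "llm_pipeline_c.h"

// The variadic property list is assembled on the C side, since Go cannot
// call variadic C functions directly.
static ov_status_e create_npu_pipeline_with_limits(const char* models_path,
                                                   const char* max_prompt_len,
                                                   const char* min_response_len,
                                                   ov_genai_llm_pipeline** pipe) {
    return ov_genai_llm_pipeline_create(models_path, "NPU", 4, pipe,
                                        "MAX_PROMPT_LEN", max_prompt_len,
                                        "MIN_RESPONSE_LEN", min_response_len);
}
*/
import "C"

import (
	"os"
	"unsafe"
)

// envOr returns the value of an environment variable, or def if it is unset.
func envOr(key, def string) string {
	if v := os.Getenv(key); v != "" {
		return v
	}
	return def
}

// CreateNPUPipelineWithLimits loads an NPU pipeline whose prompt/response
// limits come from the environment, falling back to this PR's defaults.
func CreateNPUPipelineWithLimits(modelsPath string) *C.ov_genai_llm_pipeline {
	cPath := C.CString(modelsPath)
	cMaxPrompt := C.CString(envOr("OLLAMA_NPU_MAX_PROMPT_LEN", "2048"))
	cMinResp := C.CString(envOr("OLLAMA_NPU_MIN_RESPONSE_LEN", "256"))
	defer C.free(unsafe.Pointer(cPath))
	defer C.free(unsafe.Pointer(cMaxPrompt))
	defer C.free(unsafe.Pointer(cMinResp))

	var pipeline *C.ov_genai_llm_pipeline
	C.create_npu_pipeline_with_limits(cPath, cMaxPrompt, cMinResp, &pipeline)
	return pipeline
}
```

A pipeline loaded this way would still have fixed limits for its lifetime; per-request changes would require reloading the model, as the author notes above.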

31 changes: 31 additions & 0 deletions modules/ollama_openvino/llm/genai/genaiserver.go
@@ -108,6 +108,36 @@ func SelectDevice(device string, supportedDevices []string) string {
	return device
}

func addIndexToDuplicates(input []string) []string {
	// output := make([]string, 0, len(input))
	var output []string
	counters := make(map[string]int)    // Records the occurrence count of each value
	duplicates := make(map[string]bool) // Marks which values are duplicates

	// First pass: count the occurrences of each value and mark duplicates
	for _, item := range input {
		counters[item]++
		if counters[item] > 1 {
			duplicates[item] = true
		}
	}

	// Second pass: add an index to duplicate values (indices are assigned in
	// descending order: the first duplicate occurrence gets the highest index)
	for _, item := range input {
		if duplicates[item] { // If it's a duplicate
			output = append(output, fmt.Sprintf("%s:%d", item, counters[item]-1))
			counters[item]-- // Update the counter
		} else { // If it's not a duplicate
			output = append(output, item)
		}
	}
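	// If any device is named "GPU", also append a plain "GPU" entry so the
	// bare name remains selectable alongside the indexed ones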
	if ContainsInSlice(input, "GPU") {
		output = append(output, "GPU")
	}

	return output
}
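To illustrate the behavior (a worked example, not part of the diff): duplicates are indexed in descending order, and a plain `GPU` entry is appended whenever `GPU` appears in the input.

```go
// Assumes addIndexToDuplicates and ContainsInSlice from this file are in
// scope; fmt is already imported here.
devices := []string{"CPU", "GPU", "GPU", "NPU"}
fmt.Println(addIndexToDuplicates(devices))
// Output: [CPU GPU:1 GPU:0 NPU GPU]
```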

// NewGenaiServer will run a server
func NewGenaiServer(gpus discover.GpuInfoList, model string, modelname string, inferdevice string, f *ggml.GGML, adapters, projectors []string, opts api.Options, numParallel int) (GenaiServer, error) {
	systemInfo := discover.GetSystemInfo()
@@ -121,6 +151,7 @@ func NewGenaiServer(gpus discover.GpuInfoList, model string, modelname string, i
	for i := 0; i < len(genai_device); i++ {
		genai_device_list = append(genai_device_list, genai_device[i]["device_name"])
	}
	genai_device_list = addIndexToDuplicates(genai_device_list)
	inferdevice = SelectDevice(inferdevice, genai_device_list)

	params := []string{