[Feature Request] Respect `tensor_split` for GPU visibility in Docker + NVIDIA passthrough #9623

okcodemaybe · 2026-04-30T16:23:10Z

📌 Summary

When running LocalAI in a Docker container with NVIDIA GPU passthrough, the inference backend (e.g., llama.cpp) automatically initializes across all visible GPUs in the container, even when tensor_split is explicitly configured to partition the model across only specific devices. This leads to unnecessary GPU context creation, VRAM reservation, and potential instability in multi-GPU container deployments.

🔍 Current Behavior

Container is started with --gpus all or equivalent NVIDIA passthrough.
Model configuration includes tensor_split: "0,1" (or similar) to target specific GPUs.
The backend detects every GPU visible in the container and creates a device context on each one (CUDA in this setup), regardless of the tensor_split directive.
Unused GPUs may still consume VRAM, create CUDA contexts, or interfere with other container workloads.

✅ Expected Behavior

The backend should only initialize and expose GPUs that are explicitly referenced in the tensor_split configuration. For example:

tensor_split: "0,1" → Only GPUs 0 and 1 are initialized.
tensor_split: "0" → Only GPU 0 is initialized.
If tensor_split is omitted or set to auto, fall back to current behavior (use all available GPUs).
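
To make the mapping above concrete, here is a minimal sketch of the selection rule. It follows this request's reading of tensor_split as a list of GPU indices; note that upstream llama.cpp documents `--tensor-split` as a comma-separated list of proportions per GPU, so an implementation would have to reconcile the two meanings. The function name `gpus_to_initialize` and the `available` parameter are hypothetical and not part of LocalAI.

```python
from typing import List, Optional


def gpus_to_initialize(tensor_split: Optional[str], available: List[int]) -> List[int]:
    """Return the GPU indices to initialize, using the semantics proposed in this request."""
    # Omitted or "auto": keep the current behavior and use every visible GPU.
    if tensor_split is None or tensor_split.strip().lower() in ("", "auto"):
        return list(available)

    requested: List[int] = []
    for part in tensor_split.split(","):
        index = int(part.strip())  # raises ValueError on non-integer entries
        if index not in available:
            raise ValueError(f"GPU {index} is not visible in this container")
        if index not in requested:  # ignore duplicates, keep first-seen order
            requested.append(index)
    return requested


print(gpus_to_initialize("0,1", [0, 1, 2, 3]))  # [0, 1]
print(gpus_to_initialize("0", [0, 1, 2, 3]))    # [0]
print(gpus_to_initialize(None, [0, 1, 2, 3]))   # [0, 1, 2, 3]
```

Rejecting an index that is not visible, rather than silently dropping it, is a design choice made for this sketch; the request itself does not say how that case should be handled.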

🛠 Technical Context

Environment: Docker + nvidia-container-toolkit / --gpus flag
Backend: llama.cpp (or GGUF-compatible inference engine)
Config Syntax: tensor_split: "0,1"
OS: Linux (typically)
CUDA/NVIDIA Driver: Compatible with container runtime

💡 Why This Matters

Resource Efficiency: Unnecessary GPU initialization reserves VRAM and CPU threads, even if no tensors are assigned to those devices.
Multi-Model/Workload Isolation: Users often pass through multiple GPUs to a container for flexibility but only want to use a subset per model. Having to run a separate LocalAI instance per model just to control GPU assignment is antithetical to the use-case of the project, which serves multiple models from one endpoint.
Stability: Some NVIDIA drivers/CUDA versions behave unpredictably when multiple contexts are created on unused devices.
Declarative Config Alignment: tensor_split should act as both a partitioning directive and a visibility filter.

🔧 Suggested Implementation

Parse tensor_split early in the backend initialization phase.
Dynamically set CUDA_VISIBLE_DEVICES (or equivalent) based on the referenced GPU indices before loading the inference engine.
Alternatively, introduce a dedicated config field like visible_gpus: "0,1" that overrides auto-detection while keeping tensor_split for weight distribution.
Log which GPUs are active/inactive for transparency:
INFO: tensor_split targets GPUs [0,1]. Restricting CUDA_VISIBLE_DEVICES accordingly.
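
The steps above could be sketched roughly as follows. This is an illustration only: it assumes the inference backend is started as a separate process, so that its environment can be set before CUDA initializes, and it treats the model config as a plain dictionary. `build_backend_env` and the `visible_gpus` key are hypothetical names taken from this proposal, not an existing LocalAI API.

```python
import os
from typing import Dict, Mapping, Optional, Tuple


def build_backend_env(
    model_config: Mapping[str, str],
    parent_env: Optional[Mapping[str, str]] = None,
) -> Tuple[Dict[str, str], str]:
    """Return (environment for the backend process, log line describing the restriction)."""
    env = dict(os.environ if parent_env is None else parent_env)

    # Prefer the dedicated field if present, otherwise fall back to tensor_split.
    field = "visible_gpus" if model_config.get("visible_gpus") else "tensor_split"
    spec = model_config.get(field)
    if spec is None or spec.strip().lower() in ("", "auto"):
        # No restriction requested: leave CUDA_VISIBLE_DEVICES untouched.
        return env, "INFO: no GPU restriction configured. Using all visible GPUs."

    indices = sorted({int(part.strip()) for part in spec.split(",")})
    device_list = ",".join(str(i) for i in indices)
    env["CUDA_VISIBLE_DEVICES"] = device_list
    message = (
        f"INFO: {field} targets GPUs [{device_list}]. "
        "Restricting CUDA_VISIBLE_DEVICES accordingly."
    )
    return env, message


env, message = build_backend_env({"tensor_split": "0,1"}, {"PATH": "/usr/bin"})
print(message)
print(env["CUDA_VISIBLE_DEVICES"])
```

The returned mapping would be passed as the `env` argument when spawning the backend (for example with `subprocess.Popen`). One caveat: inside a process restricted by CUDA_VISIBLE_DEVICES, CUDA renumbers the remaining devices starting from 0, so any per-device split passed to the engine has to be expressed against the renumbered devices.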

🔄 Current Workaround

Manually set CUDA_VISIBLE_DEVICES in the Docker run command or environment file:

docker run --gpus all -e CUDA_VISIBLE_DEVICES=0,1 ...

This works, but it applies to the whole container rather than per model and requires manual orchestration, which defeats the purpose of declarative YAML configuration.

Replies: 0 comments
