41 changes: 23 additions & 18 deletions Llama-Cpp-Dockerfile/Dockerfile
@@ -1,4 +1,4 @@
FROM ubuntu:24.04
FROM ubuntu:25.10

# Set noninteractive mode to avoid prompts
ENV DEBIAN_FRONTEND=noninteractive
@@ -39,19 +39,27 @@ RUN curl -fsSL https://apt.repos.intel.com/intel-gpg-keys/GPG-PUB-KEY-INTEL-SW-P
intel-oneapi-base-toolkit && \
apt-get clean && rm -rf /var/lib/apt/lists/*

# Set up oneAPI environment for interactive sessions
RUN echo 'source /opt/intel/oneapi/setvars.sh --force' >> /root/.bashrc

# Install UV
# Install uv
RUN curl -fsSL https://astral.sh/uv/install.sh -o /uv-installer.sh && \
sh /uv-installer.sh && rm /uv-installer.sh
ENV PATH="/root/.local/bin/:$PATH"
sh /uv-installer.sh && \
cp /root/.local/bin/uv /usr/local/bin/uv && \
rm /uv-installer.sh

# Create virtual environment
RUN uv venv /opt/venv
ENV VIRTUAL_ENV=/opt/venv
ENV PATH="/opt/venv/bin:$PATH"

# Create group and user
RUN groupadd -g 993 render && \
useradd -m -s /bin/bash user && \
mkdir -p /home/user && \
chown -R user /home/user/ /opt/venv/ && \
usermod -a -G video,render user

# Add oneAPI environment to user's bashrc
RUN echo 'source /opt/intel/oneapi/setvars.sh --force' >> /home/user/.bashrc

# Install Huggingface Hub
RUN uv pip install huggingface-hub

@@ -60,17 +68,14 @@ ENV CMAKE_ARGS="-DGGML_SYCL=on -DCMAKE_C_COMPILER=icx -DCMAKE_CXX_COMPILER=icpx"
RUN bash -c "source /opt/intel/oneapi/setvars.sh --force && \
uv pip install llama-cpp-python[server]==0.3.8 -U --force-reinstall --no-cache-dir --verbose"

# Create a non-root user
RUN useradd -m -s /bin/bash appuser && \
chown -R appuser:appuser /opt/venv /root/.local
USER appuser
# Add healthcheck to satisfy Trivy
HEALTHCHECK --interval=30s --timeout=10s --start-period=5s --retries=3 \
CMD python -c "import llama_cpp; print('OK'); print(llama_cpp.__version__)" || exit 1

# Expose default server port
EXPOSE 8000
# Switch to non-root user
USER user

# Add health check to monitor server status
HEALTHCHECK --interval=30s --timeout=10s --start-period=60s --retries=3 \
CMD curl -f http://localhost:8000/v1/models || curl -f http://localhost:8000/ || exit 1
ENTRYPOINT ["/bin/bash", "-c", "source /opt/intel/oneapi/setvars.sh && uv run python -m llama_cpp.server \"$@\"", "--"]
CMD ["--hf_model_repo_id", "Qwen/Qwen2-0.5B-Instruct-GGUF", "--model", "*q8_0.gguf", "--n_gpu_layers", "-1"]

# Set default command
ENTRYPOINT ["uv", "run", "python", "-m", "llama_cpp.server"]
# CMD ["/bin/bash"]
95 changes: 83 additions & 12 deletions Llama-Cpp-Dockerfile/README.md
@@ -1,13 +1,30 @@
# llama-cpp-python with Intel GPU (SYCL) Docker

This repository provides a Dockerfile to build a containerized environment for running [llama-cpp-python](https://github.com/abetlen/llama-cpp-python) with Intel GPU acceleration using SYCL (oneAPI). The image is based on Ubuntu 24.04 and includes all necessary Intel GPU drivers, oneAPI Base Toolkit, and Python dependencies for efficient LLM inference.
This repository provides a Dockerfile to build a containerized environment for running [llama-cpp-python](https://github.com/abetlen/llama-cpp-python) with Intel GPU acceleration using SYCL (oneAPI). The image is based on Ubuntu 25.10 and includes all necessary Intel GPU drivers, oneAPI Base Toolkit, and Python dependencies for efficient LLM inference.

## Table of Contents

- [Features](#features)
- [Build the docker image](#build-the-docker-image)
- [Run the container](#run-the-container)
- [Default container](#default-container)
- [Custom model or arguments](#custom-model-or-arguments)
- [Mount a local directory and run a model from it](#mount-a-local-directory-and-run-a-model-from-it)
- [Run bash shell in the container](#run-bash-shell-in-the-container)
- [Notes](#notes)
- [License](#license)

---

## Features
- **Intel GPU support**: Installs Intel GPU drivers and oneAPI for SYCL acceleration.

- **Intel GPU support**: Installs [Intel GPU drivers](https://dgpu-docs.intel.com/driver/client/overview.html) and [oneAPI](https://www.intel.com/content/www/us/en/developer/tools/oneapi/overview.html) for SYCL acceleration.
- **llama-cpp-python**: Builds and installs with Intel SYCL support for fast inference.
- **Python environment**: Uses [uv](https://github.com/astral-sh/uv) for fast Python dependency management.
- **Ready-to-use server**: Runs the llama-cpp-python server by default.

---

## Build the Docker Image

Clone this repository and build the Docker image:
@@ -17,43 +34,97 @@ Clone this repository and build the Docker image:
docker build -t llamacpp-intel-sycl:latest .
```

---

## Run the Container

### Default container

To run the server with a default HuggingFace model (Qwen/Qwen2-0.5B-Instruct-GGUF):

```sh
docker run --rm --device /dev/dri --gpus all -p 8000:8000 llamacpp-intel-sycl:latest
docker run -it --rm \
  --device /dev/dri \
  -p 8000:8000 \
  llamacpp-intel-sycl:latest
```

- `--device /dev/dri` exposes the Intel GPU to the container.
- `-p 8000:8000` maps the server port to the host.
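Before starting the container, it can help to confirm that the host actually exposes an Intel GPU render node. This is a sketch; the device path assumes the standard DRM layout used by the Intel drivers:

```shell
#!/bin/sh
# Intel GPUs normally appear as DRM render nodes at /dev/dri/renderD*.
if ls /dev/dri/renderD* >/dev/null 2>&1; then
  echo "render node present"
else
  echo "no render node found - install the Intel GPU drivers first"
fi
```

If no render node is found, install the host drivers linked in the Features section before running the container.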

### Custom Model or Arguments
### Custom model or arguments

The default model and arguments can be overridden as needed:

```sh
docker run --rm --device /dev/dri --gpus all -p 8000:8000 llamacpp-intel-sycl:latest \
--hf_model_repo_id <hf-repo> --model <model-file>
docker run -it --rm \
  --device /dev/dri \
  -p 8000:8000 \
  llamacpp-intel-sycl:latest \
  --n_gpu_layers -1 \
  --hf_model_repo_id <hf-repo> --model <model-file>
```

## Mount a Local Directory and Run a Model from It
- `--n_gpu_layers -1` offloads all model layers to the GPU.
- Replace `<hf-repo>` with the Hugging Face model repo id to use.
- Replace `<model-file>` with the model filename (or glob pattern) within that repo.

### Mount a local directory and run a model from it

To use a model file stored on the host machine, mount the directory containing the model into the container and specify the path to the model file. For example, if the model is located in `/path/to/models` on the host:

```sh
docker run --rm --device /dev/dri --gpus all -p 8000:8000 \
-v /path/to/models:/models \
llamacpp-intel-sycl:latest \
--model /models/<model-file>
docker run -it --rm \
  --device /dev/dri \
  -p 8000:8000 \
  -v /path/to/models:/models \
  llamacpp-intel-sycl:latest \
  --n_gpu_layers -1 \
  --model /models/<model-file>
```

- `-v /path/to/models:/models` mounts the local directory into the container at `/models`.
- Replace `<model-file>` with the actual filename of the model inside `/path/to/models`.

This approach can be combined with other arguments as needed.
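Once a container is running, the OpenAI-compatible endpoints can be exercised with `curl`. This sketch assumes the default port mapping; `/v1/models` is the same route the image's health check polls, and the fallback message only fires when the server is not up:

```shell
#!/bin/sh
BASE_URL=${BASE_URL:-http://localhost:8000}

# List the loaded models (same endpoint the Dockerfile HEALTHCHECK uses).
curl -sf "$BASE_URL/v1/models" || echo "server not reachable at $BASE_URL"

# Send a minimal chat completion request.
curl -sf "$BASE_URL/v1/chat/completions" \
  -H "Content-Type: application/json" \
  -d '{"messages": [{"role": "user", "content": "Hello"}]}' \
  || echo "server not reachable at $BASE_URL"
```

See the llama-cpp-python server documentation linked in the Notes section for the full set of endpoints and request parameters.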

### Run bash shell in the container

To override the entry point and start the container with the bash shell, run the following command:

```sh
docker run -it --rm \
  --device=/dev/dri \
  --net host \
  --entrypoint /bin/bash \
  llamacpp-intel-sycl:latest
```
- `--net host` uses the host's network stack instead of creating a separate container network.
- `--entrypoint /bin/bash` overrides the default server entry point so the container starts a Bash shell instead.
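Inside the shell, the oneAPI environment is already sourced via the user's `.bashrc`; a quick way to confirm the GPU is visible to SYCL is `sycl-ls`, which ships with the oneAPI Base Toolkit. The guards below are only so the snippet degrades gracefully when run outside the container:

```shell
#!/bin/sh
# Source the oneAPI environment if present (the image does this via ~/.bashrc).
[ -f /opt/intel/oneapi/setvars.sh ] && . /opt/intel/oneapi/setvars.sh --force

# sycl-ls lists the SYCL backends and devices the runtime can see.
command -v sycl-ls >/dev/null 2>&1 && sycl-ls || echo "sycl-ls not on PATH"
```

If the GPU does not appear in the `sycl-ls` output, re-check that `/dev/dri` was passed to the container and that the host drivers are installed.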

---

## Notes

- Make sure your host system has an Intel GPU and the necessary drivers installed.
- For more information about supported models, server options, and how to call inference endpoints, see the [llama-cpp-python OpenAI Server documentation](https://llama-cpp-python.readthedocs.io/en/latest/server/).
- If you are behind a proxy, update `~/.docker/config.json` with the correct proxy settings before running the container:
```sh
## ~/.docker/config.json
{
  "proxies": {
    "default": {
      "httpProxy": <your-proxy-details>,
      "httpsProxy": <your-proxy-details>,
      "noProxy": <your-proxy-details>
    }
  }
}
```

---

## License

See the [LICENSE](./LICENSE) file for this repository's license.