This repository provides a Dockerfile, plus instructions for pulling a prebuilt image from Docker Hub, to run a DeepSpeed-MII server that serves your locally deployed LLM. The container runs DeepSpeed-MII behind an OpenAI-API-compatible endpoint. The first launch can take around 5 minutes. Once running, you can load any Hugging Face model (e.g., mistralai/Mistral-7B-Instruct-v0.3) or simply use the prebuilt image on Docker Hub, and interact via /v1/chat/completions just like you would with the official OpenAI API.
- Prerequisites
- Repository Structure
- Option A: Pull the Prebuilt Image
- Option B: Build from Source
- Running the Container
- Testing with Linux CLI (`curl`)
- Testing with Python + OpenAI SDK
- Environment Variables
- Customizing & Troubleshooting
- License
## Prerequisites

- Docker (20.10+).
- NVIDIA Container Toolkit, to allow `--gpus all` (a quick check follows this list).
- A valid Hugging Face Hub token if you plan to pull private or gated models:

  ```bash
  export HF_TOKEN=<your_hf_token>
  ```

- (Optional) OpenAI Python SDK for testing in Python:

  ```bash
  pip install openai
  ```
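A quick way to confirm that Docker can see your GPUs before launching the server; this reuses the same CUDA image tag referenced in the Dockerfile:

```bash
# Should print your GPU(s); if it fails, the NVIDIA Container Toolkit is not set up correctly
docker run --rm --gpus all nvidia/cuda:12.2.2-devel-ubuntu20.04 nvidia-smi
```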
## Repository Structure

```text
deepSpeed-mii-container/
├── Dockerfile
├── readme.md
└── .gitignore
```

- `Dockerfile`: Builds a CUDA-enabled image with DeepSpeed-MII, Pydantic v2, pydantic-settings, sentencepiece, FastAPI, Uvicorn, ShortUUID, and FastChat. Exposes port 23333 by default.
- `readme.md`: This file; it contains instructions for pulling or building, running, and testing.
- `.gitignore`: Ignores local artifacts like `__pycache__` and logs.
## Option A: Pull the Prebuilt Image

If you want to skip building locally, simply pull the prebuilt image from Docker Hub:

```bash
# Pull the image (tagged "latest")
docker pull slinusc/deepspeed-mii:latest
```

Now you can jump to Running the Container below, using `slinusc/deepspeed-mii:latest` as the image name.
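## Option B: Build from Source

If you prefer to build the image yourself, you can build it from the Dockerfile at the repository root. A minimal sketch; the local tag `deepspeed-mii:latest` is arbitrary, so use whatever name you like and pass it to `docker run` in place of the Docker Hub image name:

```bash
# From the repository root (next to the Dockerfile)
docker build -t deepspeed-mii:latest .
```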
## Running the Container

Whether you pulled the prebuilt image or built locally, run the container with GPU support, mounting your Hugging Face cache and exposing port 23333:

```bash
# Using the prebuilt Docker Hub image:
docker run --runtime=nvidia --gpus all \
    -v $HOME/.cache/huggingface:/root/.cache/huggingface \
    -e HUGGING_FACE_HUB_TOKEN=$HF_TOKEN \
    -p 127.0.0.1:23333:23333 \
    --ipc=host \
    slinusc/deepspeed-mii:latest \
    --model mistralai/Mistral-7B-Instruct-v0.3 \
    --port 23333
```

- `--runtime=nvidia --gpus all`: Access all GPUs.
- `-v $HOME/.cache/huggingface:/root/.cache/huggingface`: Mount the HF cache so weights aren’t re-downloaded.
- `-e HUGGING_FACE_HUB_TOKEN=$HF_TOKEN`: Pass the HF token into the container.
- `-p 127.0.0.1:23333:23333`: Map container port 23333 → host port 23333.
- `--ipc=host`: Share the IPC namespace to reduce overhead.
- `--model mistralai/Mistral-7B-Instruct-v0.3`: Hugging Face path of the model to load.
- `--port 23333`: Force Uvicorn to bind inside the container on port 23333.
You should see logs like:
```text
INFO: Started server process [1]
INFO: Waiting for application startup.
INFO: Application startup complete.
INFO: Uvicorn running on http://0.0.0.0:23333 (Press CTRL+C to quit)
```
At that point, the server is live at http://127.0.0.1:23333/v1/....
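The command above runs in the foreground. If you prefer to run detached, you can add `-d` (and optionally `--name`) to the same `docker run` invocation and follow the logs separately; these are standard Docker commands, and the container name used here is arbitrary:

```bash
# Follow the logs of a detached container started with: docker run -d --name mii-server (plus the flags above)
docker logs -f mii-server

# Or look up the container by image if you did not name it
docker ps --filter ancestor=slinusc/deepspeed-mii:latest
```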
## Testing with Linux CLI (`curl`)

Once the container is running, open a new terminal and run:
```bash
curl http://127.0.0.1:23333/v1/models
```

Expected JSON:

```json
{
  "object": "list",
  "data": [
    {
      "id": "mistralai/Mistral-7B-Instruct-v0.3",
      "object": "model",
      "created": 1748684820,
      "owned_by": "deepspeed-mii",
      "root": "mistralai/Mistral-7B-Instruct-v0.3",
      "parent": null,
      "permission": [ … ]
    }
  ]
}
```

The key field is `"id"`, which you must use for subsequent requests.
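If you just need the model ID for scripting, you can extract it with `jq` (assuming `jq` is installed on the host):

```bash
# Print only the model ID reported by the server
curl -s http://127.0.0.1:23333/v1/models | jq -r '.data[0].id'
```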
Send a chat completion request:

```bash
curl http://127.0.0.1:23333/v1/chat/completions \
  -X POST \
  -H "Content-Type: application/json" \
  -d '{
        "model": "mistralai/Mistral-7B-Instruct-v0.3",
        "messages": [
          { "role": "system", "content": "You are a helpful assistant." },
          { "role": "user", "content": "Tell me a fun fact about penguins." }
        ],
        "max_tokens": 32,
        "temperature": 0.7
      }'
```

Or use the plain completions endpoint:

```bash
curl http://127.0.0.1:23333/v1/completions \
  -X POST \
  -H "Content-Type: application/json" \
  -d '{
        "model": "mistralai/Mistral-7B-Instruct-v0.3",
        "prompt": "Once upon a time in a distant galaxy,",
        "max_tokens": 50,
        "temperature": 0.7
      }'
```

You’ll receive a JSON response with a `choices` array containing the generated completion.
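To print only the generated text instead of the full JSON, you can pipe the response through `jq`; the field path follows the standard OpenAI chat schema (the same one the Python example below reads):

```bash
# Extract just the assistant's reply from the chat completion response
curl -s http://127.0.0.1:23333/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "mistralai/Mistral-7B-Instruct-v0.3",
        "messages": [ { "role": "user", "content": "Tell me a fun fact about penguins." } ],
        "max_tokens": 32
      }' | jq -r '.choices[0].message.content'
```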
## Testing with Python + OpenAI SDK

Install the OpenAI SDK locally (if you haven’t already):

```bash
pip install openai
```

Save the following as `test_mii.py`:
```python
from openai import OpenAI

# Point to the local MII endpoint:
client = OpenAI(
    api_key="",  # no key needed if the container does not require auth
    base_url="http://127.0.0.1:23333/v1"
)

response = client.chat.completions.create(
    model="mistralai/Mistral-7B-Instruct-v0.3",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "How many wings does a penguin have?"}
    ],
    max_tokens=16,
    temperature=0.7
)

print(response.choices[0].message.content)
```

Run:

```bash
python3 test_mii.py
```

You should see a short answer about penguins printed. If it errors, ensure:
- The container is still running.
- The correct model ID is used.
- Port 23333 is mapped. (Quick command-line checks follow this list.)
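A couple of quick checks for the points above, using standard Docker and curl commands (nothing here is specific to this image):

```bash
# Is the container still running?
docker ps --filter ancestor=slinusc/deepspeed-mii:latest

# Does the server answer, and which model ID does it report? (Also confirms the port mapping.)
curl -s http://127.0.0.1:23333/v1/models
```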
## Environment Variables

- `HF_TOKEN` (or `HUGGING_FACE_HUB_TOKEN` inside the container): Your Hugging Face Hub token for private or gated models.

  ```bash
  export HF_TOKEN=<your_hf_token>
  ```

- `OPENAI_API_KEY` (optional): If you configured the container to require an API key, set this on your host and pass it with `-e OPENAI_API_KEY` in `docker run`. Otherwise, the container defaults to no-auth mode. (An example of forwarding the key follows this list.)
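For example, if you did configure key auth (not the default), a sketch of forwarding the key from the host; it simply adds one `-e` flag to the run command from Running the Container:

```bash
# Hypothetical: only needed if the container was set up to require an API key
export OPENAI_API_KEY=<your_api_key>

docker run --runtime=nvidia --gpus all \
    -v $HOME/.cache/huggingface:/root/.cache/huggingface \
    -e HUGGING_FACE_HUB_TOKEN=$HF_TOKEN \
    -e OPENAI_API_KEY \
    -p 127.0.0.1:23333:23333 \
    --ipc=host \
    slinusc/deepspeed-mii:latest \
    --model mistralai/Mistral-7B-Instruct-v0.3 \
    --port 23333
```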
## Customizing & Troubleshooting

In the Dockerfile, you can swap:

```dockerfile
FROM nvidia/cuda:12.2.2-devel-ubuntu20.04
```

for another tag such as `nvidia/cuda:11.8.0-devel-ubuntu20.04` to match your GPU driver.

To serve a different model, change the `--model` argument when running:

```bash
--model your-org/your-model-name
```

If you want to load a quantized checkpoint, append `--quantize gptq` or similar.
## License

This repository is licensed under the MIT License. See LICENSE for details. If you omit a LICENSE file, it defaults to “All rights reserved.”
Congratulations! You have a fully functional Docker container that runs DeepSpeed-MII in OpenAI-API compatibility mode. Anyone can now either pull the prebuilt image (slinusc/deepspeed-mii:latest) or build from source and run a local inference server for Mistral or any other Hugging Face model.