Skip to content

Fix Docker deployment#30

Merged
fabnemEPFL merged 21 commits into
EPFLiGHT:mainfrom
HarshaSatyavardhan:fix/docker-deployment
May 12, 2026
Merged

Fix Docker deployment#30
fabnemEPFL merged 21 commits into
EPFLiGHT:mainfrom
HarshaSatyavardhan:fix/docker-deployment

Conversation

@HarshaSatyavardhan
Copy link
Copy Markdown
Contributor

@HarshaSatyavardhan HarshaSatyavardhan commented Mar 25, 2026

Fixes #22

  • Pin base image to lmsysorg/sglang:latest
  • Add --platform=linux/amd64 to fix arm64 CI build
  • Add NVIDIA_VISIBLE_DEVICES and NVIDIA_DRIVER_CAPABILITIES env vars for GPU visibility
  • Add ENTRYPOINT to make the container executable
  • Add docker-compose.yml with NVIDIA GPU support
  • Add Docker section to README

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Updates the project’s Docker packaging to make deployments reproducible and runnable via docker compose, with explicit GPU-related defaults.

Changes:

  • Pin the Docker base image and force amd64 builds in the Dockerfile; add GPU env vars and an executable ENTRYPOINT.
  • Add a docker-compose.yml service definition requesting NVIDIA GPUs.
  • Add .dockerignore and document Docker build/run steps in the README.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 2 comments.

File Description
docker/Dockerfile Pins base image, sets NVIDIA env vars, installs package, and defines ENTRYPOINT.
docker-compose.yml Adds a compose service intended to request NVIDIA GPU access.
README.md Documents docker compose build and docker compose run usage.
.dockerignore Reduces Docker build context size by ignoring caches/artifacts.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread docker/Dockerfile

RUN pip install --no-cache-dir .

ENTRYPOINT ["python3", "-m", "mmirage.shard_process"]
Copy link

Copilot AI Mar 27, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The container ENTRYPOINT runs mmirage.shard_process, but the current config loading relies on ProcessorRegistry being populated via importing processor modules (e.g., the llm processor registers itself at import time). shard_process.py/load_mmirage_config() don’t import mmirage.core.process (or the processor modules), so --config runs are likely to fail with Processor llm not registered. Consider ensuring processor registration happens before calling load_mmirage_config (e.g., import mmirage.core.process in the CLI/module or add explicit registration in config loading).

Suggested change
ENTRYPOINT ["python3", "-m", "mmirage.shard_process"]
ENTRYPOINT ["python3", "-c", "import mmirage.core.process, runpy; runpy.run_module('mmirage.shard_process', run_name='__main__')"]

Copilot uses AI. Check for mistakes.
Comment thread README.md Outdated
Comment on lines +33 to +34
The container requires an NVIDIA GPU. The `docker-compose.yml` handles GPU access automatically.

Copy link

Copilot AI Mar 27, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The README implies GPU access will work automatically via docker-compose.yml, but users still need host-side prerequisites (NVIDIA drivers + NVIDIA Container Toolkit / nvidia-container-runtime) for the container to see GPUs. Consider adding a short prerequisite note (and any minimum Docker/Compose versions) so docker compose run failures are easier to diagnose.

Suggested change
The container requires an NVIDIA GPU. The `docker-compose.yml` handles GPU access automatically.
The container requires an NVIDIA GPU. The `docker-compose.yml` is configured to request GPU access, but the host must have:
- NVIDIA GPU drivers installed
- NVIDIA Container Toolkit / `nvidia-container-runtime` configured for Docker
- A recent Docker Engine and Docker Compose version with GPU support enabled
Without these host-side prerequisites, `docker compose run` may fail to detect or use the GPU.

Copilot uses AI. Check for mistakes.
@HarshaSatyavardhan
Copy link
Copy Markdown
Contributor Author

@fabnemEPFL The repetitive Error: Username and password required crashes in the CI were happening because GitHub Actions intentionally hides repository secrets when a Pull Request is submitted from a fork. Because the original workflow attempted to authenticate with Docker Hub on every single PR, it was crashing when those hidden secrets evaluated to empty strings.

To fix this, I updated the workflow to safely skip the login-action and push steps when github.event_name == 'pull_request'. This allows the CI to verify that the Docker image builds successfully without crashing on authentication.

I also applied the Copilot suggestions for the ENTRYPOINT in the Dockerfile and updated the GPU prerequisites in the README.md. I tested this workflow on my own fork and successfully built both the amd64 and arm64 images. I've resolved the merge conflicts and restored the original IMAGE_NAME, so this should be ready to review!"
Screenshot 2026-03-28 at 4 39 29 PM

Comment thread docker/Dockerfile Outdated
@@ -1,5 +1,11 @@
FROM docker.io/lmsysorg/sglang:latest
FROM --platform=linux/amd64 lmsysorg/sglang:v0.5.8.post1-runtime
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why only AMD?

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

and why not keep the latest image?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • Tested on a local AMD system and forgot to remove the platform flag before pushing, my
    bad 😅 removed it now.
  • Pinned the version for reproducibility since latest keeps changing. Switched back to
    latest now.

Comment thread README.md
docker compose run --rm mmirage --config configs/your_config.yaml
```

The container requires an NVIDIA GPU. The `docker-compose.yml` is configured to request GPU access, but the host must have:
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

there should also be a CPU-only image (there will soon be support for API-based LLMs that do not require GPUs)

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i have created the Dockerfile.cpu using python:3.11-slim but sglang is a hard dependency in the pyproject.toml. so its fails because it needs CUDA headers doesnt exist in CPU image.

one thing we can do is move sglang and realted dependencies to project.optional-dependencies in project.toml
or we can wrap the SGlang import in the llm_processor.py with try/except
how would you like me to proceed?

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you can take inspiration from mmore's dependencies, with rules separated depending on CPU/GPU

@HarshaSatyavardhan
Copy link
Copy Markdown
Contributor Author

  • moved sglang and related packages to [project.optional-dependencies.gpu] in pyproject.toml
  • wrapped the import in llm_processor.py with try/except
  • updated both Dockerfiles (pip install .[gpu] for GPU, pip install . for CPU)
Screenshot 2026-04-03 at 2 37 58 AM

@fabnemEPFL fabnemEPFL self-requested a review April 14, 2026 11:40
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 9 out of 9 changed files in this pull request and generated 7 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

"""
super().__init__(engine_args, **kwargs)
if not SGLANG_AVAILABLE:
raise RuntimeError("SGLang is not installed. Install with: pip install '.[gpu]'")
Comment thread README.md Outdated
Comment on lines 13 to 18
@@ -14,14 +14,43 @@ To install the library, you can clone it from GitHub and then use pip to install

```bash
git clone git@github.com:EPFLiGHT/MMIRAGE.git
pip install -e ./MMIRAGE
pip install -e './MMIRAGE[gpu]'
```
Comment thread docker/Dockerfile Outdated

RUN pip install --no-cache-dir .[gpu]

ENTRYPOINT ["python3", "-c", "import mmirage.core.process, runpy; runpy.run_module('mmirage.shard_process', run_name='__main__')"]
Comment thread docker/Dockerfile.cpu Outdated

RUN pip install --no-cache-dir .

ENTRYPOINT ["python3", "-c", "import mmirage.core.process, runpy; runpy.run_module('mmirage.shard_process', run_name='__main__')"]
Comment thread docker/Dockerfile
Comment on lines +1 to +4
FROM lmsysorg/sglang:latest

ENV NVIDIA_VISIBLE_DEVICES=all
ENV NVIDIA_DRIVER_CAPABILITIES=compute,utility
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

updated the description

Comment thread .github/workflows/docker.yml Outdated
Comment on lines 92 to 93
${{ secrets.DOCKERHUB_USERNAME }}/${{ matrix.name }}:latest-${{ matrix.tag_base }}
${{ secrets.DOCKERHUB_USERNAME }}/${{ matrix.name }}:${{ github.sha }}-${{ matrix.tag_base }}
Comment thread .github/workflows/docker.yml Outdated
Comment on lines 24 to 31
@@ -26,6 +29,14 @@ jobs:
path: docker/Dockerfile
tag_base: arm64
name: mmirage-git
Copy link
Copy Markdown

@JCHAVEROT JCHAVEROT left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @HarshaSatyavardhan for your contribution!

I will most likely address the comments I made in a subsequent PR, so no need to make any changes

Comment on lines +27 to +34
- platform: ubuntu-latest
path: docker/Dockerfile.cpu
tag_base: amd64
name: mmirage-git-cpu
- platform: ubuntu-24.04-arm
path: docker/Dockerfile
path: docker/Dockerfile.cpu
tag_base: arm64
name: mmirage-git
name: mmirage-git-cpu
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's a bit weird to not use the same base image depending on the target platform, aditionnallyubuntu-latest does support linux/arm64 as per the documentation

As a future improvement we could have a workflow that reuse ubuntu-latest with several target platforms

Comment thread docker/Dockerfile

RUN pip install --no-cache-dir .[gpu]

ENTRYPOINT ["python3", "-m", "mmirage.shard_process"]
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it's wrong to use this entrypoint, this means you can only run the mmirage.shard_process and not MMIRAGE's CLI

Suggested change
ENTRYPOINT ["python3", "-m", "mmirage.shard_process"]
ENTRYPOINT ["mmirage"]

This way if you can directly call the CLI command run specifically designed to call mmirage.shard_process, and you can easily call the other CLI commands without workaround

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After discussing with Fabrice, to be coherent with the documentation we will drop the entrypoint so that we just prefix each command by mmirage as said everywhere

Comment thread README.md
Comment on lines +37 to +39
```bash
docker compose build
```
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This means you build both the GPU and CPU images while only actually using one of the two in the end

On my computer the build takes >10min, we could make it more optimized

Comment thread docker/Dockerfile
@@ -1,5 +1,11 @@
FROM docker.io/lmsysorg/sglang:latest
FROM lmsysorg/sglang:latest
Copy link
Copy Markdown

@JCHAVEROT JCHAVEROT May 12, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After testing on RCP, lmsysorg/sglang:latest is based over CUDA 13 but dependency sgl_kernel requires a CUDA 12 lib, so we'll have to fix the base image to an older version under CUDA 12

Comment thread pyproject.toml

[project.optional-dependencies]
gpu = [
"sglang>=0.5.2",
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
"sglang>=0.5.2",
"sglang==0.5.10",

@fabnemEPFL fabnemEPFL merged commit 4668761 into EPFLiGHT:main May 12, 2026
6 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Fix the docker deployment

4 participants