Feat/update unsloth template #12

Open
AanningZoeng wants to merge 3 commits into main from feat/update-unsloth-template

@AanningZoeng
Collaborator

Summary

Type of Change

  • New template
  • Update to existing template (version bump, new packages, etc.)
  • Bug fix
  • Shared infrastructure change (container-template/)
  • Documentation only

Checklist

  • Dockerfile builds successfully (docker buildx bake <target> --no-cache)
  • Container starts and all services are reachable
  • README.md updated to reflect any interface or version changes
  • No secrets, tokens, or internal hostnames committed
  • docker-bake.hcl target name and image tag are correct

Test Notes

Contributor

Copilot AI left a comment

Pull request overview

Updates the Unsloth official template (and aligns parts of the PyTorch template) by changing the base image/runtime setup, package installs, and startup/entrypoint behavior.

Changes:

  • Reworked Unsloth Dockerfile to build Python from source, install torch/cu128 + unsloth tooling, and add nginx/Jupyter configuration.
  • Updated both Unsloth and PyTorch templates to use a new start1.sh entrypoint and adjusted environment defaults.
  • Modified image tagging in both templates’ docker-bake.hcl.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 12 comments.

File: Description
  • official-templates/unsloth/Dockerfile: Major rebuild of the Unsloth image (Python install approach, packages, Jupyter/nginx wiring, new entrypoint).
  • official-templates/unsloth/docker-bake.hcl: Tag updated for Unsloth publish target.
  • official-templates/pytorch/Dockerfile: Updates startup expectations (new entrypoint), Jupyter config, and removes conda/uv sections.
  • official-templates/pytorch/docker-bake.hcl: Updates TAG_SUFFIX and changes tag format for PyTorch image publishing.

Comment on lines 9 to +17
FROM scratch AS logo
COPY container-template/yotta.txt yotta.txt

FROM scratch AS scripts
COPY container-template/start.sh start.sh
COPY container-template/start1.sh start1.sh

FROM scratch AS proxy
COPY container-template/proxy/nginx.conf nginx.conf
COPY container-template/proxy/readme.html readme.html

Copilot AI Apr 13, 2026

The scratch stages copy files from container-template/..., but that directory is not part of the official-templates/unsloth build context (and docker-bake.hcl already provides named contexts for scripts/proxy/logo). As written, this will fail to build unless the build context is changed; use COPY --from=<named-context> (or set context appropriately) instead of COPY container-template/....

Also, container-template/start1.sh does not exist in the repo, so the build will fail even if the context issue is fixed.
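
A minimal sketch of the named-context approach, assuming docker-bake.hcl really does pass contexts called logo, scripts and proxy that point into container-template/ (as the comment above states). The destination paths are illustrative and not taken from the PR, and start.sh is used because it is the only script reported to exist.

```dockerfile
# Assumption: the bake target defines named contexts "logo", "scripts" and
# "proxy". With those in place, no "FROM scratch AS ..." stages are needed.
FROM nvidia/cuda:12.8.1-cudnn-devel-ubuntu22.04

# Destination paths are hypothetical; adjust to whatever the start script expects.
COPY --from=logo yotta.txt /etc/yotta.txt
COPY --from=scripts start.sh /start.sh
COPY --from=proxy nginx.conf /etc/nginx/nginx.conf
COPY --from=proxy readme.html /usr/share/nginx/html/readme.html

RUN chmod 755 /start.sh
```

Outside of bake, the same Dockerfile would need the contexts supplied by hand, for example with `docker buildx build --build-context scripts=container-template ...`; otherwise `--from=scripts` is resolved as an image name and the build fails.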

Copilot uses AI. Check for mistakes.
Comment on lines 20 to +26
# main stage
# ===============================
FROM nvidia/cuda:12.8.1-cudnn-devel-ubuntu22.04

ARG DEBIAN_FRONTEND=noninteractive
ENV TZ=UTC \
# Re-declare ARGs after FROM
ARG PYTHON_VERSION="3.11.14"

Copilot AI Apr 13, 2026

docker-bake.hcl provides a BASE_IMAGE build arg for this template, but the Dockerfile hard-codes FROM nvidia/cuda:12.8.1-cudnn-devel-ubuntu22.04. This makes the bake arg ineffective and contradicts the template documentation that says it is built on unsloth/unsloth. Consider switching to ARG BASE_IMAGE + FROM ${BASE_IMAGE} (or removing the arg/docs if the base image is intentionally fixed).
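
The ARG BASE_IMAGE + FROM ${BASE_IMAGE} pattern could look roughly like this. The default value simply mirrors the image currently hard-coded in the PR; whether that or unsloth/unsloth is the intended base is for the author to decide.

```dockerfile
# An ARG declared before FROM is only visible to FROM lines.
# The default mirrors the PR's hard-coded image; bake can override it via args.
ARG BASE_IMAGE=nvidia/cuda:12.8.1-cudnn-devel-ubuntu22.04
FROM ${BASE_IMAGE}

# ARGs go out of scope at FROM, so anything the build stage needs is re-declared.
ARG DEBIAN_FRONTEND=noninteractive
ARG PYTHON_VERSION="3.11.14"
```

With this in place the BASE_IMAGE value in docker-bake.hcl actually takes effect, and a one-off build can swap the base with `--build-arg BASE_IMAGE=...`.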

Comment on lines +36 to +37
LD_LIBRARY_PATH=/usr/local/nvidia/lib64:$LD_LIBRARY_PATH \
JUPYTER_PASSWORD=yotta

Copilot AI Apr 13, 2026

Setting JUPYTER_PASSWORD to a fixed default (yotta) changes the runtime behavior of the shared startup convention: Jupyter becomes enabled by default (in container-template/start.sh, Jupyter starts whenever JUPYTER_PASSWORD is set). This also makes every container share the same token unless users override it. Consider leaving it unset by default and documenting/passing it at runtime instead.

Suggested change
LD_LIBRARY_PATH=/usr/local/nvidia/lib64:$LD_LIBRARY_PATH \
JUPYTER_PASSWORD=yotta
LD_LIBRARY_PATH=/usr/local/nvidia/lib64:$LD_LIBRARY_PATH

Comment on lines +131 to +136
'c.ServerApp.token = "yotta"' \
'c.ServerApp.password = ""' \
'c.ServerApp.allow_remote_access = True' \
'c.ServerApp.allow_origin = "*"' \
'c.NotebookApp.token = "yotta"' \
'c.NotebookApp.password = ""' \

Copilot AI Apr 13, 2026

The Jupyter config hard-codes a token (c.ServerApp.token = "yotta") and allows any origin. This prevents runtime configuration via JUPYTER_PASSWORD and increases exposure if the notebook server is reachable. Prefer sourcing the token from an env var at startup (or omit the config and let the startup script/CLI flags control auth/origin).

Suggested change
'c.ServerApp.token = "yotta"' \
'c.ServerApp.password = ""' \
'c.ServerApp.allow_remote_access = True' \
'c.ServerApp.allow_origin = "*"' \
'c.NotebookApp.token = "yotta"' \
'c.NotebookApp.password = ""' \
'c.ServerApp.allow_remote_access = True' \

Comment on lines +187 to +192
# ===============================
# Entrypoint: root runs start1.sh
# ===============================
USER root
WORKDIR /root
CMD ["/bin/bash", "-c", "exec /bin/bash /start1.sh"]

Copilot AI Apr 13, 2026

The image now runs /start1.sh, but the template docs (README) instruct users to run /start.sh, and the shared container-template only contains start.sh today. Either add/commit start1.sh (and update docs accordingly) or keep the entrypoint aligned with /start.sh to avoid breaking documented usage.

@@ -22,13 +22,13 @@ ENV DEBIAN_FRONTEND=noninteractive \
SHELL=/bin/bash \
PATH=/opt/conda/bin:/usr/local/nvidia/bin:/usr/local/cuda/bin:/usr/local/bin:$PATH \

Copilot AI Apr 13, 2026

This Dockerfile removed the Miniconda installation, but PATH still prepends /opt/conda/bin. With conda no longer present, this is misleading and can mask issues (e.g., users expecting conda to exist). Consider removing /opt/conda/bin from PATH (or reintroducing conda if it’s still required).

Suggested change
PATH=/opt/conda/bin:/usr/local/nvidia/bin:/usr/local/cuda/bin:/usr/local/bin:$PATH \
PATH=/usr/local/nvidia/bin:/usr/local/cuda/bin:/usr/local/bin:$PATH \

Comment on lines +103 to +110
# Configure JupyterLab: auto-login with token (no password prompt)
# ===============================
RUN mkdir -p /root/.jupyter && printf '%s\n' \
'c.ServerApp.token = "yotta"' \
'c.ServerApp.password = ""' \
'c.ServerApp.allow_remote_access = True' \
'c.ServerApp.allow_origin = "*"' \
'c.NotebookApp.token = "yotta"' \

Copilot AI Apr 13, 2026

The Jupyter config hard-codes a token ("yotta") and allows any origin. This prevents controlling auth via the documented JUPYTER_PASSWORD env var and increases exposure if the notebook server is reachable. Prefer sourcing the token from an env var at startup (or omit this file and rely on the startup script/CLI flags).
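
One way to source the token from the environment at startup is to let the generated config file read JUPYTER_PASSWORD itself, since Jupyter config files are plain Python. This is a sketch under assumptions: the file name jupyter_server_config.py and the decision to drop allow_origin = "*" are choices made here, not something the PR specifies.

```dockerfile
FROM nvidia/cuda:12.8.1-cudnn-devel-ubuntu22.04

# The config is evaluated each time the server starts, so the token comes from
# the container's runtime environment rather than being baked into the image.
# If JUPYTER_PASSWORD is unset, the token is left untouched and Jupyter falls
# back to generating a random one instead of running with a shared secret.
RUN mkdir -p /root/.jupyter && printf '%s\n' \
    'import os' \
    '_token = os.environ.get("JUPYTER_PASSWORD")' \
    'if _token:' \
    '    c.ServerApp.token = _token' \
    'c.ServerApp.allow_remote_access = True' \
    > /root/.jupyter/jupyter_server_config.py
```

A container started with `docker run -e JUPYTER_PASSWORD=<secret> ...` then gets its own token, and no default value has to live in the Dockerfile.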

Comment on lines +140 to +142
COPY --from=scripts start1.sh /start1.sh
RUN chmod 755 /start1.sh && \
sed -i 's/\r$//' /start1.sh

Copilot AI Apr 13, 2026

COPY --from=scripts start1.sh /start1.sh references start1.sh, but the shared container-template/ currently only contains start.sh (no start1.sh). This will fail at build time unless start1.sh is added to the scripts context or the Dockerfile is switched back to /start.sh.

tags = [
"${PUBLISHER}/pytorch:2.9.0-py3.11-cuda12.8.1-cudnn-devel-ubuntu22.04"
"${PUBLISHER}/pytorch:${TAG_SUFFIX}"

Copilot AI Apr 13, 2026

The image tag was changed to ${PUBLISHER}/pytorch:${TAG_SUFFIX}, which drops the descriptive version components (torch/cuda/python/ubuntu) used elsewhere in this repo’s template tags (e.g., ...:cuda12.8.1-ubuntu22.04-${TAG_SUFFIX}). Consider keeping the descriptive prefix and appending the date suffix to preserve discoverability and avoid ambiguity.

Suggested change
"${PUBLISHER}/pytorch:${TAG_SUFFIX}"
"${PUBLISHER}/pytorch:cuda12.8.1-ubuntu22.04-${TAG_SUFFIX}"

args = {
BASE_IMAGE = "nvidia/cuda:12.8.1-cudnn-devel-ubuntu22.04"
BASE_IMAGE = "nvidia/cuda:12.8.1-cudnn-devel-ubuntu22.04"

Copilot AI Apr 13, 2026

There are trailing spaces after the BASE_IMAGE value. This is minor, but it creates noisy diffs and can fail stricter formatting checks—please trim the whitespace.

Suggested change
BASE_IMAGE = "nvidia/cuda:12.8.1-cudnn-devel-ubuntu22.04"
BASE_IMAGE = "nvidia/cuda:12.8.1-cudnn-devel-ubuntu22.04"

@dadaism
Contributor

dadaism commented Apr 15, 2026

PR Hygiene

  • Summary and test notes are empty, as with all prior PRs.
  • Fifth PR with identical pytorch changes: PRs #6, #9, #10, #11, #12. Please land these once and rebase.
  • Image tag doesn't match the actual stack: docker-bake.hcl:25 says unsloth:0.6.9-py3.11-cuda12.1-... but the Dockerfile installs whatever latest unsloth is on PyPI (not pinned to 0.6.9) and uses CUDA 12.8.1. The tag is wrong on both version and CUDA.

Unsloth Dockerfile: Issues

Critical
tini removed from entrypoint — The old image used ENTRYPOINT ["/usr/bin/tini","--"] for PID 1 signal handling and zombie reaping. The new entrypoint is bare bash. tini is no longer even installed (removed from the apt layer). Child processes (Jupyter, nginx, sshd) won't be reaped on exit. Either add tini back to apt and use it as ENTRYPOINT, or install and use dumb-init.
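
Restoring tini might look like the fragment below. It assumes the Ubuntu 22.04 tini package (which installs /usr/bin/tini, the path the old ENTRYPOINT used) and that /start1.sh is copied into the image by another layer not shown here.

```dockerfile
FROM nvidia/cuda:12.8.1-cudnn-devel-ubuntu22.04
ARG DEBIAN_FRONTEND=noninteractive

# tini runs as PID 1, forwards signals to the start script and reaps zombies.
RUN apt-get update \
    && apt-get install -y --no-install-recommends tini \
    && rm -rf /var/lib/apt/lists/*

# Assumption: /start1.sh is added elsewhere in the Dockerfile.
ENTRYPOINT ["/usr/bin/tini", "--"]
CMD ["/bin/bash", "/start1.sh"]
```

Keeping the script in CMD rather than ENTRYPOINT means users can still override the command at `docker run` time while tini stays in front of it.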

COPY . /build_context removed but no replacement for local notebooks — The old Dockerfile had a (clunky) mechanism to overlay local unsloth-notebooks/ from the build context. That's gone, but the git clone of the upstream notebooks repo remains. If the team had local notebook customizations in the build context, those are now silently dropped.

SSH config not configured — Same issue across all PRs. The old sshd wrapper and user setup are removed. start1.sh calls service ssh start, but without PermitRootLogin yes in sshd_config, root SSH won't work. The old image ran SSH via a custom wrapper that stripped -D; that's also gone.
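
A possible sshd setup layer, assuming password-based root login is really what the template wants. Ubuntu's stock sshd_config ships with a commented-out `PermitRootLogin prohibit-password` line, which the sed expression rewrites. The trailing grep is an extra guard added here so the build fails if the pattern ever stops matching.

```dockerfile
FROM nvidia/cuda:12.8.1-cudnn-devel-ubuntu22.04
ARG DEBIAN_FRONTEND=noninteractive

# Install sshd, create its runtime directory, and enable root logins.
# The final grep makes the build fail loudly if the sed did not take effect.
RUN apt-get update \
    && apt-get install -y --no-install-recommends openssh-server \
    && rm -rf /var/lib/apt/lists/* \
    && mkdir -p /var/run/sshd \
    && sed -i -E 's/^#?PermitRootLogin[[:space:]].*/PermitRootLogin yes/' /etc/ssh/sshd_config \
    && grep -q '^PermitRootLogin yes$' /etc/ssh/sshd_config
```

If key-only access is enough, setting the value to `prohibit-password` instead of `yes` is the more conservative choice.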

vLLM added without documentation or exposed port — Dockerfile:93 installs vllm but the template has no README mention of it, no vLLM startup logic, and no port exposed for it (8000). If this is intentional (library use only), it's a very heavy dependency (~2GB+ in the image) for import-only usage. If it's meant to be served, it needs an entrypoint hook.

Medium
Unsloth not pinned — pip install unsloth installs latest. The bake tag claims 0.6.9. Pin it: unsloth==0.6.9 or update the tag.

Unpinned notebook clone — Dockerfile:118. git clone --depth=1 of HEAD from unslothai/notebooks. Not reproducible.
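
Both pinning points could be handled with build args, roughly as follows. The 0.6.9 default is copied from the bake tag quoted above and has not been checked against PyPI; NOTEBOOKS_REF is a hypothetical arg name, and the clone destination is illustrative.

```dockerfile
FROM nvidia/cuda:12.8.1-cudnn-devel-ubuntu22.04
ARG DEBIAN_FRONTEND=noninteractive

RUN apt-get update \
    && apt-get install -y --no-install-recommends ca-certificates git python3 python3-pip \
    && rm -rf /var/lib/apt/lists/*

# Version taken from the image tag in the review; confirm it exists before use.
ARG UNSLOTH_VERSION=0.6.9
RUN python3 -m pip install --no-cache-dir "unsloth==${UNSLOTH_VERSION}"

# Hypothetical arg: a commit SHA or tag of unslothai/notebooks, passed at build time.
# The build aborts if it is empty, so HEAD is never picked up by accident.
ARG NOTEBOOKS_REF
RUN test -n "${NOTEBOOKS_REF}" \
    && git init -q /workspace/unsloth-notebooks \
    && cd /workspace/unsloth-notebooks \
    && git remote add origin https://github.com/unslothai/notebooks.git \
    && git fetch -q --depth=1 origin "${NOTEBOOKS_REF}" \
    && git checkout -q FETCH_HEAD
```

Fetching a single ref with --depth=1 keeps the layer as small as the original shallow clone while making the notebook contents reproducible.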

chmod 777 /workspace /root — Dockerfile:42. Root's home directory world-writable. Same issue in the "Supervisor dirs" block at line ~133 with chmod 777 /var/log/supervisor /workspace /var/run /var/lib/nginx.

Duplicate directory creation — /var/run/sshd, /var/log/supervisor, and /workspace are created/chmod'd in multiple RUN layers (apt block, workspace block, supervisor block).
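
The directory and permission points above could be folded into one layer along these lines. The modes are a suggestion based on the container running as root, in which case world-writable directories buy nothing; they are not values taken from the PR.

```dockerfile
FROM nvidia/cuda:12.8.1-cudnn-devel-ubuntu22.04

# Create every runtime directory once, in a single layer, with modes tighter
# than 777. Root can still write everywhere it needs to.
RUN mkdir -p /workspace /var/run/sshd /var/log/supervisor \
    && chmod 755 /workspace /var/run/sshd /var/log/supervisor \
    && chmod 700 /root
```

If a non-root user is reintroduced later, a targeted chown of /workspace would be preferable to reopening the permissions.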

--enable-optimizations removed from Python build — Recurring issue.
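
For reference, a source build with the flag restored might look like this. The package list is a typical set of CPython build dependencies chosen here, not the PR's actual apt layer, and the download URL follows python.org's standard layout for source tarballs.

```dockerfile
FROM nvidia/cuda:12.8.1-cudnn-devel-ubuntu22.04
ARG DEBIAN_FRONTEND=noninteractive
ARG PYTHON_VERSION="3.11.14"

# Assumed build dependencies; trim or extend to match the real apt layer.
RUN apt-get update \
    && apt-get install -y --no-install-recommends \
        build-essential ca-certificates wget \
        libssl-dev zlib1g-dev libbz2-dev libreadline-dev \
        libsqlite3-dev libffi-dev liblzma-dev \
    && rm -rf /var/lib/apt/lists/*

# --enable-optimizations turns on profile-guided optimization: the build takes
# noticeably longer, but the resulting interpreter is faster at runtime.
RUN wget -q "https://www.python.org/ftp/python/${PYTHON_VERSION}/Python-${PYTHON_VERSION}.tgz" -O /tmp/python.tgz \
    && tar -xzf /tmp/python.tgz -C /tmp \
    && cd "/tmp/Python-${PYTHON_VERSION}" \
    && ./configure --enable-optimizations \
    && make -j"$(nproc)" \
    && make altinstall \
    && cd / \
    && rm -rf /tmp/python.tgz "/tmp/Python-${PYTHON_VERSION}"
```

`make altinstall` installs the versioned binary under /usr/local/bin without replacing the system python3.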

Hardcoded Jupyter token "yotta" — Recurring issue across all templates.

sed -i 's/\r$//' on start1.sh — Fix line endings at source.

Low
sudo still installed but no non-root users exist — Dockerfile:52. The ubuntu and unsloth users are removed, and the container runs as root. sudo is dead weight in the image.

Build-time assertion doesn't import unsloth — Dockerfile:103. The comment says "unsloth requires GPU at import time" so it only checks importlib.metadata.version. Fair enough, but consider python -c "import unsloth" with a || true fallback so you at least see if it explodes at build time.
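
The two-tier check suggested above could be sketched as below. The install line is only there to make the fragment stand on its own and is deliberately simplified; the real Dockerfile installs unsloth in its own layer.

```dockerfile
FROM nvidia/cuda:12.8.1-cudnn-devel-ubuntu22.04
ARG DEBIAN_FRONTEND=noninteractive

# Simplified stand-in for the PR's install layer (pin the version as appropriate).
RUN apt-get update \
    && apt-get install -y --no-install-recommends python3 python3-pip \
    && rm -rf /var/lib/apt/lists/* \
    && python3 -m pip install --no-cache-dir unsloth

# Hard check: the build fails if the package metadata is missing.
RUN python3 -c "import importlib.metadata as m; print('unsloth', m.version('unsloth'))"

# Soft check: attempt the import, but only warn, since it may need a GPU.
RUN python3 -c "import unsloth" \
    || echo "WARNING: 'import unsloth' failed at build time (may be expected without a GPU)"
```

The warning and any traceback end up in the build log, so an unexpected import error is at least visible without blocking GPU-less builders.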

Old entry.sh with Jupyter-as-PID1 removed — Good cleanup. But note that the old image ran Jupyter directly via entry.sh with --ServerApp.token="" (no auth). The new image uses start1.sh which sets --ServerApp.token=$JUPYTER_PASSWORD. This is a behavior change — existing users who expected no-auth Jupyter will now need the token. Worth documenting.

Pytorch Dockerfile — Recurring Issues
Same as PRs #6/#9/#10/#11. All still apply.

Positive Changes

  • Removing the venv in favor of system Python simplifies the image.
  • Removing the COPY . /build_context hack is a good cleanup.
  • Removing the sshd wrapper (which stripped -D) simplifies SSH.
  • Adding SHELL ["/bin/bash", "-o", "pipefail", "-c"] is a best practice.
  • The build-time assertion for package versions is a nice touch.

Summary

The rewrite is a clear improvement in structure: cleaner apt layer, no venv, no user juggling. Main concerns:

  • tini is gone: zombie processes won't be reaped.
  • SSH will reject root logins: no sshd_config patching.
  • Image tag is wrong: it says cuda12.1/unsloth-0.6.9, but the image is cuda12.8.1/unsloth-latest.
  • vLLM added without purpose: heavy dependency with no startup logic or documentation.
  • Split the pytorch changes out.
