Commit dafd239

Merge branch 'main' into vdinh/glm4moe-mtp-boundary-shard-fix
2 parents beb4ae6 + 6035e7c commit dafd239

55 files changed

Lines changed: 3569 additions & 222 deletions

.main.commit

Lines changed: 1 addition & 1 deletion
@@ -1 +1 @@
-6204b925f3da8b998524c6bb47a9ca779d95ce2e
+a95d866d165727250b711957587c2edbc9952f10

README.md

Lines changed: 1 addition & 1 deletion
@@ -21,7 +21,7 @@
 
 - [05/20/2026] [**Nemotron-3 Nano Omni**](https://huggingface.co/nvidia/Nemotron-3-Nano-Omni-30B-A3B-Reasoning-BF16) day-0 branch support is now merged on **main**! The 30B-A3B MoE multimodal model supports image, video, audio, and text workflows with checkpoint conversion, inference, SFT, and PEFT (LoRA) examples. Read the [NVIDIA Blog](https://blogs.nvidia.com/blog/nemotron-3-nano-omni-multimodal-ai-agents/) and see the [examples README](https://github.com/NVIDIA-NeMo/Megatron-Bridge/blob/main/examples/models/nemotron/nemotron_3_omni/README.md) for the full walkthrough.
 
-- [05/19/2026] [**Nemotron-Labs Diffusion**](https://github.com/NVIDIA-NeMo/Megatron-Bridge/tree/main/examples/diffusion/recipes/nemotron_labs_diffusion) is now supported on **main** with autoregressive-to-diffusion conversion, continuous pretraining, checkpoint conversion, and inference workflows. Read the [NVIDIA Research blog](https://research.nvidia.com/publication/2026-05_nemotron-labs-diffusion-tri-mode-language-model-unifying-autoregressive) for the tri-mode language model overview.
+- [05/19/2026] [**Nemotron-Labs Diffusion**](https://github.com/NVIDIA-NeMo/Megatron-Bridge/tree/main/examples/models/nemotron_labs_diffusion) is now supported on **main** with autoregressive-to-diffusion conversion, continuous pretraining, checkpoint conversion, and inference workflows. Read the [NVIDIA Research blog](https://research.nvidia.com/publication/2026-05_nemotron-labs-diffusion-tri-mode-language-model-unifying-autoregressive) for the tri-mode language model overview.
 
 - [05/06/2026] [**Gemma 4 VL 26B-A4B**](https://github.com/NVIDIA-NeMo/Megatron-Bridge/tree/main/examples/models/gemma/gemma4_vl) is now supported! Checkpoint conversion, SFT, and PEFT (LoRA) recipes for Google's MoE vision-language model (26B total / 4B active params, 128 experts top-k=8, dual sliding/global attention with K=V tying on full-attention layers) are available on **main**. See the [examples README](https://github.com/NVIDIA-NeMo/Megatron-Bridge/blob/main/examples/models/gemma/gemma4_vl/README.md) for the full walkthrough.
 

docker/Dockerfile.ci

Lines changed: 14 additions & 0 deletions
@@ -111,3 +111,17 @@ RUN --mount=type=cache,target=/var/cache/apt,sharing=locked \
     uv cache prune ${UV_CACHE_PRUNE_ARGS}
 
 COPY --chmod=644 . /opt/Megatron-Bridge
+
+##############################################################################
+##
+## Verify the environment imports
+##
+##############################################################################
+
+# Fail the build if any installed package in /opt/venv cannot be imported. A
+# version/ABI skew can leave a package installed yet unimportable, which
+# `pip check` does not catch. The build has no GPU, so modules that need a
+# driver or host library at import are exempted in import_check_skip.txt.
+RUN --mount=type=bind,source=docker/common/import_check.py,target=/opt/import_check.py \
+    --mount=type=bind,source=docker/common/import_check_skip.txt,target=/opt/import_check_skip.txt \
+    python /opt/import_check.py --jobs 16 --skip-file /opt/import_check_skip.txt
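
The Dockerfile comment above turns on the difference between a package being installed and a package being importable. A minimal sketch of that distinction, assuming nothing beyond the standard library: the helper name `can_import` and the deliberately bogus module name are illustrative and do not appear in the commit.

```python
import subprocess
import sys


def can_import(module: str, timeout: float = 60.0) -> bool:
    """Return True when `import <module>` succeeds in a fresh interpreter.

    Hypothetical helper for illustration. Running the import in a child
    process means a crash or hang in the module cannot take down the caller.
    """
    try:
        proc = subprocess.run(
            [sys.executable, "-c", f"import {module}"],
            capture_output=True,
            text=True,
            timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        # A hung import is treated the same as a failed one.
        return False
    return proc.returncode == 0


if __name__ == "__main__":
    # `json` ships with CPython; the second name is assumed not to exist.
    print("json:", can_import("json"))
    print("bogus:", can_import("no_such_module_for_import_check_demo"))
```

A metadata-only check would never reach the second case, because it does not execute any import statement.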

docker/Dockerfile.fw_base

Lines changed: 14 additions & 0 deletions
@@ -147,3 +147,17 @@ RUN --mount=type=bind,from=vllm_wheel,source=/src/vllm/,target=/tmp/vllm/ \
 FROM ${FW_BASE_FINAL} AS nemo_fw_base_final
 
 WORKDIR /opt
+
+##############################################################################
+##
+## Verify the environment imports
+##
+##############################################################################
+
+# Fail the build if any installed package in /opt/venv cannot be imported. A
+# version/ABI skew can leave a package installed yet unimportable, which
+# `pip check` does not catch. The build has no GPU, so modules that need a
+# driver or host library at import are exempted in import_check_skip.txt.
+RUN --mount=type=bind,source=docker/common/import_check.py,target=/opt/import_check.py \
+    --mount=type=bind,source=docker/common/import_check_skip.txt,target=/opt/import_check_skip.txt \
+    python /opt/import_check.py --jobs 16 --skip-file /opt/import_check_skip.txt
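
The `--jobs 16` flag in the RUN step bounds how many import subprocesses run at once. A rough sketch of that fan-out pattern, assuming a thread pool that drives one child interpreter per module: `exit_code` and `sweep` are hypothetical names, and the module list is made up for the demonstration.

```python
import subprocess
import sys
from concurrent.futures import ThreadPoolExecutor


def exit_code(module: str) -> int:
    """Import one module in a child interpreter and return its exit status."""
    proc = subprocess.run(
        [sys.executable, "-c", f"import {module}"],
        capture_output=True,
    )
    return proc.returncode


def sweep(modules: list[str], jobs: int) -> dict[str, int]:
    """Import every module with at most `jobs` subprocesses in flight.

    Threads are enough here because each worker only waits on a child
    process. Executor.map yields results in input order, so the output
    is deterministic even though imports finish in arbitrary order.
    """
    names = sorted(modules)
    with ThreadPoolExecutor(max_workers=jobs) as pool:
        codes = list(pool.map(exit_code, names))
    return dict(zip(names, codes))


if __name__ == "__main__":
    # Illustrative input: two stdlib modules and one assumed-missing name.
    for name, code in sweep(["os", "json", "no_such_module_for_sweep_demo"], jobs=3).items():
        print(name, "ok" if code == 0 else f"failed ({code})")
```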

docker/Dockerfile.fw_final

Lines changed: 14 additions & 0 deletions
@@ -139,6 +139,20 @@ RUN GRPC_VERSION=1.79.3 && \
     -o "${WANDB_CORE_BIN}" ./cmd/wandb-core/ && \
     rm -rf /tmp/wandb-src /tmp/go /tmp/gopath
 
+##############################################################################
+##
+## Verify the environment imports
+##
+##############################################################################
+
+# Fail the build if any installed package in /opt/venv cannot be imported. A
+# version/ABI skew can leave a package installed yet unimportable, which
+# `pip check` does not catch. The build has no GPU, so modules that need a
+# driver or host library at import are exempted in import_check_skip.txt.
+RUN --mount=type=bind,source=docker/common/import_check.py,target=/opt/import_check.py \
+    --mount=type=bind,source=docker/common/import_check_skip.txt,target=/opt/import_check_skip.txt \
+    python /opt/import_check.py --jobs 16 --skip-file /opt/import_check_skip.txt
+
 ##############################################################################
 ##
 ## Finalize FW container

docker/common/fw_pyproject.toml

Lines changed: 4 additions & 0 deletions
@@ -73,6 +73,10 @@ override-dependencies = [
     "protobuf==6.33.5",
     "cuda-python>=13.0.0",
     "multi-storage-client!=0.36.0",
+    # 0.1.12rc0's tvm_ffi is incompatible with the TVM that tilelang vendors and
+    # breaks `import tilelang`/`import mamba_ssm`; an explicit pin is required
+    # because prerelease="allow" would otherwise float the unlocked sync onto it.
+    "apache-tvm-ffi==0.1.11",
     "levenshtein; sys_platform == 'never'",
     "pytest==8.4.2",
     "wandb>=0.25.0",

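The `apache-tvm-ffi==0.1.11` pin above can be spot-checked inside a built environment by comparing the installed distribution version with the pinned one. This is only a sketch under stated assumptions: `pin_status` is a hypothetical helper that is not part of the commit, and it compares version strings literally instead of parsing them.

```python
from importlib import metadata


def pin_status(dist_name: str, pinned: str) -> str:
    """Compare an installed distribution's version with an expected pin.

    Hypothetical helper for illustration. The comparison is a plain string
    match, which is adequate for an exact `==` pin such as the one above.
    """
    try:
        installed = metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return "not installed"
    return "ok" if installed == pinned else f"mismatch: {installed}"


if __name__ == "__main__":
    # Distribution name and version are taken from the override above.
    print(pin_status("apache-tvm-ffi", "0.1.11"))
```

A matching version only confirms that the resolver honoured the pin. Whether `tilelang` and `mamba_ssm` actually import is still a question for the import check.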
docker/common/import_check.py

Lines changed: 256 additions & 0 deletions
@@ -0,0 +1,256 @@
+# Copyright (c) 2026, NVIDIA CORPORATION. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+"""Smoke-import every installed distribution's top-level modules.
+
+This guards against shipping an environment whose installed packages cannot
+actually be imported -- for example an ABI or version skew between a vendored
+shared library and its Python bindings. ``pip check`` only validates declared
+version ranges; it never exercises an import, so a nominally-satisfied
+dependency that crashes the moment it is imported slips through. This script
+closes that gap: it enumerates installed distributions, resolves each one's
+top-level importable modules, imports every module in an isolated subprocess,
+and exits non-zero if any import raises (or hangs).
+
+Run with no arguments to check the active interpreter's environment::
+
+    python import_check.py
+
+Because an image is built without a GPU, packages that require a driver or a
+host library at import time are not importable at build time. List those in a
+skip file (one module per line, ``#`` comments allowed) and pass it via
+``--skip-file``; every other import error then fails the build.
+"""
+
+from __future__ import annotations
+
+import argparse
+import importlib.metadata as md
+import subprocess
+import sys
+from concurrent.futures import ThreadPoolExecutor
+from dataclasses import dataclass
+from pathlib import Path
+
+
+@dataclass(frozen=True)
+class ImportResult:
+    """Outcome of importing a single top-level module.
+
+    Attributes:
+        module: The top-level module name that was imported.
+        ok: True when the import subprocess exited cleanly.
+        detail: Trailing diagnostic output when the import failed, else "".
+    """
+
+    module: str
+    ok: bool
+    detail: str
+
+
+def _clean_module_name(filename: str) -> str | None:
+    """Reduce a top-level filename to the module name Python would import.
+
+    Extension modules carry an ABI tag (``foo.cpython-312-x86_64-linux-gnu.so``)
+    that is not part of the import name, so everything from the first dot is
+    dropped. Pure-Python files lose their ``.py`` suffix.
+
+    Args:
+        filename: The basename of a top-level file.
+
+    Returns:
+        The import name, or None when the file is not an importable module.
+    """
+    if filename.endswith((".so", ".pyd")):
+        return filename.split(".", 1)[0]
+    if filename.endswith(".py"):
+        return filename[:-3]
+    return None
+
+
+def _modules_from_files(dist: md.Distribution) -> set[str]:
+    """Derive importable top-level modules from a distribution's file list.
+
+    Deriving from the recorded files (rather than ``top_level.txt``) reflects
+    what is actually on disk, which avoids both stale metadata entries that name
+    non-importable helpers and ABI-tagged extension filenames.
+
+    Args:
+        dist: The installed distribution to inspect.
+
+    Returns:
+        The set of importable top-level module names.
+    """
+    modules: set[str] = set()
+    for entry in dist.files or []:
+        parts = entry.parts
+        if not parts:
+            continue
+        head = parts[0]
+        if head.endswith((".dist-info", ".data", ".egg-info")) or head == "__pycache__":
+            continue
+        if len(parts) == 1:
+            name = _clean_module_name(head)
+            candidate = name if name and name != "__init__" else None
+        else:
+            candidate = head if entry.suffix in {".py", ".so", ".pyd"} else None
+        if candidate and candidate.isidentifier():
+            modules.add(candidate)
+    return modules
+
+
+def _modules_from_top_level_txt(dist: md.Distribution) -> set[str]:
+    """Return importable module names declared in ``top_level.txt``.
+
+    Used only when the file list is unavailable.
+
+    Args:
+        dist: The installed distribution to inspect.
+
+    Returns:
+        The set of top-level module names that are valid identifiers.
+    """
+    text = dist.read_text("top_level.txt")
+    if not text:
+        return set()
+    return {line.strip() for line in text.splitlines() if line.strip().isidentifier()}
+
+
+def discover_modules(skip: set[str]) -> dict[str, list[str]]:
+    """Map each importable top-level module to the distributions providing it.
+
+    Args:
+        skip: Module names to exclude from the result.
+
+    Returns:
+        A mapping of module name to the sorted list of distribution names that
+        provide it, excluding any module in ``skip``.
+    """
+    providers: dict[str, set[str]] = {}
+    for dist in md.distributions():
+        name = dist.metadata["Name"] or "<unknown>"
+        modules = _modules_from_files(dist) or _modules_from_top_level_txt(dist)
+        for module in modules:
+            if module in skip:
+                continue
+            providers.setdefault(module, set()).add(name)
+    return {module: sorted(names) for module, names in providers.items()}
+
+
+def import_one(module: str, timeout: float) -> ImportResult:
+    """Import a single module in a fresh subprocess.
+
+    Isolation in a subprocess keeps a hard crash (segfault, ``os._exit``) or a
+    hang in one module from aborting the whole sweep.
+
+    Args:
+        module: The top-level module name to import.
+        timeout: Seconds to allow before treating the import as hung.
+
+    Returns:
+        The :class:`ImportResult` for this module.
+    """
+    try:
+        proc = subprocess.run(
+            [sys.executable, "-c", f"import {module}"],
+            capture_output=True,
+            text=True,
+            timeout=timeout,
+        )
+    except subprocess.TimeoutExpired:
+        return ImportResult(module, False, f"timed out after {timeout:.0f}s")
+    if proc.returncode == 0:
+        return ImportResult(module, True, "")
+    tail = "\n".join((proc.stderr or proc.stdout).strip().splitlines()[-3:])
+    return ImportResult(module, False, tail or f"exit code {proc.returncode}")
+
+
+def load_skip(skip_file: Path | None) -> set[str]:
+    """Read a newline-delimited skip list, ignoring blanks and comments.
+
+    Args:
+        skip_file: Path to the skip file, or None.
+
+    Returns:
+        The set of module names to skip.
+    """
+    if skip_file is None or not skip_file.exists():
+        return set()
+    skip: set[str] = set()
+    for line in skip_file.read_text().splitlines():
+        token = line.split("#", 1)[0].strip()
+        if token:
+            skip.add(token)
+    return skip
+
+
+def run(skip: set[str], jobs: int, timeout: float) -> list[ImportResult]:
+    """Import every discovered module and collect results.
+
+    Args:
+        skip: Module names to exclude.
+        jobs: Maximum number of concurrent import subprocesses.
+        timeout: Per-import timeout in seconds.
+
+    Returns:
+        The import results sorted by module name.
+    """
+    modules = discover_modules(skip)
+    with ThreadPoolExecutor(max_workers=jobs) as pool:
+        results = pool.map(lambda m: import_one(m, timeout), sorted(modules))
+        return list(results)
+
+
+def report(results: list[ImportResult], skipped: set[str]) -> int:
+    """Print a summary and return the process exit code.
+
+    Args:
+        results: The import results to summarize.
+        skipped: Module names that were skipped.
+
+    Returns:
+        1 if any import failed, otherwise 0.
+    """
+    failures = [r for r in results if not r.ok]
+    for failure in failures:
+        print(f"FAIL {failure.module}")
+        for line in failure.detail.splitlines():
+            print(f" {line}")
+    passed = len(results) - len(failures)
+    print(f"\nimport check: {passed} ok, {len(failures)} failed, {len(skipped)} skipped, {len(results)} total")
+    return 1 if failures else 0
+
+
+def main(argv: list[str] | None = None) -> int:
+    """Entry point for the import-check CLI.
+
+    Args:
+        argv: Argument vector, defaulting to ``sys.argv``.
+
+    Returns:
+        The process exit code.
+    """
+    parser = argparse.ArgumentParser(description="Smoke-import installed packages.")
+    parser.add_argument("--skip-file", type=Path, default=None)
+    parser.add_argument("--jobs", type=int, default=8)
+    parser.add_argument("--timeout", type=float, default=120.0)
+    args = parser.parse_args(argv)
+
+    skip = load_skip(args.skip_file)
+    results = run(skip, args.jobs, args.timeout)
+    return report(results, skip)
+
+
+if __name__ == "__main__":
+    raise SystemExit(main())
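
The discovery step in `import_check.py` derives module names from a distribution's recorded file list. The sketch below replays that logic on a made-up file list so the filtering rules are easier to see. The `RECORD` entries are invented for illustration, plain `PurePosixPath` objects stand in for the `PackagePath` entries that `importlib.metadata` returns (a subclass of `PurePosixPath`), and the function names drop the leading underscore to mark them as standalone copies.

```python
from __future__ import annotations

from pathlib import PurePosixPath


def clean_module_name(filename: str) -> str | None:
    """Strip the ABI tag or `.py` suffix from a top-level filename."""
    if filename.endswith((".so", ".pyd")):
        return filename.split(".", 1)[0]
    if filename.endswith(".py"):
        return filename[:-3]
    return None


def top_level_modules(files: list[PurePosixPath]) -> set[str]:
    """Collect importable top-level names from a list of recorded paths."""
    modules: set[str] = set()
    for entry in files:
        parts = entry.parts
        if not parts:
            continue
        head = parts[0]
        # Metadata directories and bytecode caches are never importable.
        if head.endswith((".dist-info", ".data", ".egg-info")) or head == "__pycache__":
            continue
        if len(parts) == 1:
            name = clean_module_name(head)
            candidate = name if name and name != "__init__" else None
        else:
            # A directory counts only if it holds Python or extension files.
            candidate = head if entry.suffix in {".py", ".so", ".pyd"} else None
        if candidate and candidate.isidentifier():
            modules.add(candidate)
    return modules


# Hypothetical file list; none of these paths come from a real package.
RECORD = [
    PurePosixPath("foo/__init__.py"),
    PurePosixPath("foo/data/table.json"),
    PurePosixPath("_speedups.cpython-312-x86_64-linux-gnu.so"),
    PurePosixPath("single.py"),
    PurePosixPath("foo-1.0.dist-info/METADATA"),
    PurePosixPath("my-script.py"),
]

if __name__ == "__main__":
    print(sorted(top_level_modules(RECORD)))
```

In this sample, `my-script.py` is dropped because `my-script` is not a valid identifier, and the `.dist-info` directory is ignored outright.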
docker/common/import_check_skip.txt

Lines changed: 31 additions & 0 deletions
@@ -0,0 +1,31 @@
+# Copyright (c) 2026, NVIDIA CORPORATION. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+# Modules exempt from docker/common/import_check.py. A Dockerfile RUN executes
+# without a GPU, driver, or optional host libraries, so these modules raise at
+# import even though the package is installed correctly. Every entry MUST carry
+# a justification; the check fails on any import error that is not listed here.
+
+# Require an NVIDIA driver / libcuda.so.1, present only at runtime on a GPU host:
+deep_ep
+deep_ep_cpp
+hybrid_ep_cpp
+torch_tensorrt
+quack # imports nvidia-cutlass-dsl MLIR libs, which dlopen libcuda at import
+
+# Requires the libfuse system library (multi-storage-client FUSE backend; optional):
+mfusepy
+
+# Legacy DALI/MXNet helper script that imports mxnet, which is not installed:
+rec2idx
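
The skip file mixes full-line comments, inline comments, and blank lines. A small sketch of how such a file reduces to a set of module names, assuming the same split-on-`#` rule that `load_skip` applies: `parse_skip` is a hypothetical string-based variant, and `SAMPLE` is an abridged excerpt written for the demonstration.

```python
def parse_skip(text: str) -> set[str]:
    """Return the module names in a skip list, ignoring comments and blanks."""
    skip: set[str] = set()
    for line in text.splitlines():
        # Everything from the first '#' onward is a comment.
        token = line.split("#", 1)[0].strip()
        if token:
            skip.add(token)
    return skip


# Abridged sample in the same format as the skip file above.
SAMPLE = """\
# Require an NVIDIA driver:
deep_ep
quack  # imports CUDA libs at import

mfusepy
"""

if __name__ == "__main__":
    print(sorted(parse_skip(SAMPLE)))
```

Because the justification lives in a comment, it never reaches the parsed set. The rule that every entry must carry one is therefore a review convention, not something this parsing enforces.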
