Commit 1b7b013
[TRTLLM-11851][feat] MX adapter improvements: env-var fallback, query timeout, model_name plumbing
Three discrete improvements to the MX side of PR NVIDIA#13045, driven by
review feedback from the MX team's downstream PR
(chienchunhung/TensorRT-LLM #1). All three are orchestration-ergonomics
fixes, landed as one focused commit so reviewers see them as a clean
slice on top of the prototype.
(1) MODEL_EXPRESS_URL env-var fallback — at validator level
TorchLlmArgs.validate_mx_config now honors the upstream
``MODEL_EXPRESS_URL`` env var when ``checkpoint_format='MX'`` and
``mx_server_url`` is unset. Resolution happens at validator time so
the value ends up on ``llm_args.mx_server_url`` (visible to
logging, /startup_metrics, downstream code) instead of being
silently re-read from env by the loader.
Lets orchestrators (Dynamo) configure MX via the environment
without plumbing every CLI knob, while keeping resolution in one
place. Explicit ``mx_server_url=`` always wins. The env-var
fallback only fires when MX is the active checkpoint format
(so HF-only configs aren't surprised by an unrelated env var).
Empty string in env is treated as unset.
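The precedence rules above can be sketched as a small standalone function. This is an illustration only, not the code in ``validate_mx_config``: the helper name ``resolve_mx_server_url`` and its ``env`` parameter are hypothetical, and treating only ``None`` as "unset" for the explicit argument is an assumption.

```python
import os
from typing import Mapping, Optional


def resolve_mx_server_url(
    checkpoint_format: str,
    mx_server_url: Optional[str],
    env: Mapping[str, str] = os.environ,
) -> Optional[str]:
    """Hypothetical helper mirroring the fallback rules in this commit message."""
    # An explicit mx_server_url always wins over the environment.
    # (Assumption: only None counts as "unset" for the explicit argument.)
    if mx_server_url is not None:
        return mx_server_url
    # The fallback only fires when MX is the active checkpoint format,
    # so HF-only configs never pick up an unrelated env var.
    if checkpoint_format != "MX":
        return None
    # An empty string in the environment is treated as unset.
    return env.get("MODEL_EXPRESS_URL") or None
```

Passing the environment as a mapping keeps the sketch testable without mutating ``os.environ``.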
(2) MX_SOURCE_QUERY_TIMEOUT defensive default
MXCheckpointLoader.__init__ calls
``os.environ.setdefault("MX_SOURCE_QUERY_TIMEOUT", "30")`` whenever
an MX server URL is configured. Caps cold-cluster first-replica
startup at 30 s instead of upstream's 1-hour default (the polling
in MxLiveWeightLoader._query_source). setdefault semantics preserve
any explicit user value. HF-only loads (no MX URL) don't touch
the env at all.
The proper upstream-side fix is a non-blocking source-query API
(tracked as MX-4 in §15 of the design doc); this defensive default
caps the worst case until that lands.
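A minimal sketch of the ``setdefault`` behaviour described in (2). The function name ``apply_query_timeout_default`` and the injectable ``env`` mapping are hypothetical and exist only to make the example self-contained; the real call lives in ``MXCheckpointLoader.__init__``.

```python
import os
from typing import MutableMapping, Optional


def apply_query_timeout_default(
    mx_server_url: Optional[str],
    env: MutableMapping[str, str] = os.environ,
) -> None:
    """Hypothetical helper: cap the source-query timeout only for MX loads."""
    # HF-only loads (no MX URL) leave the environment untouched.
    if not mx_server_url:
        return
    # setdefault writes "30" only when the variable is absent, so any
    # explicit user value is preserved.
    env.setdefault("MX_SOURCE_QUERY_TIMEOUT", "30")
```

With an empty mapping and a configured URL the value becomes "30"; with a pre-existing value such as "120" the mapping is left unchanged.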
(3) model_name plumbing with HF-snapshot-aware resolver
Plumbs ``llm_args.model → MXCheckpointLoader(model_name=...)`` so
upstream's ``publish_model_params()`` publishes under the
user-supplied Hub ID (e.g. "Qwen/Qwen2.5-72B-Instruct") instead of
the "unknown" sentinel.
- MXCheckpointLoader takes a new optional ``model_name``
constructor arg (Union[str, Path]). Coerced to str at
construction time.
- publish_as_source() now sets BOTH MODEL_EXPRESS_URL and
MODEL_NAME env vars (resolving identity via the priority order
below) and restores both env vars in finally.
publish_model_params() reads them via env, as documented.
- Identity resolution order: explicit constructor arg →
MODEL_NAME env → checkpoint_dir basename (with HF-snapshot path
unmangling) → "unknown".
- HF cache layout (".../models--<org>--<name>/snapshots/<sha>/")
is unmangled back to "<org>/<name>" instead of returning the
commit hash.
- _construct_checkpoint_loader plumbs ``mx_model_name`` through;
py_executor_creator.py extracts it from llm_args.model.
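The identity resolution order and the HF-snapshot unmangling listed above can be sketched as follows. Both function names are hypothetical, as is the ``env`` parameter. Treating empty values as unset and splitting only on the first ``--`` are assumptions, not confirmed details of the commit.

```python
import os
from pathlib import Path
from typing import Mapping, Optional, Union


def unmangle_hf_snapshot(path: Union[str, Path]) -> Optional[str]:
    """Map '.../models--<org>--<name>/snapshots/<sha>' back to '<org>/<name>'."""
    parts = Path(path).parts
    for i, part in enumerate(parts[:-1]):
        if part.startswith("models--") and parts[i + 1] == "snapshots":
            mangled = part[len("models--"):]
            # Assumption: only the first "--" separates org from name.
            return "/".join(mangled.split("--", 1))
    return None


def resolve_model_name(
    model_name: Union[str, Path, None],
    checkpoint_dir: Union[str, Path, None],
    env: Mapping[str, str] = os.environ,
) -> str:
    """Hypothetical resolver following the priority order in this message."""
    # 1. Explicit constructor argument, coerced to str.
    if model_name:
        return str(model_name)
    # 2. MODEL_NAME from the environment (empty treated as unset here).
    env_name = env.get("MODEL_NAME")
    if env_name:
        return env_name
    # 3. checkpoint_dir basename, preferring the unmangled Hub ID so an
    #    HF cache path does not resolve to the commit hash.
    if checkpoint_dir:
        hub_id = unmangle_hf_snapshot(checkpoint_dir)
        if hub_id:
            return hub_id
        basename = Path(checkpoint_dir).name
        if basename:
            return basename
    # 4. Sentinel of last resort.
    return "unknown"
```

Checking the snapshot layout before falling back to the plain basename is what keeps the commit hash from being returned as the model identity.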
Both env-var dances (MODEL_EXPRESS_URL + MODEL_NAME) collapse into
one direct call when MX-2 (public build_identity) lands upstream.
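The set-then-restore pattern that ``publish_as_source()`` is described as using for both env vars can be sketched with a context manager. The name ``temporary_env`` is hypothetical, and the actual implementation may use an inline try/finally instead.

```python
import os
from contextlib import contextmanager
from typing import Dict, Iterator, Optional


@contextmanager
def temporary_env(overrides: Dict[str, str]) -> Iterator[None]:
    """Hypothetical helper: set env vars, then restore the previous state."""
    # Remember prior values; None marks a variable that was not set at all.
    saved: Dict[str, Optional[str]] = {k: os.environ.get(k) for k in overrides}
    os.environ.update(overrides)
    try:
        yield
    finally:
        for key, previous in saved.items():
            if previous is None:
                # The variable did not exist before, so remove it again.
                os.environ.pop(key, None)
            else:
                os.environ[key] = previous
```

A caller would wrap the publish step in ``with temporary_env({"MODEL_EXPRESS_URL": url, "MODEL_NAME": name}):`` so that both variables revert even if publishing raises.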
Tests for these three additions are in the next commit.
Signed-off-by: Chien-Chun Hung <2679986+chienchunhung@users.noreply.github.com>
Made-with: Cursor
1 parent 8ecfa78 commit 1b7b013
4 files changed
Lines changed: 153 additions & 17 deletions
File tree
- tensorrt_llm
- _torch
- models/checkpoints/mx
- pyexecutor
- llmapi
Lines changed: 126 additions & 15 deletions