ci: Update transformers to latest version 5.5.4 #1823
Open
svcnvidia-nemo-ci wants to merge 31 commits into main from
Conversation
feat: integrate NeMo-Run launcher for managed job submission (#1668)

* feat: integrate NeMo-Run launcher for managed job submission
  Rewrite the NeMo-Run launcher to use nemo_run.Script with inline torchrun commands instead of run.Partial, enabling remote execution via any NeMo-Run executor (Slurm, Kubernetes, local, etc.). Named executors are loaded from $NEMORUN_HOME/executors.py, and YAML overrides (nodes, devices, container_image, time, mounts, env_vars) are applied on top. nemo-run is an optional dependency — all imports are guarded behind try/except so the package is never required at import time.
* refactor: remove standalone nemorun example, add conversion guide in docs
  Replace the separate example YAML with an inline guide showing how to convert any existing config to use NeMo-Run by adding a nemo_run: block.
* fix: remove secret-like test values flagged by CI scanner
* docs: update container image to nvcr.io/nvidia/nemo-automodel:26.02
* feat: native Torchrun launcher, PatternPackager, and generic executor overrides
  - Use NeMo-Run's native Torchrun launcher instead of hand-rolled inline torchrun scripts. NeMo-Run now manages rendezvous, node ranks, and nproc-per-node automatically.
  - Ship the training config via PatternPackager so it is extracted to /nemo_run/code/ and available inside the container.
  - Replace hardcoded override fields (nodes, devices, partition, etc.) with a generic overrides dict — any YAML key not recognised as a launcher setting is applied to the executor via setattr. Dicts are merged, lists are extended, scalars are replaced.
  Verified end-to-end: Qwen3-MoE-30B fine-tuning completed successfully with W&B logging on a remote Slurm cluster.
* fix: use executor.nproc_per_node() for generic device count
  Use the base Executor.nproc_per_node() method instead of checking executor-specific attribute names. This works across all NeMo-Run backends (Slurm, Kubernetes, Docker, local).
* fix: narrow exception catch to NotImplementedError, fix docs to use native executor attrs
  - Catch only NotImplementedError in nproc_per_node() fallback so real configuration errors propagate instead of silently defaulting to 1.
  - Remove misleading 'devices' field from docs; all overrides use native executor attribute names (ntasks_per_node, gpus_per_node, etc.) via generic setattr.
* fix: use container_mounts instead of mounts in docs example
  The override key must match the executor's actual attribute name since apply_overrides uses setattr directly.
* fix: CI failures — unused import, spec None guard, toctree entry
  - Remove unused `tempfile` import (linting failure).
  - Guard against `spec_from_file_location` returning None (type check failure and review comment).
  - Add nemo-run.md to docs/index.md toctree (docs build failure).
* fix: also catch AttributeError in nproc_per_node fallback
  Handle executors that don't define nproc_per_node() at all (e.g. third-party or future backends).
* style: apply ruff-format to config.py and utils.py

Signed-off-by: Hemil Desai <hemild@nvidia.com>
Signed-off-by: NeMo Bot <nemo-bot@nvidia.com>
Co-authored-by: Hemil Desai <hemild@nvidia.com>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
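The generic executor-override mechanism above (unrecognised YAML keys applied via setattr, with dicts merged, lists extended, and scalars replaced) can be sketched as follows. This is an illustrative reconstruction, not the repository's code; the function name `apply_overrides` and the plain-attribute executor are assumptions.

```python
def apply_overrides(executor, overrides: dict) -> None:
    """Apply YAML override values onto an executor via setattr.

    Merge semantics as described in the commit message:
    dicts are merged, lists are extended, scalars are replaced.
    """
    for key, value in overrides.items():
        current = getattr(executor, key, None)
        if isinstance(current, dict) and isinstance(value, dict):
            setattr(executor, key, {**current, **value})
        elif isinstance(current, list) and isinstance(value, list):
            setattr(executor, key, current + value)
        else:
            setattr(executor, key, value)


class _Executor:  # hypothetical stand-in for a NeMo-Run executor
    env_vars = {"WANDB_MODE": "online"}
    container_mounts = ["/data"]
    time = "01:00:00"


ex = _Executor()
apply_overrides(ex, {"env_vars": {"HF_HOME": "/cache"},
                     "container_mounts": ["/ckpt"],
                     "time": "04:00:00"})
```

Note that because setattr is applied directly, override keys must match the executor's real attribute names (e.g. `container_mounts`, not `mounts`), which is exactly the docs fix recorded above.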
fix: mute warning spam (#1721)

* structure plan logging
* mute pydantic warnings
* mute pydantic warnings and rank > 0 if started with torchrun
* mute pydantic warnings and rank > 0 if started with torchrun
* mute warnings
* Update parallelizer.py

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>
Signed-off-by: NeMo Bot <nemo-bot@nvidia.com>
Co-authored-by: Alexandros Koumparoulis <153118171+akoumpa@users.noreply.github.com>
fix: handle dict-typed chat_template in format_chat_template (#1696)

Some tokenizers (e.g. Cohere Command R) store multiple named chat templates as a dict instead of a single string. The regex checks on tokenizer.chat_template assumed a string, causing TypeError. Resolve the template string from the dict (using "default" key) before performing regex searches.

Signed-off-by: adil-a <adil.asif2000@hotmail.com>
Signed-off-by: NeMo Bot <nemo-bot@nvidia.com>
Co-authored-by: Adil <47084919+adil-a@users.noreply.github.com>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
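The fix above resolves the template string before any regex inspection. A minimal sketch of that resolution step (the helper name is hypothetical; the `"default"` key is the one named in the commit):

```python
import re


def resolve_chat_template(chat_template):
    """Return a single template string from tokenizer.chat_template.

    Some tokenizers (e.g. Cohere Command R) store multiple named chat
    templates as a dict; pick the "default" entry before any regex
    checks so string operations don't hit a TypeError.
    """
    if isinstance(chat_template, dict):
        chat_template = chat_template.get("default")
    return chat_template


# Regex checks can then safely operate on a plain string:
template = resolve_chat_template({"default": "{{ messages }}", "tool_use": "..."})
assert re.search(r"messages", template) is not None
```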
feat: adding lora to diffusion (#1653)

* update
* update
* update
* Fix tests and linting
* fmt
* use am peft
* delete lora component and use _peft component instead

Signed-off-by: Pranav Prashant Thombre <pthombre@nvidia.com>
Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>
Signed-off-by: NeMo Bot <nemo-bot@nvidia.com>
Co-authored-by: linnan wang <linnanw@nvidia.com>
Co-authored-by: Pranav Prashant Thombre <pthombre@nvidia.com>
Co-authored-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>
Co-authored-by: Alexandros Koumparoulis <153118171+akoumpa@users.noreply.github.com>
…1707) feat: MoE model benchmarks, LoRA configs, and flops calculators (#1676)

* feat: add benchmark configs for top MoE models and fix VL model support
  Add TE+DeepEP benchmark configs for GLM-4.7-Flash, GLM-4.7, GLM-5, MiniMax-M2.5, Mistral Small 4, Qwen3.5 MoE, Qwen3 VL 235B, and Step3.5-Flash. Fix VL composite config handling in benchmark recipe, flops utils, and pipeline sharding to fall back to text_config for models where vocab_size, hidden_size, and num_hidden_layers are nested under text_config (e.g. Qwen3.5 MoE, Qwen3 VL 235B). Add support for custom config classes (e.g. DeepseekV32Config) in benchmark vocab_size inference by respecting the _target_ field.
* feat: add LoRA benchmark configs for GLM-4.7-Flash and Qwen3.5 MoE
  Add TE+DeepEP LoRA benchmark configs with moe_rank_scaling enabled for GLM-4.7-Flash (30B) and Qwen3.5 MoE (35B-A3B).
* feat: add renamed LoRA benchmark configs for MoE models
* feat: add LoRA configs for Qwen3.5 MoE and GLM-4.7-Flash
  Rename te_deepep_lora configs to _lora for consistency.
* perf: tune gptoss-120b benchmark config for hybridep with TE attention
  Switch to TE attention backend, hybridep dispatcher with 64 SMs, torch_mm experts, disable activation checkpointing & reshard_after_forward, and reduce local batch size to 1 for better memory/perf profile.
* fix: improve Kimi VL model robustness and MoE parallelizer defaults
  - Fix Kimi K2.5 VL model to handle missing vision_config attributes by falling back to vt_* prefixed and mm_hidden_size attributes
  - Enable reshard_after_forward for GLM-4.7-Flash LoRA config
  - Use bf16 reduce_dtype in MoE parallelizer FSDP mixed precision policy
* ci: update secrets baseline for nemotron_super_v3_lora.yaml false positive
  The HuggingFace model ID nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-BF16 triggers the Base64 High Entropy String detector as a false positive.
* Revert "fix: improve Kimi VL model robustness and MoE parallelizer defaults"
  This reverts commit a185e34.
* test: add unit tests for new FLOPs calculators and update nemotronh values
  Add 38 new tests covering:
  - minimax_m2_flops (basic, gbs scaling, MTP, precomputed)
  - qwen3_5_flops MoE and Dense variants (GDN/full attention hybrid)
  - mla_moe_flops (Kimi K2, GLM-5, Mistral Small 4)
  - step3_5_flash_flops (hybrid full/SWA + MoE)
  - deepseekv3_flops DSA sparse attention extension
  - _mamba_layer_flops refactored formula
  - _hybrid_model_flops conditional accumulation
  - VL composite config text_config fallback in qwen3_flops and qwen3_5_flops
  - get_flops_formula_for_hf_config dispatch for new model types
  Update nemotronh precomputed values to match refactored Mamba scan formula.
* docs: add copyright headers, recipe declaration, and run instructions to benchmark configs
  Add Apache 2.0 copyright header, recipe: BenchmarkingRecipeForNextTokenPrediction declaration, and automodel run instructions to all 21 new benchmark configs.
* ci: update secrets baseline line number after copyright header addition
* fix: address claude-bot review — rsplit bug, duplicate trust_remote_code, tests
  - Fix _infer_vocab_size string _target_ import: use rsplit to correctly extract module_path and class_name instead of discarding class name
  - Remove duplicate trust_remote_code under config: section in 5 YAML configs (step_3.5_flash, qwen3.5_moe, qwen3.5_moe_te_deepep, qwen3.5_moe_lora, step35flash_lora)
  - Add unit tests for string _target_ and VL text_config fallback paths in _infer_vocab_size
* fix: address round-2 claude-bot review comments
  - Use from_pretrained() instead of direct constructor call for callable config targets in _infer_vocab_size
  - Guard mla_moe_flops VL text_config fallback with num_hidden_layers check, consistent with qwen3_flops and qwen3_5_flops patterns
* test: add VL composite config fallback test for _precompute_stage_shapes
  Verify that VL configs without hidden_size at root level correctly fall back to text_config for pipeline stage shape precomputation.
* fix: respect moe_layer_freq as frequency in _build_moe_layer_pattern
  When moe_layer_freq is an integer, use it as a frequency (every Nth layer is MoE) rather than marking all post-dense layers as MoE.
* Update nemo_automodel/recipes/llm/benchmark.py
* fix: restore _infer_vocab_size callable/string config_target branches
  The previous commit accidentally broke _infer_vocab_size by removing the callable config_target branch and introducing duplicate code with bad indentation, causing a SyntaxError. Restore the correct if/elif structure.

Signed-off-by: hemildesai <hemild@nvidia.com>
Signed-off-by: HuiyingLi <willwin.lee@gmail.com>
Signed-off-by: NeMo Bot <nemo-bot@nvidia.com>
Co-authored-by: Hemil Desai <hemild@nvidia.com>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-authored-by: HuiyingLi <willwin.lee@gmail.com>
Co-authored-by: Alexandros Koumparoulis <153118171+akoumpa@users.noreply.github.com>
Co-authored-by: claude[bot] <209825114+claude[bot]@users.noreply.github.com>
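The _build_moe_layer_pattern fix (an integer moe_layer_freq means "every Nth layer is MoE", not "everything after the dense layers") can be sketched roughly as below. This is a guess at the shape of the function, not the repository's code; in particular the offset convention (which layer index counts as the first MoE layer) is an assumption.

```python
def build_moe_layer_pattern(num_layers: int, moe_layer_freq) -> list:
    """Return a per-layer flag list where True marks an MoE layer.

    An integer moe_layer_freq is treated as a frequency: every Nth
    layer is MoE (assumed here: indices where (i + 1) % freq == 0).
    A list/sequence is taken verbatim as an explicit per-layer pattern.
    """
    if isinstance(moe_layer_freq, int):
        return [(i + 1) % moe_layer_freq == 0 for i in range(num_layers)]
    return [bool(x) for x in moe_layer_freq]
```

For example, with 4 layers and a frequency of 2, only layers 1 and 3 would be MoE, instead of every layer past the first dense block.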
fix: fixing the pooling error for non-llama models for biencoder training (#1645)

* fix pooling error for non-llama models
* small fix in the biencoder yaml
* set add_eos_token to false

Signed-off-by: NeMo Bot <nemo-bot@nvidia.com>
Co-authored-by: rnyak <16246900+rnyak@users.noreply.github.com>
ci: Address timeout in CI tests (#1733)

Address timeout in CI tests

Signed-off-by: Dong Hyuk Chang <9426164+thomasdhc@users.noreply.github.com>
Signed-off-by: NeMo Bot <nemo-bot@nvidia.com>
Co-authored-by: Dong Hyuk Chang <9426164+thomasdhc@users.noreply.github.com>
…ess_group() (1730)` into `r0.4.0` (#1739) test: Checkpoint robustness skips atexit-registered destroy_process_group() (#1730)

* Explicitly tear down in checkpoint robustness
* Skip atexit-registered destroy_process_group

Signed-off-by: Dong Hyuk Chang <9426164+thomasdhc@users.noreply.github.com>
Signed-off-by: NeMo Bot <nemo-bot@nvidia.com>
Co-authored-by: Dong Hyuk Chang <9426164+thomasdhc@users.noreply.github.com>
fix: Qwen3.5 dense CP support and FSDP mixed-dtype fix (#1710)

* feat: add neat packing (greedy knapsack) for LLM and VLM datasets
  Implement sequence packing via min-heap first-fit-decreasing knapsack for both LLM and VLM datasets, with indexed attention masks and flash attention support. Includes unit tests and benchmarks.
* feat: add LengthGroupedSampler for token-aware distributed sampling
  Sort samples by estimated token length (text + media) and shuffle within buckets to keep batch-internal lengths similar, reducing padding waste. Includes accurate image/video token count estimation via smart_resize and comprehensive test suite.
* feat: integrate neat packing strategy into LLM finetune recipe
  Add packing_strategy config field ("neat" or "thd") to select between greedy knapsack packing and existing THD packing in the LLM recipe.
* chore: remove benchmark scripts not needed for this PR
* fix: lint errors and broken sampler tests
  Remove unused import and variable in neat_packing_vlm.py. Fix 13 sampler tests that referenced non-existent bucket_size and shuffle_bucket_size parameters.
* style: fix all ruff lint errors across changed files
  Sort imports, remove unused imports/variables, fix f-strings without placeholders, rename ambiguous variable name.
* style: run ruff format on all changed source and test files
* style: add missing copyright headers to test files
* feat: add meta-dataset loading system with ShareGPT format support
  Implement LLaMA-Factory style meta JSON dataset loading with support for multiple dataset composition, sampling ratios, ShareGPT format conversion, LMDB image storage, video frame reading via decord, media preloading, and cross-rank data sharding.
* feat: add RobustDatasetWrapper with retry and fake image injection
  RobustDatasetWrapper provides data loading error retry, media preloading, and fake image injection to prevent FSDP/Zero3 hangs on pure-text batches. PreTokenizedDatasetWrapper supports per-sample tokenization in DataLoader workers with overlong sample detection.
* enhance: refactor label building with template-based approach
  Replace BPE context-sensitive pattern matching with token ID-level scanning (build_labels_from_template) for reliable assistant turn detection. Remove qwen2_5 dependency on qwen_vl_utils. Add per-sample media counts (n_images_per_sample/n_videos_per_sample) to collate output for precise PP chunking. Replace truncation with pre-filtering via _drop_overlong_samples. Use decord as video backend globally.
* refactor: simplify video timestamp handling with VideoMetadata
  Replace the manual _fix_video_timestamps regex approach with _build_video_metadata that passes metadata directly to the processor. Also adds second_per_grid_ts to output keys.
* feat: add precompute_tokens script for offline tokenization
  Offline parallel tokenization tool that writes _text_tokens counts to dataset samples, enabling LengthGroupedSampler to use exact token counts instead of heuristic estimation.
* feat: wire up configure_packing and attn-aware collaters for neat packing
  Wire up configure_packing and attn-aware collaters into both LLM and VLM recipes so neat packing correctly enforces per-document attention boundaries with flash_attention_2 and SDPA.
  Changes:
  - neat_packed_collater: accept attn_implementation param, keep 2D indexed mask for flash, 4D bool block-causal mask for SDPA
  - configure_packing: patch create_causal_mask in qwen2/qwen2_5_vl/qwen2_vl/qwen3_vl/qwen3_vl_moe modules via importlib loop
  - LLM recipe: call configure_packing when packing_strategy=neat, detect attn backend from cfg_model (backend.attn or attn_implementation)
  - VLM recipe: add pretokenize + packing path to build_dataloader with cfg_model param, same attn detection logic
  - Add 3 example recipes: LLM neat packing, VLM 4B neat packing, VLM MoE 30B neat packing
  Tested:
  - VLM Qwen3-VL-4B flash: 4.19 -> 1.47 -> 0.49
  - VLM Qwen3-VL-4B sdpa: 4.19 -> 1.47 -> 0.49
  - VLM Qwen3-VL-30B MoE flash: 1.76 -> 0.41 -> 0.10
  - LLM Qwen2.5-0.5B flash+force_hf: 3.72 -> ... -> 2.84
* refactor: move VLM packing config to top-level packed_sequence section
  Move packing configuration from nested dataset.packing to a top-level packed_sequence: section, matching the LLM recipe pattern. This decouples dataset definition from packing strategy. The VLM recipe's build_dataloader now accepts cfg_ps and reads packing config from there first, falling back to legacy dataset.packing for backward compatibility.
  Additional fixes from merge:
  - Fix stale build_labels() call in collate_fns.py (merge artifact)
  - Revert phi4/kimi collate to use build_labels (not in _IMSTART allowlist)
  - Comment out decord2 monkey-patch (user removed it for torchcodec testing)
  - Add TODO on _PACKING_PATCH_MODULES about generality
* Switch VLM neat packing example to MedPix-VQA dataset with 8k seqlen
  Use HF dataset (mmoukouba/MedPix-VQA) instead of local mockdata to demonstrate packed_sequence working with standard HF datasets. Increase pack_size/max_length to 8192 for real image samples.
* refactor: deduplicate robust_collate into make_robust_collate
  Extract the duplicated collate retry logic from PreTokenizedDatasetWrapper and RobustDatasetWrapper into a shared make_robust_collate() function in collate_fns.py. Both classes now delegate to it.
* refactor: move media I/O helpers from datasets.py to utils.py
  Move _resolve_lmdb_image, _read_video_frames, _preload_media, and _build_video_metadata to vlm/utils.py. These are generic media utilities not tied to any specific dataset.
* cleanup: move random import to module level, allow pretokenize without packing
  - Move `import random` from inside make_robust_collate to module-level import in collate_fns.py
  - Read pretokenize/max_length from cfg_ps regardless of pack_size, enabling pretokenize-only mode without packing
* cleanup: remove verbose comments from packing recipe yamls
* feat: add Qwen3.5-4B VLM neat packing recipe
  Tested with 8 GPUs, 8k pack_size, MedPix-VQA dataset. Requires transformers >= 5.3.0 for Qwen3.5 support.
* fix: add qwen3_5 to packing patch modules and fix missing import
  - Add transformers.models.qwen3_5.modeling_qwen3_5 to packing patch list so create_causal_mask is patched for Qwen3.5 dense models
  - Fix _passthrough_create_causal_mask signature to accept both input_embeds and inputs_embeds (HF 5.3.0 uses inputs_embeds)
  - Import _lmdb_env_cache from utils.py in datasets.py (missed in earlier media helpers refactor)
* remove LLM recipe from VLM data pipeline PR
  This LLM recipe doesn't belong in the VLM packing PR.
* fix: update test imports after media helpers move to utils.py
  Update test_datasets.py to import _read_video_frames and _preload_media from vlm/utils.py instead of vlm/datasets.py.
* test: add unit tests for packing, utils, and collate changes
  New test files:
  - test_utils.py: _resolve_lmdb_image (cache, missing key, RGB), _build_video_metadata (empty, no video, preserved fields)
  - test_packing.py: get_seqlens_in_batch, get_unpad_data, _passthrough_create_causal_mask (both HF signatures), get_attn_implementation (backend vs HF config), configure_packing (noop for sdpa, patches FA2 modules)
  Extended test_collate_fns.py:
  - make_robust_collate (success, retry, max_retries exhausted)
  - neat_packed_vlm_collater attn_implementation variants (2D mask for FA2, 4D for sdpa, fixed max_length, pixel_values concat)
* fix: lint errors and missing copyright headers
  - ruff fix: remove unused imports (copy, BaseVideoProcessor, load_video, as_completed), unused variables (grid_idx, total_text_tokens, total_media_tokens), fix import ordering
  - Add copyright headers to scripts/precompute_tokens.py and tests/test_meta_dataset_all.py
* style: ruff format on all changed files
* fix: rename test_utils.py to avoid pytest collection conflict
  tests/unit_tests/datasets/test_utils.py already exists; having test_utils.py in the vlm/ subdirectory causes a module name collision.
* fix: configure cfg_ds.get defaults in build_dataloader tests
  MagicMock().get() returns a truthy MagicMock by default, which incorrectly triggers the pretokenize path. Configure side_effect to return proper defaults for packing-related keys.
* fix: make packing mask patch safe for non-packed forward passes
  _passthrough_create_causal_mask now checks whether the attention mask is actually a packed mask (4D or indexed with values > 1) before returning it as-is. For normal 2D masks (standard training), it delegates to the original HF create_causal_mask, preventing test pollution where the monkey-patch breaks non-packed Qwen2 tests.
* fix: passthrough causal mask for FA2 to avoid breaking validation
  The previous logic delegated all non-packed 2D masks to HF's create_causal_mask, which produced a mask incompatible with flash_attention_2 during validation. FA2 handles causal masking internally, so always pass through. Delegation to HF is now limited to non-FA2 backends (sdpa/eager) where it is needed.
* fix: address code review feedback from claude[bot]
  - Fix wrong import: _resolve_lmdb_image lives in utils.py not datasets.py
  - Assign unused sum() results to variables in dataset timing summary
  - Fix fake_indices bug: _drop_overlong_samples now returns kept indices so callers can filter examples in sync with conversations
* fix: log actual processor type before falling back to default
* fix: remove unused sum() variables flagged by ruff F841
* feat: add validation and max_steps to VLM packing recipes
  - Add validation_dataset (MedPix-VQA) and validation_dataloader to qwen3_vl_4b and qwen3_vl_moe_30b recipes
  - Add max_steps: 100 to both recipes
  - Switch MoE recipe from mockdata to MedPix-VQA with pack_size 8192
* fix: enable checkpoint with safetensors in qwen3_vl_4b recipe
* feat: enable Qwen3.5 dense CP support and fix FSDP mixed-dtype wrapping
  PR #1631 added _restore_loaded_model_dtype which restores checkpoint dtypes after loading. For Qwen3.5 dense, this puts A_log and norm back to float32 while everything else is bfloat16, breaking FSDP2 which requires uniform dtype per group.
  Fix by adding Qwen3_5ParallelizationStrategy that:
  - Moves float32 bare params (A_log) into a _fp32_params submodule so fully_shard_by_dtype can wrap them in a separate FSDP group
  - When CP>1, swaps HF Qwen3_5GatedDeltaNet to CPAwareGatedDeltaNet (reusing the existing MoE CP implementation) and sets _cp_mesh
  - Adds a decoder-layer pre-hook to pass position_ids to linear_attn (HF decoder layers don't forward it, but CP needs it)
  Tested: CP=1 and CP=2 losses match (2.8484 vs 2.8484 step 0, 2.4787 vs 2.4766 step 1, val 2.9792 vs 2.9786).
* style: ruff format + add unit tests for Qwen3.5 CP/FSDP patching
* fix: address Claude review - test calls real patch_hf_model, clarify thread-safety comment
  - Rewrote test_fp32_params_moved_to_holder to call the real patch_hf_model function instead of replicating its logic, by monkeypatching the transformers module stubs so isinstance() matches _FakeGatedDeltaNet.
  - Clarified the thread-safety comment on the globals() swap in Qwen3_5ParallelizationStrategy.parallelize.
* fix: address round-2 review - test_no_class_swap calls real patch_hf_model, add defensive assertion
  - Updated test_no_class_swap_when_cp_disabled to call patch_hf_model with stubs instead of trivially asserting the fake type.
  - Added defensive assertion that apply_fsdp2_sharding_recursively exists before the globals() swap in Qwen3_5ParallelizationStrategy.
* test: add test_class_swap_when_cp_enabled for patch_hf_model cp_enabled=True path
* test: add coverage for Qwen3_5ParallelizationStrategy.parallelize()
  Add tests for the parallelize() method covering:
  - patch_hf_model call and delegation to super()
  - globals swap and restore of apply_fsdp2_sharding_recursively
  - global restore on error (try/finally)
  - CP mesh assignment when cp_enabled=True
  - _fsdp_by_dtype ModuleList iteration with fully_shard_by_dtype

Signed-off-by: HuiyingLi <willwin.lee@gmail.com>
Signed-off-by: NeMo Bot <nemo-bot@nvidia.com>
Co-authored-by: Huiying <willwin.lee@gmail.com>
Co-authored-by: zhiqil <zhiqil@nvidia.com>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
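The "min-heap first-fit-decreasing knapsack" packing this PR introduces can be sketched as follows. This is an illustrative reconstruction under stated assumptions, not the repository's implementation: the function name is hypothetical, and the bin-selection rule (try the least-loaded open bin first; open a new bin if even that one cannot fit the sample) is one common heap-based variant of first-fit-decreasing.

```python
import heapq


def pack_sequences(lengths, pack_size):
    """Greedy knapsack packing of sample lengths into bins of pack_size.

    Sort sample indices by length (decreasing), keep a min-heap of
    (current load, bin id), and place each sample into the least-loaded
    open bin that still fits, opening a new bin otherwise.
    Returns a list of bins, each a list of sample indices.
    """
    order = sorted(range(len(lengths)), key=lambda i: lengths[i], reverse=True)
    heap = []   # (current load, bin id) — min-heap keyed on load
    bins = []   # bins[b] is the list of sample indices packed into b
    for i in order:
        if heap and heap[0][0] + lengths[i] <= pack_size:
            load, b = heapq.heappop(heap)
            bins[b].append(i)
            heapq.heappush(heap, (load + lengths[i], b))
        else:
            bins.append([i])
            heapq.heappush(heap, (lengths[i], len(bins) - 1))
    return bins
```

Each packed bin then needs per-document attention boundaries (the indexed 2D mask for flash attention, or a 4D block-causal mask for SDPA, per the collater changes above) so tokens cannot attend across document boundaries inside a pack.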
feat: Add lora recipes for gemma4 (#1731) Add lora recipes for gemma4 Signed-off-by: Abhishree Thittenamane <athittenaman@cw-dfw-cs-001-login-02.cm.cluster> Signed-off-by: NeMo Bot <nemo-bot@nvidia.com> Co-authored-by: Abhishree Thittenamane <47577437+athitten@users.noreply.github.com> Co-authored-by: Abhishree Thittenamane <athittenaman@cw-dfw-cs-001-login-02.cm.cluster>
test: add vLLM deployment tests for checkpoint robustness (#1656)

* test: add vLLM deployment tests for checkpoint robustness
  vLLM deployment verification tests that load consolidated checkpoints and compare greedy output token-for-token against HuggingFace. Supports both full comparison and smoke test mode. Depends on checkpoint robustness PR #1606.
* Create deploy-test dependency group
* Revert deploy test group
* Move configs to recipes and create vllm_launcher
* Setup deploy environment
* Remove duplicate keys
* Add scope to vllm deploy test
* Drop needs dependency
* Use finetune test name for ckpt dir
* Make ckpt checking more robust
* Pass arguments correctly
* Update arguments
* Remove unused file

Signed-off-by: adil-a <adil.asif2000@hotmail.com>
Signed-off-by: Dong Hyuk Chang <9426164+thomasdhc@users.noreply.github.com>
Signed-off-by: NeMo Bot <nemo-bot@nvidia.com>
Co-authored-by: Adil <47084919+adil-a@users.noreply.github.com>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-authored-by: Dong Hyuk Chang <9426164+thomasdhc@users.noreply.github.com>
Co-authored-by: Alexandros Koumparoulis <153118171+akoumpa@users.noreply.github.com>
…1729)` into `r0.4.0` (#1756) build: drop rc0 pre-release tag and add dynamic git versioning (#1729)

* build: drop rc0 pre-release tag and add dynamic git versioning
  Append +<short-sha> to __version__ at import time using only the git binary. Falls back silently if git is unavailable or not in a repo. Set NO_VCS_VERSION=1 to opt out (e.g. for release builds).
* ci: pin build-test-publish-wheel to FW-CI-templates@7a6fd6d
  Temporarily pins to the commit that sets NO_VCS_VERSION=1 in the build step, fixing the sdist/wheel version mismatch introduced by dynamic git versioning. Will be replaced with a version tag once FW-CI-templates NVIDIA-NeMo/FW-CI-templates#443 is released.
* style: add blank line before if block (ruff E303)
* fix: suppress I001 import-sorting false positive in package_info.py
* ci: pin FW-CI-templates to v0.88.1

Signed-off-by: oliver könig <okoenig@nvidia.com>
Signed-off-by: NeMo Bot <nemo-bot@nvidia.com>
Co-authored-by: oliver könig <okoenig@nvidia.com>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
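The dynamic-versioning behaviour described above (append +<short-sha> using only the git binary, fall back silently, honour NO_VCS_VERSION=1) might look roughly like this sketch; the function name and exact git invocation are assumptions.

```python
import os
import subprocess


def vcs_version(base_version: str) -> str:
    """Append '+<short-sha>' to base_version using only the git binary.

    Falls back silently to base_version when git is unavailable, the
    working directory is not a repository, or NO_VCS_VERSION=1 is set
    (the opt-out used for release builds).
    """
    if os.environ.get("NO_VCS_VERSION") == "1":
        return base_version
    try:
        sha = subprocess.run(
            ["git", "rev-parse", "--short", "HEAD"],
            capture_output=True, text=True, check=True,
        ).stdout.strip()
    except (OSError, subprocess.CalledProcessError):
        return base_version  # no git, or not inside a repo
    return f"{base_version}+{sha}" if sha else base_version
```

The local-version suffix (`+<sha>`) follows PEP 440, which is why the build step must set NO_VCS_VERSION=1 to keep sdist and wheel version strings identical.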
…to `r0.4.0` (#1754) fix: Baichuan2 checkpoint robustness test CI failures (#1727)

* fix: checkpoint robustness test CI failures
  - Add trust_remote_code: true to baichuan ci.checkpoint_robustness
  - Add hf_device_map_auto: true to nemotron nano configs
  - Bump robustness global_batch_size 16→32 for multi-node compatibility
  - Remove hardcoded trust_remote_code=False that broke tokenizer loading
  - Fix dotted keys in ci.checkpoint_robustness being silently ignored (e.g. distributed.tp_size, dataset.limit_dataset_samples)
* fix: Baichuan2 checkpoint robustness test CI failures
  - Register MLP-only TP plan for BaichuanForCausalLM (NormHead is not nn.Linear, W_pack has non-interleaved QKV layout — both incompatible with ColwiseParallel)
  - Fix HF remote code meta-tensor issue: RotaryEmbedding creates inv_freq/cos_cached/sin_cached as plain attributes that stay on meta device; added _fix_meta_rotary_embeddings helper for Phase 4
  - Set appropriate KL/loss thresholds for Baichuan2 with TP=2
* fix: Baichuan2 PEFT checkpoint robustness test CI failures
  - Apply _fix_meta_rotary_embeddings to PEFT base model loading path
  - Add KL/loss thresholds to baichuan_2_7b_squad_peft.yaml CI config
* fix: remove unused cross-TP/resume settings from Baichuan2 PEFT config
  Cross-TP and resume assertion are skipped for PEFT models in the test.
* fix: add gc.collect() before torch.cuda.empty_cache() in checkpoint robustness test
  FSDP2/DTensor circular references prevented GPU memory from being freed between test phases, causing OOM on large models (e.g. Nemotron Super 120B) when Phase 4 tries to reload via vanilla HF with device_map="auto".
* fix: PEFT checkpoint restore for MoE models with activation checkpointing
  - Strip _checkpoint_wrapped_module. from FQNs in _get_peft_state_dict and _set_peft_state_dict to match DCP's normalization. Without this, expert LoRA weights are silently skipped on reload when activation checkpointing is enabled (keys mismatch), causing KL divergence of ~0.5.
  - Wire up no_check_hf flag to skip Phase 4 vanilla HF check when configured
  - Qwen3 MoE 30B LoRA: reduce to 1 node, add no_check_hf
* fix: Qwen3 MoE PEFT adapter HF compatibility via ParamWrapper format
  Save Qwen3 MoE expert LoRA adapters in PEFT v0.18+ ParamWrapper format so PeftModel.from_pretrained() can load them directly. Previously, adapters were saved with per-expert individual keys (experts.0.gate_proj.lora_A.weight) which vanilla HF couldn't load because Qwen3 MoE uses fused nn.Parameter tensors (experts.gate_up_proj), not individual nn.Module per expert. The new format (default, v4_compatible=False) uses target_parameters in adapter_config.json and 2D fused LoRA tensors matching ParamWrapper's expected key layout. Legacy per-expert format is preserved when v4_compatible=True.
  Also: reduce Qwen3 MoE CI from 2 nodes to 1, remove dead no_check_hf parsing from test, clean up _extract_target_modules helpers.
* fix: remove debug print statement from checkpoint robustness test

Signed-off-by: adil-a <adil.asif2000@hotmail.com>
Signed-off-by: NeMo Bot <nemo-bot@nvidia.com>
Co-authored-by: Adil <47084919+adil-a@users.noreply.github.com>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
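The FQN normalization described in the PEFT restore fix (stripping the activation-checkpointing wrapper segment so saved keys match DCP's normalized names) is mechanically simple; a sketch, with the helper name assumed:

```python
# Segment torch's activation checkpointing wrapper injects into
# parameter names; DCP normalizes it away, so PEFT save/load must too.
_AC_PREFIX = "_checkpoint_wrapped_module."


def normalize_fqns(state_dict: dict) -> dict:
    """Strip activation-checkpointing wrapper segments from FQNs.

    Without this, keys like
      layers.0._checkpoint_wrapped_module.mlp.experts.lora_A.weight
    fail to match the normalized checkpoint keys on reload, and the
    affected weights are silently skipped.
    """
    return {k.replace(_AC_PREFIX, ""): v for k, v in state_dict.items()}
```

This is the kind of silent mismatch the commit describes: nothing errors, the adapter simply loads without its expert LoRA weights and the KL check catches the divergence.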
…1758) ci: Address container and source code CVE (#1753)

* Address container and source code CVE
* Update uv lock

Signed-off-by: Dong Hyuk Chang <9426164+thomasdhc@users.noreply.github.com>
Signed-off-by: NeMo Bot <nemo-bot@nvidia.com>
Co-authored-by: Dong Hyuk Chang <9426164+thomasdhc@users.noreply.github.com>
Co-authored-by: NeMo Bot <nemo-bot@nvidia.com>
….4.0` (#1759) ci: Update test timeout and add ci_tests readme (#1752)

* Update test timeout and add ci_tests readme
* Space out the finetune logging

Signed-off-by: Dong Hyuk Chang <9426164+thomasdhc@users.noreply.github.com>
Signed-off-by: NeMo Bot <nemo-bot@nvidia.com>
Co-authored-by: Dong Hyuk Chang <9426164+thomasdhc@users.noreply.github.com>
fix: Update lora configs for gemma4 (#1748) Update lora configs for gemma4 Signed-off-by: Abhishree Thittenamane <athittenaman@cw-dfw-cs-001-login-02.cm.cluster> Signed-off-by: NeMo Bot <nemo-bot@nvidia.com> Co-authored-by: Abhishree Thittenamane <47577437+athitten@users.noreply.github.com> Co-authored-by: Abhishree Thittenamane <athittenaman@cw-dfw-cs-001-login-02.cm.cluster>
…`r0.4.0` (#1772) fix: prevent launcher option from being consumed as a config override (#1766)

--nproc-per-node goes to the app, not the YAML config.

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>
Signed-off-by: NeMo Bot <nemo-bot@nvidia.com>
Co-authored-by: Alexandros Koumparoulis <153118171+akoumpa@users.noreply.github.com>
…0` (#1771) fix: skip embedding[padding_idx] = 0 with TP (#1675) * skip embedding[padding_idx] = 0 * fix * remove code --------- Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com> Signed-off-by: NeMo Bot <nemo-bot@nvidia.com> Co-authored-by: Alexandros Koumparoulis <153118171+akoumpa@users.noreply.github.com>
ci: add missing recipe owners (#1775) * add missing owners to recipes for CI notifications. --------- Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com> Signed-off-by: NeMo Bot <nemo-bot@nvidia.com> Co-authored-by: Alexandros Koumparoulis <153118171+akoumpa@users.noreply.github.com>
ci: Resolve CVE and remove uv cache (#1774)

* Resolve CVE and remove uv cache
* Update uv lock

Signed-off-by: Dong Hyuk Chang <9426164+thomasdhc@users.noreply.github.com>
Signed-off-by: NeMo Bot <nemo-bot@nvidia.com>
Co-authored-by: Dong Hyuk Chang <9426164+thomasdhc@users.noreply.github.com>
Co-authored-by: NeMo Bot <nemo-bot@nvidia.com>
feat: FSDP2 w/ weight prefetching and async TP optimization (#1711) * feat: FSDP2 w/ weight prefetching and async TP optimization * remove deferred rs feature * add datapoints * lint * fix unit tests * address claude review * remove invalid tests and better readability * skip unused fsdp flag * Apply suggestions from code review * refactor: use nn.Module.compile() and consolidate compile paths in infrastructure * refactor: remove fsdp_layer_group_size flag * derive pp_enabled * lint * update cp and fix * lint * update perf * update * update perf * fix test --------- Signed-off-by: Zhiyu Li <zhiyul@NVIDIA.com> Co-authored-by: Alexandros Koumparoulis <153118171+akoumpa@users.noreply.github.com> Co-authored-by: jgerh <163925524+jgerh@users.noreply.github.com> Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
fix: update yamls for vllm_deploy (#1780) * update configuration * ensure checkpoint is saved as consolidated safetensors * fix ci --------- Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com> Signed-off-by: NeMo Bot <nemo-bot@nvidia.com> Co-authored-by: Alexandros Koumparoulis <153118171+akoumpa@users.noreply.github.com>
…gs (1791)` into `r0.4.0` (#1792) docs: Add nightly CI test summary for LLM and VLM finetune configs (#1791) * Add nightly ci test summary * Visual updates to summary * fix: Visual improvements --------- Signed-off-by: Dong Hyuk Chang <9426164+thomasdhc@users.noreply.github.com> Signed-off-by: NeMo Bot <nemo-bot@nvidia.com> Co-authored-by: Dong Hyuk Chang <9426164+thomasdhc@users.noreply.github.com> Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…4.0` (#1795) fix: Add per-tensor conversion in gemma4 state_dict_adapter.py (#1764) Add per-tensor conversion in gemma4 state_dict_adapters convert_single_tensor_to_hf Signed-off-by: Abhishree Thittenamane <athittenaman@cw-dfw-cs-001-login-02.cm.cluster> Signed-off-by: NeMo Bot <nemo-bot@nvidia.com> Co-authored-by: Abhishree Thittenamane <47577437+athitten@users.noreply.github.com> Co-authored-by: Abhishree Thittenamane <athittenaman@cw-dfw-cs-001-login-02.cm.cluster>
…`r0.4.0` (#1794) feat: Enable benchmark CI testing with llm_benchmark and vlm_benchmark (#1793) * feat: Enable benchmark CI testing with llm_benchmark and vlm_benchmark folders * fix: False positive secret detection * feat: Add benchmark test artifact generation * fix: Ensure that perf configs are not overwritten * fix: Move artifact generation logic to finetune_launcher * skip broken test: comment out the skip_if_no_mamba decorator and add pytest.skip. --------- Signed-off-by: Dong Hyuk Chang <9426164+thomasdhc@users.noreply.github.com> Signed-off-by: NeMo Bot <nemo-bot@nvidia.com> Co-authored-by: Dong Hyuk Chang <9426164+thomasdhc@users.noreply.github.com> Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com> Co-authored-by: Alexandros Koumparoulis <153118171+akoumpa@users.noreply.github.com>
…ti-GPU init (1769)` into `r0.4.0` (#1797) fix: `NotImplementedError: aten::equal` on meta tensors during multi-GPU init (#1769) * fix: `notimplementederror: aten::equal` on meta tensors during multi-gpu model init with transformers >= 5.4.0 (#1765) * fix: `notimplementederror: aten::equal` on meta tensors during multi-gpu model init with transformers >= 5.4.0 (#1765) * Add pull_request types to workflow triggers * Apply suggestion from @akoumpa --------- Signed-off-by: Harsha Pasham <pashamharsha018@gmail.com> Signed-off-by: NeMo Bot <nemo-bot@nvidia.com> Co-authored-by: Harsha Pasham <53609097+harshareddy832@users.noreply.github.com> Co-authored-by: Alexandros Koumparoulis <153118171+akoumpa@users.noreply.github.com>
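The failure mode behind this fix — `torch.equal` lowers to `aten::equal`, which has no kernel for the meta device that multi-GPU init materializes weights on — can be illustrated with a hedged sketch. The guard helper `tensors_equal` below is hypothetical, not the actual patch:

```python
import torch

def tensors_equal(a: torch.Tensor, b: torch.Tensor) -> bool:
    # torch.equal raises NotImplementedError (aten::equal) on meta
    # tensors, so fall back to comparing shape and dtype only when
    # either operand lives on the meta device; compare values otherwise.
    if a.is_meta or b.is_meta:
        return a.shape == b.shape and a.dtype == b.dtype
    return torch.equal(a, b)

meta = torch.empty(2, 3, device="meta")
print(tensors_equal(meta, meta))                     # True (shape/dtype fallback)
print(tensors_equal(torch.ones(2), torch.ones(2)))   # True (real value comparison)
```

Calling `torch.equal(meta, meta)` directly is what raised `NotImplementedError: aten::equal` during init with transformers >= 5.4.0; the fallback avoids touching tensor values that do not exist yet.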
…5)` into `r0.4.0` (#1812) fix: Restrict auto-discovery scopes in generate_ci_tests.py (#1805) Signed-off-by: Dong Hyuk Chang <9426164+thomasdhc@users.noreply.github.com> Signed-off-by: NeMo Bot <nemo-bot@nvidia.com> Co-authored-by: Dong Hyuk Chang <9426164+thomasdhc@users.noreply.github.com> Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…4.0` (#1816) ci: RC6 timeout fixes for release test recipes (#1801) Signed-off-by: Dong Hyuk Chang <9426164+thomasdhc@users.noreply.github.com> Signed-off-by: NeMo Bot <nemo-bot@nvidia.com> Co-authored-by: Dong Hyuk Chang <9426164+thomasdhc@users.noreply.github.com> Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…pes (1818)` into `r0.4.0` (#1819) ci: Increase benchmark timeout for GLM and Qwen3.5 MoE LoRA recipes (#1818) Signed-off-by: Dong Hyuk Chang <9426164+thomasdhc@users.noreply.github.com> Signed-off-by: NeMo Bot <nemo-bot@nvidia.com> Co-authored-by: Dong Hyuk Chang <9426164+thomasdhc@users.noreply.github.com> Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: nemo-ci Bot <nemo-ci-bot@nvidia.com>
beep boop 🤖: Updating transformers to latest version on PyPI