Merge latest changes #1
flaviusburca wants to merge 305 commits into invergent-ai:main from
* feat: update cce to include olmo family * chore: update docs following feedback * feat: add olmo3 config * fix: clarify 3 methods * chore: add olmo to readme
Co-authored-by: Ved <ved.work2024@gmail.com>
* feat: upgrade peft to 0.18.0 * feat: add peft_ensure_weight_tying * fix: default * chore: adjust kwarg per feedback
Co-authored-by: SalmanMohammadi <25081738+SalmanMohammadi@users.noreply.github.com>
* feat: add exaone4 chat template and update enums * fix: handle first message as system or tools in exaone4 chat template * Update src/axolotl/utils/chat_templates/templates/exaone4.jinja Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com> * fix: lint --------- Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com> Co-authored-by: NanoCode012 <nano@axolotl.ai>
* feat: add ministral and mistral3
* chore: lint
* feat: update cce for ministral
* fix: add vram usage
* feat: update for release
* fix: save_pretrained issue in v5
* fix: add instructions to use v5 branch
* fix: add to multipack
* fix: improve instructions
* fix: add model to readme
* fix: improve ministral3 docs to be clearer * fix: title * chore: wording
* fix bin size * lint --------- Co-authored-by: Ved <ved.work2024@gmail.com>
* fix: update qwen3 jinja tokenization off a few tokens * fix: add note on tokenization issue * fix: pop last index for mistral tokenizer
* support for xformers wheels for torch 2.9 * fix hf cache? * don't use hf cache from s3 * show disk free space in ci
* fix: leftover ministral docs changes * fix: pytorch_cuda_alloc_conf deprecation * fix: set old PYTORCH_CUDA_ALLOC_CONF env too * handle 2.9 separately --------- Co-authored-by: Wing Lian <wing@axolotl.ai>
* add configs for blogpost * fix configs * fixing baseline configs
* Add `peft_autocast_adapter_dtype` field to schema * Add `autocast_adapter_dtype` to `model_kwargs` * chore: docs --------- Co-authored-by: NanoCode012 <nano@axolotl.ai>
* fix: Fix evaluation loss in KD trainer
* Fix v2 strategy super() call
* fix: Add safety check for total_tokens in log method
* fix: simplified num items and outputs return handling
* fix: add missing model forward pass in compute_loss
* refactor: Use Template Method pattern for chat template strategies
* refactor: use pop(None) and remove v2 override
* chore: lint

---------

Co-authored-by: NanoCode012 <nano@axolotl.ai>
Co-authored-by: Wing Lian <wing@axolotl.ai>
* Import math and compute perplexity from loss values * lint * coderabbit changes * lint * fix: add rounding to ppl --------- Co-authored-by: NanoCode012 <nano@axolotl.ai>
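For reference, the perplexity this commit computes is just the exponential of the mean evaluation loss; a minimal sketch (the loss value and rounding precision below are illustrative):

```python
import math

eval_loss = 2.0731  # mean cross-entropy loss from an evaluation pass
perplexity = round(math.exp(eval_loss), 8)  # rounding added per the commit above
print(perplexity)  # ≈ 7.9494
```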
* add liger kernel for dpo * revert grpo changes, add support in dpo * revert grpo changes, add support in dpo * dpo_use_liger_kernel * fix liger_dpo --------- Co-authored-by: Ved <ved.work2024@gmail.com>
* init
* working
* updating configs
* removing unneeded files
* lint
* comments
* lint
* fix regex match
* bump contribs version
* comments
* fixing tests and imports
* muon imports in test v2
* test cleanup
* bump contribs version

---------

Co-authored-by: Salman Mohammadi <salman.mohammadi@outlook.com>
* fix preview docs failing due to running out of disk * fix docs publish too
The instructions offer installing `densemizer` when the package is actually `densemixer`.
* METRIC_PRECISION-> 8 * use ndigits and move env getter to top of log function --------- Co-authored-by: Ved <ved.work2024@gmail.com> Co-authored-by: Wing Lian <wing@axolotl.ai>
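A hedged sketch of the rounding knob described here; the METRIC_PRECISION env var name comes from the message, while the surrounding function is a hypothetical stand-in for the log method:

```python
import os

def log_metric(value: float) -> float:
    # Read the precision once at the top of the log function (default 8 digits).
    ndigits = int(os.environ.get("METRIC_PRECISION", "8"))
    return round(value, ndigits)

print(log_metric(0.123456789012))  # -> 0.12345679 with the default precision
```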
* fix check for fp8 capability * handle non-cuda compute * reduce concurrency of tests
* feature: raise on long sequence drop

  It is sometimes not desired that sequences are silently dropped from the dataset, especially when the dataset has been carefully crafted and pre-fitted for the training context. This would then suggest that an error occurred somewhere in the process. This feature adds a third value for excess_length_strategy called 'raise', which will raise a ValueError if a sequence is encountered that is too long and would have normally been dropped/truncated.
* tests: add excess_length_strategy tests
* doc: updated return value description for drop_long_seq_in_dataset
* add @enable_hf_offline
* fixed cfg modified after validate_config called
* hf offline fix
* fix tqdm desc when raise is used
* test: added test for non-batched case
* accidental code change revert
* test: use pytest.raises
* test: simplified drop_seq_len tests
* test: moved excess_length_strat test to test_data.py

---------

Co-authored-by: salman <salman.mohammadi@outlook.com>
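A minimal sketch of the three-way dispatch this feature describes; the option values come from the message above, while the function name and signature are illustrative rather than axolotl's actual API:

```python
# Sketch of the drop / truncate / raise behaviors for excess_length_strategy.
def handle_long_sequence(input_ids: list[int], sequence_len: int, strategy: str = "drop"):
    if len(input_ids) <= sequence_len:
        return input_ids
    if strategy == "drop":
        return None  # caller filters out None rows, as before
    if strategy == "truncate":
        return input_ids[:sequence_len]
    if strategy == "raise":
        raise ValueError(
            f"Sequence of length {len(input_ids)} exceeds sequence_len={sequence_len}"
        )
    raise ValueError(f"Unknown excess_length_strategy: {strategy}")
```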
* feat: add trackio as experiment tracking integration

  - Add TrackioConfig to integrations schema with project_name, run_name, and space_id
  - Create trackio_.py module for environment setup
  - Add is_trackio_available() utility function
  - Integrate trackio with report_to in trainer builder
  - Add trackio callback for experiment tracking
  - Add trackio config keys to gpt-oss example YAMLs
  - Trackio runs locally by default, syncs to HF Space if space_id provided
* changes
* changes
* changes
* changes
* changes
* changes
* changes
* Update requirements.txt
* don't allow pydantic 2.12 for now

---------

Co-authored-by: Abubakar Abid <aaabid93@gmail.com>
Co-authored-by: Wing Lian <wing@axolotl.ai>
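For context, a bare trackio run looks roughly like the following; trackio advertises a wandb-style API, but treat the exact keyword names here (`project`, `space_id`) as an assumption mirroring the config fields listed above rather than axolotl's wiring:

```python
import trackio

# Runs locally by default; syncs to a HF Space when space_id is given
# (project and space_id values below are placeholders).
trackio.init(project="my-finetune", space_id="username/trackio-dashboard")
trackio.log({"train/loss": 1.23})
trackio.finish()
```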
* feat: add custom kimi linear patch [skip ci]
* feat: add configuration file and fix import [skip ci]
* fix: hijack tokenizer temporarily [skip ci]
* chore: remove accidental commit
* fix: attempt patch kimi remote
* fix: kwargs passed
* fix: device for tensor
* fix: aux loss calculation
* feat: cleaned up patches order
* fix: remove duplicate tokenizer patch
* chore: add debug logs
* chore: add debug logs
* chore: debug
* Revert "chore: add debug logs"

  This reverts commit da372a5.
* Revert "chore: add debug logs"

  This reverts commit 97d1de1.
* fix: KeyError: 'tokenization_kimi'
* fix: support remote_model_id in cce patch
* feat: add config preload patch
* fix: use standard aux loss calc and updated modeling
* fix: import
* feat: add kimi-linear docs and example
* chore: add note about moe kernels
* feat: update cce to include kimi-linear
* chore: lint
* chore: update main readme
* fix: patch mechanism to address comments
* chore: lint
* fix: tests
* chore: cleanup comment
…#3330) [skip-ci] * feat: add pos id to flex attention for packing part 1 * feat: update to include sliding window mask patch * fix: suppress MatMul8bitLt: inputs will be cast from warnings * fix: remove redundant flex attention patch * chore: update olmo docs * feat: add validator patch for cross entropy
* feat: add internvl3_5
* fix: add timm instructions
* chore: add kimi-linear to cce doc
* feat: update internvl example
* chore: pin revision
* chore: remove from multipack
* fix: add to multimodal array
* fix: internvl use hf version
* feat: update cce
* chore: lint
* fix: list for image_size
* chore: add docs vram usage
* feat: enable cce
* fix: no need trust remote code
* fix: inconsistent timm version
* qwen3_5.jinja: handle list content on system messages

  The system message branch used string concatenation on messages[0].content, which breaks when the first system message uses the OpenAI-style list-of-parts format that multimodal datasets require. User and assistant branches already handle both string and list content, but the system branch did not. Check whether content is a string and fall back to iterating over parts when it is a list, matching the pattern used for user messages.

  Fixes #3590
* Address pr for other content types

---------

Co-authored-by: Joaquin Hui Gomez <joaquinhuigomez@users.noreply.github.com>
Co-authored-by: Wing Lian <wing@axolotl.ai>
…E Triton kernels (#3598)
* better handling of dora merge on Conv layers in Qwen 3.5 * address issues from code review * stricter efficient merges for dora since we now have meta model to reference
… ci] * Skip redundant evaluation when resuming from checkpoint * add condition check for adding callback --------- Co-authored-by: Wing Lian <wing@axolotl.ai>
… [skip ci] Allow loading FP8-quantized models (e.g. Mistral-Small-4-119B) with FineGrainedFP8Config and optional dequantize kwarg for full fine-tuning. Made-with: Cursor
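As a rough illustration of what this message describes, loading an FP8 checkpoint looks something like the sketch below; the model id is a placeholder, and the optional dequantize step reflects the kwarg described above rather than a documented transformers flag:

```python
from transformers import AutoModelForCausalLM, FineGrainedFP8Config

# Hypothetical sketch: load an already-FP8-quantized checkpoint so full
# fine-tuning can proceed; axolotl's actual kwarg plumbing differs.
model = AutoModelForCausalLM.from_pretrained(
    "some-org/fp8-quantized-model",
    quantization_config=FineGrainedFP8Config(),
)
# model = model.dequantize()  # optional: back to higher precision for full FT
```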
* feat: support excess_length_strategy for RL trainers

  Previously, RL data loading always dropped sequences exceeding sequence_len. This adds support for the existing `excess_length_strategy` config option (`drop`, `truncate`, `raise`) in RL training pipelines, matching the behavior already available for SFT.

  - `drop` (default): unchanged behavior, filters out long samples
  - `truncate`: tokenizes text components, truncates responses to fit within sequence_len while preserving the full prompt, then decodes back to text. Handles DPO/IPO/ORPO/SIMPO and KTO datasets.
  - `raise`: raises ValueError if any sample exceeds sequence_len

  Closes #3547
* improve RL truncation strategy robustness and performance

---------

Co-authored-by: yurekami <yurekami@users.noreply.github.com>
Co-authored-by: Wing Lian <wing@axolotl.ai>
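A rough sketch of the `truncate` behavior described above (full prompt preserved, response trimmed to fit the remaining budget, then decoded back to text); the tokenizer calls are standard HF, everything else is assumed:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")  # placeholder tokenizer

def truncate_response(prompt: str, response: str, sequence_len: int) -> str:
    # Keep the whole prompt; trim only the response (simplified from the RL path).
    prompt_ids = tokenizer(prompt, add_special_tokens=False)["input_ids"]
    response_ids = tokenizer(response, add_special_tokens=False)["input_ids"]
    budget = max(sequence_len - len(prompt_ids), 0)
    return tokenizer.decode(response_ids[:budget])

print(truncate_response("Question: 2+2=", " The answer is 4.", 8))
```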
* fix: rename model to adapter_model for fsdp sharded final model * fix: follow upstream transformer shard size * fix: handle multiple model files * fix redundant condition, tighten to safetensors, keep shard size small --------- Co-authored-by: Wing Lian <wing@axolotl.ai>
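The shard-size half of this fix follows upstream transformers' sharded save behavior; a tiny sketch, with the model id, output path, and shard size as placeholders:

```python
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("sshleifer/tiny-gpt2")
# Sharded safetensors save with an explicit max shard size, mirroring upstream.
model.save_pretrained("outputs/final-model", safe_serialization=True, max_shard_size="5GB")
```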
* bump transformers to 5.5.4 and trl to latest 1.1.0
* more upgrades
* update peft too
* adapt lora_merge to peft 0.19 layer config API

  PEFT 0.19 requires a LoraConfig object on Linear/ParamWrapper/Conv layer constructors and moved use_rslora, use_dora, fan_in_fan_out, lora_dropout, and lora_bias into that config. Build the config per branch in _build_peft_layer_and_get_delta so the merge utility works with the upgraded peft.
* allow lora_dropout on mixed attention+MoE configs under peft 0.19

  PEFT 0.19's convert_peft_config_for_transformers auto-remaps old MoE target_modules (w1/w2/w3 on Mixtral, etc.) into target_parameters for transformers v5's fused 3D expert Parameters. Those targets get wrapped with ParamWrapper, which rejects lora_dropout != 0 because the 3D einsum can't factor dropout out of lora_B(lora_A(dropout(x))). Monkeypatch ParamWrapper.__init__ to internally use a copy of the LoraConfig with lora_dropout=0, so its dropout slot becomes nn.Identity while the shared config still delivers real dropout to sibling Linear LoRA layers (attention q/k/v/o). A probe runs the same conversion on a deep copy to detect the situation and emit a warning before patching.
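The `__init__` monkeypatch described above follows a common pattern; a self-contained sketch where both classes are stand-ins for PEFT's real ones (only the attribute names come from the message):

```python
import copy

class LoraConfig:  # minimal stand-in with just the relevant field
    def __init__(self, lora_dropout: float = 0.0):
        self.lora_dropout = lora_dropout

class ParamWrapper:  # stand-in for PEFT 0.19's wrapper over fused 3D expert Parameters
    def __init__(self, config: LoraConfig):
        if config.lora_dropout != 0:
            raise ValueError("cannot factor dropout out of the 3D einsum")
        self.config = config

_orig_init = ParamWrapper.__init__

def _patched_init(self, config: LoraConfig):
    # Hand the wrapper a private copy with dropout zeroed; sibling Linear
    # LoRA layers still see the shared config's real dropout.
    config_copy = copy.deepcopy(config)
    config_copy.lora_dropout = 0
    _orig_init(self, config_copy)

ParamWrapper.__init__ = _patched_init

wrapped = ParamWrapper(LoraConfig(lora_dropout=0.05))  # now succeeds
```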
* feat: move to uv first
* fix: update doc to uv first
* fix: merge dev/tests into uv pyproject
* fix: update docker docs to match current config
* fix: migrate examples to readme
* fix: add llmcompressor to conflict
* feat: rec uv sync with lockfile for dev/ci
* fix: update docker docs to clarify how to use uv images
* chore: docs
* fix: use system python, no venv
* fix: set backend cpu
* fix: only set for installing pytorch step
* fix: remove unsloth kernel and installs
* fix: remove U in tests
* fix: set backend in deps too
* chore: test
* chore: comments
* fix: attempt to lock torch
* fix: workaround torch cuda and not upgraded
* fix: forgot to push
* fix: missed source
* fix: nightly upstream loralinear config
* fix: nightly phi3 long rope not work
* fix: forgot commit
* fix: test phi3 template change
* fix: no more requirements
* fix: carry over changes from new requirements to pyproject
* chore: remove lockfile per discussion
* fix: set match-runtime
* fix: remove unneeded hf hub buildtime
* fix: duplicate cache delete on nightly
* fix: torchvision being overridden
* fix: migrate to uv images
* fix: leftover from merge
* fix: simplify base readme
* fix: update assertion message to be clearer
* chore: docs
* fix: change fallback for cicd script
* fix: match against main exactly
* fix: peft 0.19.1 change
* fix: e2e test
* fix: ci
* fix: e2e test
…th under activation check… (#3611)

* [gemma4] fix VRAM leak in hybrid FA2+SDPA path under activation checkpointing

  Route shared_kv_states through a thread-local side channel instead of the decoder-layer kwargs so the checkpoint partial never references the dict.

  HF's Gemma4TextModel.forward passes shared_kv_states (a mutable dict used for cross-layer K/V sharing) as a kwarg to every decoder_layer call. GradientCheckpointingLayer.__call__ then forms partial(super().__call__, **kwargs), and whichever checkpoint runs (axolotl's CPU_Offloaded_Gradient_Checkpointer or torch's stock checkpoint) captures that partial. The partial holds a reference to the dict, which holds the K/V tensors produced by store_full_length_kv layers. Those tensors stay pinned for the full duration of backward, and delayed ref-cycle cleanup in torch's caching allocator under FSDP2 + activation checkpointing bleeds the residual across steps.

  Observed symptom: VRAM climbs ~0.47 GiB/step from a 42 GiB baseline, OOMs around step 73 (~94 GiB peak) on Gemma-4 31B multimodal with gemma4_hybrid_attn_impl: true. Independent of seq len / image size. All-flex-attention path is flat but ~22x slower.

  Violated invariant: anything crossing an activation-checkpoint boundary must be a tensor (refcounted by autograd) or plain Python data -- never a mutable container holding tensor references.

  Fix (all in src/axolotl/monkeypatch/models/gemma4/fused_attn.py):
  * threading.local() store with _get/_set_shared_kv_states helpers
  * _patch_decoder_layer_call(): monkeypatches Gemma4TextDecoderLayer.__call__ to pop shared_kv_states from kwargs and stash it in TLS before delegating to GradientCheckpointingLayer. The partial formed downstream no longer references the dict.
  * fused_forward reads TLS first, falls back to kwarg for callers that bypass the patched __call__ (e.g. direct attention invocation).
  * wired into patch_gemma4_fused_attn; idempotent via a sentinel.

  TLS is overwritten on each new step's first decoder-layer call, so the previous step's dict is released promptly. No changes to hybrid dispatch, FSDP wrap policy, or any config behaviour. Works for hybrid, flex, and eager paths. Introduced by PR #3598 (commit b8358aa).
* Coderabbit comment: gemma4: clear TLS unconditionally in decoder-layer patched __call__

  Overwrite the thread-local shared_kv_states store on every invocation (including with None) instead of only when the kwarg is present. The previous conditional write left stale dicts in TLS on any path that reaches Gemma4TextDecoderLayer.__call__ without a shared_kv_states kwarg — e.g. generation, eval hooks, or future HF refactors that make the kwarg optional. fused_forward would then silently consume a prior step's K/V dict instead of falling back to its own kwarg path. Unconditional write makes the invariant in the surrounding comment ("TLS is overwritten on each new step's first decoder-layer call, so the previous step's dict is released promptly") actually hold. No behavior change for the training happy path, which always passes the kwarg. Addresses CodeRabbit review on PR #3611
* fix: swap threading.local() for module-level store so autograd worker threads see shared_kv_states during backward recompute

  Previous commits fixed the memory leak on 31B but caused a type error with MoE Gemma4 variants; this fixes that. PR #3611's TLS variant only works when recompute runs on the same thread that set TLS during forward. PyTorch's C++ autograd engine (_engine_run_backward) spawns per-device worker threads to dispatch backward, and HF-Trainer gradient_checkpointing (stock torch.utils.checkpoint, non-reentrant / saved-tensor-hooks) fires unpack_hook -> recompute_fn on those worker threads. TLS set on the main thread during forward is invisible there, so _get_shared_kv_states() returns None and the consumer-layer lookup crashes with "'NoneType' object is not subscriptable" at fused_attn.py:97 (shared_kv_states[self.kv_shared_layer_index]). A plain module-level dict is visible to all threads in the process. Lifecycle is identical: the slot is overwritten each forward, releasing the previous step's dict and allowing its K/V tensors to be GC'd, so the original VRAM-leak fix still holds under FSDP2 AC too.
* scope gemma4 shared_kv_states side channel to checkpointed training

  Update PR #3611 with gate for checkpointed training to avoid regressions across async flows. Added unit tests for kwargs pop, store-clear regression, and flag gating. Condensed verbose comments.
* add gemma4 cross-thread visibility test for shared_kv_states store

  Additional regression test for MoE gemma4 variants: asserts the module-level store is readable from threads other than the one that set it, in response to the previously observed 'NoneType' error.
* fix logger

---------

Co-authored-by: Wing Lian <wing@axolotl.ai>
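The thread-visibility pitfall this commit chain describes is easy to reproduce in isolation; a minimal sketch (names like `_SHARED_KV_STORE` are illustrative, not the patch's actual identifiers):

```python
import threading

_tls = threading.local()
_SHARED_KV_STORE: dict = {"value": None}  # module-level: visible to all threads

def forward():
    _tls.value = {"kv": "tensors"}  # set on the main thread
    _SHARED_KV_STORE["value"] = {"kv": "tensors"}

def backward_recompute():
    # Simulates an autograd worker thread running recompute_fn:
    # TLS written on the main thread is invisible here...
    print("tls sees:", getattr(_tls, "value", None))  # -> None
    # ...while the module-level store is shared process-wide.
    print("store sees:", _SHARED_KV_STORE["value"])   # -> {'kv': 'tensors'}

forward()
worker = threading.Thread(target=backward_recompute)
worker.start()
worker.join()
```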
* train on remote compute using Tinker compatible APIs * chore: lint * fixes with latest hatchery changes * chore: lint
* Support loss_type/loss_weights DPO * Validate dpo loss type/weights only set for dpo * Tests: Update ipo tests to use new path * Docs: Update docs for new ipo path * PR fixes - typo/validation * PR nit - warning * chore: fix warnings arg --------- Co-authored-by: NanoCode012 <nano@axolotl.ai>
* fix dpo collation/padding
* fix DPO collator encoder-decoder pixel_values dtype and is_encoder_decoder detection

  - Use float32 instead of LongTensor for _pixel_values in encoder-decoder branch
  - Add missing padding_value case for _pixel_values in encoder-decoder branch
  - Derive is_encoder_decoder from model config instead of hardcoding False
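A small sketch of the dtype point above: pixel values are continuous, so building them as an integer tensor silently truncates them. The field names follow the message; the collation itself is simplified:

```python
import torch

pixel_values = [[0.485, 0.456, 0.406]]
wrong = torch.tensor(pixel_values, dtype=torch.long)     # -> tensor([[0, 0, 0]])
right = torch.tensor(pixel_values, dtype=torch.float32)  # values preserved

# And is_encoder_decoder should come from the model config, not a hardcoded False:
# is_encoder_decoder = getattr(model.config, "is_encoder_decoder", False)
```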
* fix: uv leftover docs * fix: docker build failing * chore: doc * fix: remove old pytorch build * fix: stop recommend flash-attn optional, let transformers pull * fix: remove ring flash attention from image * fix: quotes [skip ci] * chore: naming [skip ci]
* use smaller pretrained models for ci * more steps for loss check * fix tests * more train steps * fix losses
* fix: clarify incompat * fix: transformers api change upstream * fix: add pre prop * feat: add examples * chore: cleanup * chore: update readme
* add bitnet * switch to uv * chore: lint --------- Co-authored-by: Wing Lian <wing@axolotl.ai>
* add bitnet config * chore: lint --------- Co-authored-by: Wing Lian <wing@axolotl.ai>
…lity/concerns feature flags (#3602)

* upgrade to torchao 0.17.0
* chore: lint
* refactor attention handling
* replace legacy attention boolean flags with capability properties

  Replace checks with capability-based properties derived from attn_implementation. This separates three concerns that were conflated under flash_attention:
  1. Backend selection -> attn_implementation enum
  2. Packing capability -> attn_supports_packing property
  3. Flash-attn library dependency -> attn_uses_flash_lib property
* compute attn capability flags in normalizer instead of properties
* make attn_implementation the single source of truth
* move attention-dependent validators to mode=after
* migrate remaining consumers to canonical attn_implementation
* expand attention tests + rewrite docs
* migrate example configs to canonical attn_implementation
* update doc snippets + reject gemma4-hybrid with non-FA2 backend
* remove dead gemma4 branch in _set_attention_config
* fix duplicate attn_implementation in gpt-oss yamls and flaky caplog tests
* drop "Phase 2" naming from attn-implementation tests
* regroup attn_implementation tests by feature concern
* clean up verbose comments and remove MD

  Signed-off-by: Wing Lian <wing@axolotl.ai>
  Co-authored-by: Axolotl Swarm <no-reply@axolotl.ai>
* fix(collator): pass return_dict=True at apply_chat_template top level for transformers 5.x

  In transformers 5.x, ProcessorMixin.apply_chat_template gained its own `return_dict` parameter (defaulting to False). When return_dict=False and tokenize=True the method returns out["input_ids"] directly — a 2-D tensor — rather than the full BatchFeature dict. The old code placed `return_dict=True` inside processor_kwargs. In transformers 5.x those kwargs are forwarded to the underlying processor call self(...) where _merge_kwargs silently ignores any key not present in MllamaProcessorKwargs (emitting a warning). The outer return_dict therefore stayed False, apply_chat_template returned the raw input_ids tensor, and the subsequent `batch["input_ids"]` attempted to index a 2-D tensor with the string "input_ids", producing:

  IndexError: too many indices for tensor of dimension 2

  The fix is to pass return_dict=True as a top-level keyword argument to apply_chat_template (where it is actually consumed) and remove it from processor_kwargs (where it was silently dropped). No version guard is needed: transformers is pinned to ==5.5.4 in pyproject.toml.

  Adds a unit-level regression test (tests/test_mm_chat_collator.py) that mocks the processor to return a raw tensor when apply_chat_template is called without top-level return_dict=True, verifying the four invariants: process_rows returns a dict, input_ids is 2-D, labels is 2-D, and apply_chat_template receives return_dict=True as a top-level kwarg.

  Fixes: tests/e2e/test_llama_vision.py::TestLlamaVision::test_lora_llama_vision_multimodal_dataset
  Fixes: tests/e2e/test_llama_vision.py::TestLlamaVision::test_lora_llama_vision_text_only_dataset

  Signed-off-by: Wing Lian <wing@axolotl.ai>
  Co-authored-by: Axolotl Swarm <no-reply@axolotl.ai>
* fix(collator): process_rows returns dict (BatchFeature) shape

  Two related changes for the multimodal chat collator under transformers 5.x:

  1. Wrap apply_chat_template result in dict(...) so process_rows returns a plain dict rather than a BatchFeature instance. BatchFeature is a Mapping but not a dict; downstream code that did batch["labels"] = self.processing_strategy.process_labels(batch["input_ids"]) would index on a tensor when the result wasn't dict-shaped, raising IndexError: too many indices for tensor of dimension 2.
  2. Soften the regression test's contract from `dict` to `Mapping` so it exercises the actual semantic guarantee (key/value access) rather than the implementation detail (dict vs BatchFeature). The test guards against the original transformers 5.x breakage where apply_chat_template's return_dict default went from True to False.

  Includes regression test under tests/test_mm_chat_collator.py. Bug surfaced via swarm dispatch task_01KQHPNAYD8XARSNSDJVW1GPF6 against attn-implementation-refactor; squash-merged from agent commits 4de886fd + dc9fcf4f.

  Signed-off-by: Wing Lian <wing@axolotl.ai>

---------

Signed-off-by: Wing Lian <wing@axolotl.ai>
Co-authored-by: Axolotl Swarm <no-reply@axolotl.ai>
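The calling-convention fix reduces to where one keyword lives; a hedged sketch of the fixed call shape (the checkpoint id is a placeholder, and the kwargs shown are illustrative):

```python
from transformers import AutoProcessor

# Placeholder checkpoint; any multimodal processor with a chat template works.
processor = AutoProcessor.from_pretrained("meta-llama/Llama-3.2-11B-Vision-Instruct")
messages = [{"role": "user", "content": [{"type": "text", "text": "hi"}]}]

# return_dict=True must sit at the top level of apply_chat_template in
# transformers 5.x; buried inside processor kwargs it is silently dropped.
# dict(...) converts the returned BatchFeature into a plain dict.
batch = dict(
    processor.apply_chat_template(
        messages, tokenize=True, return_dict=True, return_tensors="pt"
    )
)
assert batch["input_ids"].ndim == 2
```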
* memory clean up for fsdp full state dict * Update src/axolotl/monkeypatch/accelerate/fsdp2.py Co-authored-by: Wing Lian <wing.lian@gmail.com> --------- Co-authored-by: Wing Lian <wing.lian@gmail.com>
…daries` (#3625)

* feat: systemic multimodal assistant-only loss masking + cfg.role_boundaries

  Fixes silent ignoring of `cfg.train_on_inputs` / `cfg.roles_to_train` / `cfg.train_on_eos` in the multimodal training path. Before this branch, only Gemma 3n honored these knobs; every other VLM trained on the full sequence regardless of config. Also adds `cfg.role_boundaries` YAML override so users can declare per-role markers without subclassing.

  What changed
  ------------
  - `ProcessingStrategy` gains a declarative boundary scanner. Each strategy declares per-role start/end markers via `_build_role_boundaries`; the shared scanner honors `train_on_inputs` / `roles_to_train` / `train_on_eos` (incl. "last").
  - New per-template strategies: Gemma 4, Llama 3.2 Vision, Llama 4, Pixtral, Mistral V7 Tekken.
  - Refactored: Gemma 3 (previously no role masking), Gemma 3n (previously ad-hoc scanner, now shared).
  - Strategies whose boundary tokens couldn't be verified offline (Voxtral, SmolVLM2, Mistral3, InternVL, GLM4V, llava/lfm2vl fallback) retain legacy behavior and emit a one-shot warning. Users can enable masking on them via `cfg.role_boundaries`.
  - Pixtral / Mistral V7 Tekken correctly handle the shared `[/INST]` token between user-end and assistant-start via `include_end=False` + scanner rewind.

  See `docs/multimodal_assistant_mask.md` for the full audit table, root-cause analysis, and design rationale.

  Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* docs+types: address CodeRabbit nitpicks on PR #7

  - builders/causal.py: add inline NOTE that multi-dataset configs reuse the first dataset's masking knobs (roles_to_train / train_on_eos) for all datasets — heterogeneous per-dataset overrides are not supported in the MM path today.
  - processing_strategies.py: annotate inner scanner helpers _match_prefix and _find_end with explicit types (Tensor, int, list[int] → bool / tuple[int, bool]) for readability.
  - docs/multimodal_assistant_mask.md: renumber the "Commits on this branch" list to 1-7 consecutive (previously skipped 3).

  Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* fix(mm-mask): address two CodeRabbit findings on PR #7

  1. Schema rejected `train_on_eos: "none"` despite the scanner honoring it. `_VALID_TRAIN_ON_EOS` accepts "none" and the design doc lists it, but `SFTDataset.train_on_eos` was `Literal["all", "turn", "last"]`, so YAML users hit a pydantic ValidationError at config load. Added "none" to the Literal and updated the description.
  2. `cfg.role_boundaries: []` had split-personality semantics: the strategy ctor treated it as "replace built-ins with empty" while the collator plumbing treated it as "unset", and both the design doc and the MultiModalConfig schema help text promised wholesale replacement for any set value. Aligned on opt-in semantics across all four surfaces — a non-empty list replaces built-ins wholesale; unset or `[]` falls back to built-ins. Rationale: honoring `[]` literally yields all-masked labels and zero gradient, which is almost always a typo or leftover rather than a deliberate user action. Users who want to disable role masking should unset the field or use `train_on_inputs: true`.

  Also sharpened the fallback one-shot warning for strategies without built-in boundaries: it names the consequence ("only pad and media tokens are masked, every other token contributes to loss") and points users at `cfg.role_boundaries` + docs/multimodal_assistant_mask.md instead of "see axolotl/processing_strategies.py for how to declare boundaries."

  Files:
  - src/axolotl/utils/schemas/datasets.py: Literal adds "none"
  - src/axolotl/processing_strategies.py: ctor truthiness check on role_boundaries_override; sharpened fallback warning
  - src/axolotl/utils/schemas/multimodal.py: role_boundaries description now calls out opt-in + empty-list fallback semantics
  - docs/multimodal_assistant_mask.md: same clarification in the Semantics block; updated the fallback-path detection paragraph to quote the new warning text
  - tests/test_processing_strategies.py: +2 regressions (test_sft_dataset_schema_accepts_all_supported_train_on_eos_values, test_empty_role_boundaries_override_falls_back_to_builtin); 63/63 pass

  Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* doc cleanup
* fix(mm-mask): CodeRabbit findings + lint fix on PR #3625

  Pre-commit failure: trailing newline missing on docs/multimodal_assistant_mask.md (end-of-file-fixer hook).

  Six CodeRabbit findings addressed:
  1. Scanner: non-trainable role's end marker ignored ``include_end``. Under ``train_on_eos="all"``, the shared ``[/INST]`` token (user-end with ``include_end=False``, intentionally re-matched as assistant-start) leaked into loss via the user branch on Pixtral / Mistral V7 Tekken. Fix: gate the non-trainable branch on ``best_match.include_end`` to mirror the trainable branch.
  2. Gemma3 ``boi_token`` lookup used ``tokenizer.special_tokens_map.get("boi_token")``, which never fires on real checkpoints (``special_tokens_map`` only holds HF's standard slots — bos/eos/pad/unk/...). Swap to direct attribute read ``getattr(tokenizer, "boi_token", None)``, matching what ``transformers.models.gemma3.processing_gemma3`` itself does. Updated the ``_gemma_tokenizer`` test fixture to mirror real-model shape so the test exercises the production code path.
  3. GLM dispatcher only registered ``Glm46VProcessor`` (GLM-4.6V / GLM-4.7V). Real ``Glm4vProcessor`` (GLM-4V / GLM-4.1V) users fell through to the base fallback. Both processors ship identical media-token markers, so register both under the shared ``Glm4vProcessingStrategy`` with independent try/except import blocks. Updated class docstring. +2 dispatcher regressions.
  4. Gemma3 ``process_labels`` hardcoded 262144 for the soft image token. Resolve dynamically via ``tokenizer.convert_tokens_to_ids("<image_soft_token>")`` with unk-id guard; fall back to 262144 only if the string isn't in vocab. Mirrors the ``Gemma4ProcessingStrategy.process_labels`` pattern.
  5. ``build_collator`` was called twice per ``build()`` (eval + train passes), producing two identical ``MM collator: ...`` INFO banners on startup. Gate the log on ``is_eval=False`` so only the training pass emits it.
  6. Removed unused ``_mistral_common_stub`` pytest fixture (13 refs → 0, always returned ``None``; the dispatcher already handles missing ``mistral_common`` via lazy import + ``try/except``).

  Added ``test_scanner_train_on_eos_all_with_non_trainable_include_end_false``, a focused scanner-level lock-in for finding #1, independent of any specific VLM strategy. Test count: 63 → 68 passing. Local ``pre-commit run --all-files`` green.

  Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* chore(mm-mask): hoist .tolist() out of scanner; shorten comments/docstrings

  - Scanner perf: convert labels[i] to a Python list once per row so _match_prefix / _find_end operate on list slices instead of re-materializing Tensor slices via .tolist() on every probe. Cuts O(n*boundaries) CPython↔C boundary crossings per batch.
  - Markdown lint (MD001, MD040): promote two h3 section headings to h2 under the h1; add `text` language to the verify-at-runtime fenced block.
  - Shorten verbose comments/docstrings added in recent commits to bare-minimum "why" notes matching the repo's existing style.

  68/68 tests, 8/8 pre-commit hooks still pass.
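The declarative boundary-scanner idea above is easy to convey in miniature; the sketch below masks everything outside assistant spans in a token list, with marker tuples standing in for `_build_role_boundaries` output (all names and token ids are illustrative):

```python
IGNORE_INDEX = -100

def mask_non_assistant(
    input_ids: list[int], boundaries: dict[str, tuple[list[int], list[int]]]
) -> list[int]:
    """Toy boundary scanner: keep loss only on tokens inside assistant spans."""
    start, end = boundaries["assistant"]
    labels = [IGNORE_INDEX] * len(input_ids)
    i = 0
    while i < len(input_ids):
        if input_ids[i : i + len(start)] == start:
            j = i + len(start)
            while j < len(input_ids) and input_ids[j : j + len(end)] != end:
                labels[j] = input_ids[j]  # inside the assistant span: trainable
                j += 1
            i = j + len(end)
        else:
            i += 1
    return labels

# e.g. start=[3], end=[4] brackets the assistant turn:
print(mask_non_assistant([1, 2, 3, 10, 11, 4, 5], {"assistant": ([3], [4])}))
# -> [-100, -100, -100, 10, 11, -100, -100]
```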
* Fix Axolotl ReLoRA optimizer reset scope
* fix: make relora reset method honor relora_prune_ratio

  When relora_prune_method='reset' and relora_prune_ratio is explicitly set, the ratio was silently ignored and replaced with the hardcoded _FULL_RESET_RATIO (0.999). Fix by moving the default-ratio logic to ReLoRACallback.on_step_begin: None maps to _FULL_RESET_RATIO for reset and 0.9 for other methods. reset_optimizer now uses the same random pruning path for both 'random' and 'reset'.

  Also consolidate a three-layer default mismatch: schema default for relora_prune_method is now 'magnitude' (single canonical source); dataclass defaults for both fields changed to None to eliminate the conflicting fallback layer.

  Tests updated: removed the test case that verified the old broken behavior (reset ignoring ratio), added two cases proving reset honors the passed ratio. E2E reset fixture now uses ratio=0.5 to make it unambiguous that the ratio is honored.
* Fix ReLoRA uint8 pruning regression

---------

Signed-off-by: Wing Lian <wing@axolotl.ai>
Co-authored-by: Axolotl Swarm <no-reply@axolotl.ai>
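The default-ratio mapping the first fix describes is worth spelling out; constant names come from the message, while the function is a hypothetical stand-in for the logic in ReLoRACallback.on_step_begin:

```python
_FULL_RESET_RATIO = 0.999  # constant named in the commit message

def resolve_prune_ratio(prune_method: str, prune_ratio: float | None) -> float:
    # None means "use the method's default"; an explicit ratio is always
    # honored, including for 'reset' (previously it was silently overridden).
    if prune_ratio is not None:
        return prune_ratio
    return _FULL_RESET_RATIO if prune_method == "reset" else 0.9

assert resolve_prune_ratio("reset", None) == 0.999
assert resolve_prune_ratio("reset", 0.5) == 0.5
assert resolve_prune_ratio("magnitude", None) == 0.9
```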