Skip to content

Claude /skills exploration #3749

Description

@thad0ctor

Inspired by Liger's liger-kernel-dev Claude skill (commit).
The reusable pattern: analyze input → structured profile → generate known files → hard validation gate.

Underlying Logic: Good skill candidates are tasks that (a) touch the same handful of files every time and (b) have a
correctness check expressible as concrete commands or a generated test.


1. Liger upstream-sync

Problem: integrations/liger/plugin.py reconciles upstream MODEL_TYPE_TO_APPLY_LIGER_FN against a
hand-maintained override set (axolotl_override_liger_fn, line ~85) and a long elif model_config_type ==
chain of bespoke patches (qwen3_5, gemma4, jamba, deepseek_v2, llama4…). When Liger ships native support for
a type Axolotl still hand-patches, CI stays green while the wrong kernel set is applied — a silent
correctness/perf regression, not a version error. (Seen for real in the liger-0.8.0 bump: qwen3_5 /
qwen3_5_moe / gemma4_text override shadowing.)

Skill: diff installed MODEL_TYPE_TO_APPLY_LIGER_FN against the elif chain + override set; report
"upstream now natively supports X, Axolotl still hand-patches it" and "upstream changed the
apply_liger_kernel_to_X signature (stale swiglu/glu param check)."

Validation gate: for each covered model_config_type, assert which path fires (native vs override) and
that the expected modules (RMSNorm, SwiGLUMLP, FLCE forward) were actually swapped.


2. Prompt-strategy / dataset-format authoring

Problem: Most repetitive contributor task. type: → module via importlib in
prompt_strategies/__init__.py. Every new format touches the same 3 places: a module in
prompt_strategies/, a golden test in tests/prompt_strategies/, a doc page in docs/dataset-formats/.

Skill: scaffold the strategy module, register the type:, generate the doc page, and emit a golden
round-trip test.

Validation gate: axolotl preprocess --debug for label-masking inspection + a generated test asserting
exact token IDs and the masked-label span (the loss-masking bug reviewers can't eyeball).


3. Integration / plugin scaffold

Problem: BasePlugin exposes ~20 lifecycle hooks (pre_model_load, post_model_build,
get_input_args, get_trainer_cls, create_optimizer…). Contributors silently skip hooks.

Skill: scaffold integrations/<name>/{__init__.py, args.py, README.md}, wire get_input_args, add a
config example + load/parse smoke test.

Validation gate: plugin loads, args parse, lifecycle hooks no-op cleanly.


4. Config-field addition

Problem: Bounded but easy to under-test in a config-driven project. docs/config-reference is
auto-generated (docs/scripts/generate_config_docs.py), so the surface is: Pydantic schema in
utils/schemas/, validator, regen docs, test.

Skill: add the field + validator, regenerate docs, scaffold a schema test.

Validation gate: schema round-trips, validator rejects bad combos, docs regenerate clean.


5. "Patch-retirement" check

Problem: Bumps are one-line + CI, but CI won't tell you which entries in monkeypatch/ and integration
shims are now redundant because upstream transformers/trl/peft caught up.

Skill: scan monkeypatch/ + shims, flag candidate-removable patches after a bump.

Validation gate: for each flagged patch, confirm upstream now provides equivalent behavior (test passes
with the patch disabled). Note: largely subsumed by #1 for the Liger elif chain.


6. Chat-template generator

Problem: tests/prompt_strategies/ already has a large chat-template suite (test_chat_templates*.py
thinking, tool-calls, mistral…). Adding a new template repeats an established pattern.

Skill: add a new chat template + auto-generate its rendered + tokenized snapshot test.

Validation gate: snapshot matches; tool-call / thinking / system-prompt variants render correctly.
Low-risk, high-acceptance.


7. Model verification harness

Problem: Threads the maintainer's needle — the bespoke part of model support (attention/RoPE/mask
patches) can't be one skill, but verifying a model can.

Skill: given a model, scaffold tiny LoRA + full configs against a tiny variant, run preprocess + a few
train steps.

Validation gate: loss is sane, sample-packing behaves, Liger/CCE compat flags resolve correctly. A
reproducible smoke gate for model PRs.

---**

✔️ Solution

Was chatting with @NanoCode012 about whether any skills are worth adding getting inspiration from Liger

❓ Alternatives

There are other dev/testing oriented skills consistent across pytorch/sglang/nemo and others that are valid but they would be lesser value:

Training-run diagnostic . The single most common pattern across all four repos: NeMo debug-training-logs, SGLang debug-distributed-hang + debug-cuda-crash, PyTorch distributed-triage + pt2-bug-basher. Axolotl already has the source material (docs/training_stability.qmd, docs/debugging.qmd); the skill turns that prose into a guided "loss is NaN / OOM / hang / loss not decreasing → diagnose" workflow. Glue over existing docs + commands, no new logic.

Issue → minimal-repro triage. PyTorch triaging-issues/scrub-issue/fix-issue, Megatron respond-to-issue. Take a user's failing config, produce a minimal repro (tiny model + tiny dataset, preserving the SFT/DPO/GRPO/LoRA/multimodal path).

📝 Additional Context

No response

Acknowledgements

  • My issue title is concise, descriptive, and in title casing.
  • I have searched the existing issues to make sure this feature has not been requested yet.
  • I have provided enough information for the maintainers to understand and evaluate this request.

Metadata

Metadata

Assignees

No one assigned

    Labels

    enhancementNew feature or request

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions