This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.
These are non-negotiable. Violating them has burned the user before.
Your training cutoff is older than the user's reality. The user is a working developer with access to the latest releases — Gemini 3.5, Claude 4.x, GPT-5 variants, whatever's actually current. If a model name, API endpoint, library version, or product feature looks unfamiliar or "doesn't exist," assume YOUR knowledge is stale, not theirs.
Concrete rules:
- Do NOT remove model entries from catalogs (e.g. `GEMINI_MODELS`, Claude/OpenAI/Grok caps maps in `backend/assistant_routes.py`) because you don't recognize them.
- Do NOT pin libraries down to versions you "know" exist when a newer one is in the lockfile.
- Do NOT replace a "preview" / "experimental" / "-latest" model id with a stable one you remember from training.
- If you genuinely need to update a model list, fetch the source of truth FIRST (WebFetch on https://ai.google.dev/gemini-api/docs/models, https://docs.anthropic.com/en/docs/about-claude/models, https://platform.openai.com/docs/models, etc.); never write from memory. Then, if you're proposing a downgrade, ASK the user first and let them confirm.
If you accidentally do downgrade, immediately fetch the docs and restore the full catalog.
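One way to make the no-removal rule mechanical is to diff a catalog's keys before and after an edit. The sketch below is illustrative only: `removed_model_ids`, `assert_no_removals`, and the dict-shaped catalog are hypothetical, and the real catalogs in `backend/assistant_routes.py` may be structured differently.

```python
def removed_model_ids(before: dict, after: dict) -> list:
    """Return ids present in `before` but missing from `after`, sorted."""
    return sorted(set(before) - set(after))


def assert_no_removals(before: dict, after: dict) -> None:
    """Raise if an edit dropped any catalog entry; additions are allowed."""
    missing = removed_model_ids(before, after)
    if missing:
        raise ValueError(f"catalog entries removed without approval: {missing}")
```

Running a check like this against the pre-edit catalog makes an accidental downgrade visible before it is committed.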
Exactly ONE ruff version exists in this repo's tooling chain at all times. It's pinned in `pyproject.toml` (`dependency-groups.dev`) AND `.github/workflows/lint.yml` (the `RUFF_VERSION` env var), and it is invoked via `uv run ruff …` so that the project venv's ruff is what runs. Symptoms of a violation: `ruff format --check` complains about reformatting files that were clean last commit, with no semantic edits in between.
Concrete rules:
- Never `pip install ruff` or `pipx install ruff` globally without matching the pinned version exactly.
- Never edit only one of the two pin sites; always update both in the same commit.
- Before committing, run `uv run ruff check .` AND `uv run ruff format --check .` from the repo root. Both must pass.
- If `ruff format` drifts after a session where nothing semantic changed, the FIRST suspect is a version mismatch; investigate before you "fix" the drift.
See the ## Ruff Configuration section below for more detail.
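A quick way to confirm the two pin sites agree is to extract both versions and compare them. This is a minimal sketch under stated assumptions: it assumes the `pyproject.toml` pin is written as `ruff==<version>` and the workflow declares `RUFF_VERSION: <version>` on its own line; the function names and the `0.0.0` version used in examples are hypothetical.

```python
import re
from typing import Optional


def ruff_pin_in_pyproject(pyproject_text: str) -> Optional[str]:
    """Find an exact `ruff==X.Y.Z` pin in pyproject.toml text (assumed form)."""
    match = re.search(r"\bruff\s*==\s*([0-9][0-9A-Za-z.]*)", pyproject_text)
    return match.group(1) if match else None


def ruff_pin_in_workflow(workflow_text: str) -> Optional[str]:
    """Find the RUFF_VERSION value in workflow YAML text (assumed form)."""
    match = re.search(
        r"""^\s*RUFF_VERSION:\s*["']?([0-9][0-9A-Za-z.]*)""",
        workflow_text,
        re.MULTILINE,
    )
    return match.group(1) if match else None


def pins_agree(pyproject_text: str, workflow_text: str) -> bool:
    """True only when both pins were found and are identical."""
    a = ruff_pin_in_pyproject(pyproject_text)
    b = ruff_pin_in_workflow(workflow_text)
    return a is not None and a == b
```

In practice the two texts would come from reading `pyproject.toml` and `.github/workflows/lint.yml` at the repo root. A `False` result means the pin sites have drifted or one pin could not be found, and either case is worth investigating before reformatting anything.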
Stable Audio 3 is a text-conditioned audio generation system. It generates audio from text prompts using a two-stage architecture: a DiT (diffusion transformer) generates latents, then the SAME autoencoder decodes them to 44.1kHz stereo audio.

    # Install dependencies
    uv sync --group dev

    # Run Gradio UI
    uv run python run_gradio.py --model medium

    # Run tests (requires model weights downloaded)
    uv run pytest

    # Run single test file
    uv run pytest tests/test_inference.py

    # Run tests and save generated audio for inspection
    uv run pytest --save-audio

    # Lint (runs on CI for PRs)
    uv run ruff check
    uv run ruff format --check

- SAME Autoencoder (`models/autoencoders.py`): Compresses 44.1kHz stereo audio to 256-dim continuous latents at 4096x downsampling. Two variants: SAME-S (266M, CPU-capable, chunked attention) and SAME-L (1.7B, GPU-required, sliding window attention).
- DiT (`models/dit.py` → `models/transformer.py`): Conditional diffusion transformer that generates SAME latents. Uses T5Gemma text conditioning, duration embeddings, and optional inpainting inputs. Three sizes: Small (433M), Medium (1.4B), Large (2.7B, API-only).

- `pipeline.py`: Public API. `StableAudioPipeline` and `AutoencoderPipeline` classes. All inference flows go through `generate()`.
- `model.py`: Model construction from config JSON. `create_diffusion_cond_from_config()` builds the full model graph.
- `model_configs.py`: Maps model names ("small", "medium", "medium-rf") to HuggingFace repo IDs and checkpoint filenames.
- `loading_utils.py`: Loads safetensor checkpoints, handles state dict key remapping between ARC/RF/standalone formats.
- `inference/sampling.py`: All samplers (Euler, RK4, DPM++, Ping-Pong). `sample_diffusion()` is the unified entry point.
- `inference/distribution_shift.py`: Timestep schedule warping (Flux shift, LogSNR shift).
- `models/conditioners.py`: `T5GemmaConditioner` loads `google/t5gemma-b-b-ul2` for text encoding. `NumberConditioner` for duration.
- `models/lora/`: LoRA implementation (parametrization, loading, stacking multiple LoRAs, per-layer filtering, interval-based activation).
- `interface/diffusion_cond.py`: Gradio UI wiring. Calls pipeline, handles file naming, audio format conversion via ffmpeg.
| Key | Type | Purpose |
|---|---|---|
| `small`, `medium` | ARC | Primary inference (post-trained, 8-step) |
| `small-rf`, `medium-rf` | RF | Base checkpoints for LoRA training |
| `same-s`, `same-l` | Autoencoder | Standalone encode/decode without DiT |
ARC and RF checkpoints bundle the autoencoder inside. Standalone SAME checkpoints share weights with the bundled versions and will reuse cached full checkpoints when available.
The DiT handles classifier-free guidance internally via batch doubling (batch_cfg=True). It also supports APG (Adaptive Projected Guidance) which projects the CFG diff orthogonal to the denoised prediction. ARC models default to cfg_scale=1 (no guidance needed); RF models use cfg_scale=7.
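The two guidance rules can be illustrated on plain vectors. This is a minimal sketch, not the repository's implementation: it passes the conditional and unconditional predictions separately (rather than via a doubled batch), treats the conditional prediction as the "denoised prediction" to project against (an assumption), and uses hypothetical function names.

```python
def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def cfg(cond, uncond, scale):
    """Classifier-free guidance: uncond + scale * (cond - uncond)."""
    return [u + scale * (c - u) for c, u in zip(cond, uncond)]


def apg(cond, uncond, scale):
    """APG-style guidance: keep only the part of (cond - uncond)
    that is orthogonal to cond, then apply it with weight (scale - 1)."""
    diff = [c - u for c, u in zip(cond, uncond)]
    norm_sq = dot(cond, cond)
    coeff = dot(diff, cond) / norm_sq if norm_sq > 0 else 0.0
    diff_orth = [d - coeff * c for d, c in zip(diff, cond)]
    return [c + (scale - 1.0) * d for c, d in zip(cond, diff_orth)]
```

With `scale=1` both functions return the conditional prediction unchanged, which is why a `cfg_scale=1` default amounts to no guidance.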
The model supports variable-length sequences without wasting compute on padding. Duration determines the latent sequence length directly. mask_padding_attention=True creates attention masks so padding positions don't corrupt valid content. Distribution shift warps the timestep schedule based on effective sequence length.
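The relationship between duration, latent length, and padding masks can be sketched with the figures given earlier (44.1kHz audio, 4096x downsampling). The rounding mode (`ceil`) and the helper names are assumptions for illustration; the actual code may compute lengths differently.

```python
import math

SAMPLE_RATE = 44_100  # Hz, from the architecture description
DOWNSAMPLE = 4096     # audio samples per latent frame


def latent_frames(duration_s: float) -> int:
    """Latent frames needed for a clip; rounding up is an assumption."""
    return math.ceil(duration_s * SAMPLE_RATE / DOWNSAMPLE)


def padding_masks(durations_s):
    """One boolean mask per batch item, padded to the longest item.
    True marks a valid latent frame, False marks padding."""
    lengths = [latent_frames(d) for d in durations_s]
    longest = max(lengths)
    return [[i < n for i in range(longest)] for n in lengths]
```

For a batch of a 10-second and a 5-second clip, the second mask is `True` only for its own frames, so attention can ignore the padded tail.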
## Ruff Configuration

⚠️ HARD RULE: RUFF VERSION. Ruff is pinned to ONE exact version in TWO places: `pyproject.toml` (`dependency-groups.dev`) and `.github/workflows/lint.yml` (`RUFF_VERSION` env var). NEVER allow these to drift, NEVER downgrade, NEVER install an older ruff "because it's still compatible," and NEVER let two ruff versions coexist anywhere in this repo's tooling chain. Upgrading is fine: bump BOTH places in the SAME commit, then run `uv sync --group dev` and `uv run ruff format .` in that same commit. If `ruff format --check .` reports drift on a working tree the user reports was previously clean, the FIRST thing to check is whether a different ruff (older, newer, system-wide, pipx) snuck into the resolution chain. Do NOT mask the issue by reformatting against a stale ruff.
Ruff excludes stable_audio_3/models, stable_audio_3/inference, stable_audio_3/interface, and stable_audio_3/data from linting. Only top-level files (pipeline.py, model.py, model_configs.py, loading_utils.py, verbose.py) are checked.
Always run from the repo root, never on a subset of dirs:

    uv run ruff check .
    uv run ruff format .

CI runs at the repo root, so ruff format backend/ tests/ alone will silently miss stable_audio_3/*.py drift. Local-dev workflow: run BOTH commands above before every commit; the pre-commit chain checks both.
Tests use session-scoped fixtures to avoid reloading models. The model_pipe fixture is parametrized over ["small", "medium"] — medium tests are auto-skipped without a CUDA GPU. --save-audio writes outputs to test_audio_outputs/ for manual listening.
This project uses Tailwind CSS v4. The following v3 forms are forbidden and will cause VS Code Problems tab warnings. Never write them; always use the v4 canonical form instead.
| FORBIDDEN (v3) | REQUIRED (v4) |
|---|---|
| `!className` (prefix important) | `className!` (suffix important) |
| `flex-shrink-0` | `shrink-0` |
| `flex-grow` | `grow` |
| `bg-gradient-to-*` | `bg-linear-to-*` |
| `bg-opacity-*` | `bg-black/50` style opacity modifier |
| `w-[300px]` when scale token exists | `w-75` (300 ÷ 4) |
| `h-[14px]` when scale token exists | `h-3.5` (14 ÷ 4) |
| `z-[15]`, `z-[25]`, `z-[200]` | `z-15`, `z-25`, `z-200` |
| `min-w-[160px]` when scale token exists | `min-w-40` |
| `min-h-[80px]` when scale token exists | `min-h-20` |
| `bg-white/[0.03]`, `bg-purple-500/[0.04]` | `bg-white/3`, `bg-purple-500/4` |
Scale token rule: Tailwind v4 spacing scale is value ÷ 4. A `[Npx]` arbitrary value maps to the scale token N ÷ 4 whenever that quotient is a whole number or lands on a 0.5 step (for example, 14px becomes 3.5). Prefer scale tokens over arbitrary values at all times.
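The arithmetic behind the scale token rule can be sketched as a small helper. This is an illustration only: the function names are hypothetical, and it assumes a token exists only when the pixel value divided by 4 is a whole number or an exact 0.5 step (no rounding, since rounding would change the rendered size).

```python
from typing import Optional


def px_to_scale_token(px: int) -> Optional[str]:
    """Return the spacing token for `px`, or None when px / 4 is not
    a whole number or an exact 0.5 step."""
    if px < 0 or px % 2 != 0:
        return None
    units = px / 4
    return str(int(units)) if units.is_integer() else str(units)


def spacing_class(prefix: str, px: int) -> str:
    """Prefer the scale token; fall back to the arbitrary-value form."""
    token = px_to_scale_token(px)
    return f"{prefix}-{token}" if token is not None else f"{prefix}-[{px}px]"
```

For the rows in the table above, `spacing_class("w", 300)` gives `w-75` and `spacing_class("h", 14)` gives `h-3.5`, while an odd value such as 13 keeps its `[13px]` form.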
Before writing any className string, mentally check it against this table.
The `pyproject.toml` CUDA index mapping only covers Linux. On Windows:
- PyTorch must be manually installed with `--index-url https://download.pytorch.org/whl/cu128`
- `soundfile` package is required (torchaudio has no default backend on Windows)
- Flash Attention requires pre-built wheels from `kingbri1/flash-attention` GitHub releases
- See `docs/windows/setup-guide.md` for full instructions
The in-app assistant answers from a RAG index built over the docs listed in `backend/rag.py` (`DOC_PATHS`). Keep it current:
- After any major update (a new feature, tab, subsystem, or behavior change that a user could ask about), update the RAG: write/revise the relevant doc AND register it in `DOC_PATHS` if it's new. Stale or missing docs degrade the assistant's answers.
- Run a regular sanity check / maintenance pass: confirm every `DOC_PATHS` entry resolves (no missing-doc warnings on startup) and flag docs that have drifted from the current UI/behavior.
- All doc/RAG changes, updates, and deletions are approval-based. Research autonomously (read, diff, identify drift) and propose, but wait for approval before editing or deleting. Never auto-delete docs.
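The "every `DOC_PATHS` entry resolves" check from the maintenance pass can be sketched as a read-only helper. It assumes `DOC_PATHS` is an iterable of repo-relative file paths, which may not match how `backend/rag.py` actually defines it; `missing_docs` is a hypothetical name.

```python
from pathlib import Path


def missing_docs(doc_paths, repo_root="."):
    """Return the entries that do not resolve to a file under repo_root.
    Read-only: it reports problems and never edits or deletes anything."""
    root = Path(repo_root)
    return [str(p) for p in doc_paths if not (root / p).is_file()]
```

An empty result means every registered doc exists. Anything it returns is a finding to propose to the user, in line with the approval-based rule above.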