Turn a Discrub Discord DM export into a clean, multi-persona, ShareGPT-style SFT corpus for Qwen (or any chat LLM), with high quality data heuristics and an optional LLM-as-judge for persona fit. Related work: Llama 3.1 8B SFT
Place raw Discrub exports under a folder you choose (the UI defaults to discord_messages/
at the repo root). Generated artifacts — out/ LoRA checkpoints, eval logs, curated JSONL —
stay on disk locally and are ignored by git (see .gitignore).
Discord exports and anything derived from them (curated logs, ShareGPT JSONL, checkpoints, eval artifacts) can include identifiers, conversation content, and inferred relationships. You—not this repository—are responsible for how you collect, process, store, and share that data.
- Respect Discord’s Terms of Service, Community Guidelines, and applicable privacy law. Only process exports you are allowed to use (typically your own data, or with clear consent).
- Do not commit raw exports, checkpoints, or datasets that identify people to a public fork or publication without permission and a clear lawful basis.
- Heuristics and optional judges reduce risk (PII scrub, dedup gates) but are not a guarantee of anonymity or suitability for redistribution.
- API keys (
OPENROUTER_API_KEY,HF_TOKEN, W&B credentials, etc.) belong in your environment or private config—never in git (see.env.example).
This software is provided as a tool; use it in compliance with your obligations to Discord, conversation participants, and third-party services you call.
| Topic | Where |
|---|---|
Training (Unsloth, uv/lockfile, GH200, merge-peft, W&B, config keys, gotchas) |
discord_sft/training/README.md |
Evaluation (lmms-eval, bootstrap, vLLM, benchmarks, artifacts, scope limits) |
discord_sft/evals/README.md |
| Shipped YAML recipes for LoRA | discord_sft/training/configs/ |
| Gradient probe matrix and probe → eval workflow | discord_sft/training/lora_search/README.md |
This project uses uv; a committed uv.lock pins dependencies for reproducible training and eval on Linux aarch64 (heavy extras such as [train] and [evals] are oriented to that target).
pip install uv # one-time
uv sync --extra dev # base + pytestStack extras as needed:
uv sync --extra dev --extra ui --extra tokenizers --extra langCommon optional groups: ui (Streamlit discord-sft ui), tokenizers (stats, validate-chat-template), lang (langdetect in curation), train (Unsloth SFT), evals / evals-gpu (benchmark harness). [train] and [evals] cannot be combined in one uv sync (transformers pin conflict)—use separate virtualenvs on the same machine. Full matrix, GH200 bootstrap scripts, bootstrap for lmms-eval, lockfile regeneration, and pip install -e fallbacks: training README and evals README.
On macOS/Windows, the data pipeline (ingest → build-sft, stats, ui) works via editable install without the lockfile, for example uv pip install -e "."; see the training readme for cross-platform testing notes.
Discrub JSON pages
│
▼ discord-sft ingest
normalized message log (parquet/jsonl per folder)
│
▼ discord-sft curate
sessions.jsonl (burst-merged, reply-inlined, gap-split, PII-scrubbed, near-dup-collapsed)
│ (optional: discord-sft curate-sweep → JSONL of metrics per setting grid)
▼ discord-sft build-sft
train.jsonl / val.jsonl / balance_report.json / turn_length_report.json (ShareGPT; one sample per session per persona, ending on that persona's last line)
│
├──► discord-sft stats (total tokens; compare Qwen3 vs Qwen3.5 tokenizers)
├──► discord-sft validate-chat-template (optional; needs `[tokenizers]` or `[train]`)
├──► discord-sft fingerprint (per-persona style profile; feeds eval-heuristics)
└──► discord-sft eval (benchmarks + persona evals; [runbook](discord_sft/evals/README.md))
discord-sft ingest --source discord_messages --out out/messages --format parquet
discord-sft curate --source discord_messages --out out/curated
discord-sft curate-sweep --source discord_messages --out out/curate_sweep.jsonl \
--sweep-session-gap-min 30,60,120 --no-near-dedup
discord-sft build-sft --input out/curated/sessions.jsonl --out out/sft
discord-sft validate-chat-template --input out/sft/train.jsonl --tokenizer Qwen/Qwen3.5-35B-A3B
discord-sft stats --input out/sft/train.jsonl --tokenizer Qwen/Qwen3-8B --tokenizer Qwen/Qwen3.5-35B-A3B
discord-sft fingerprint --input out/sft/train.jsonl --out out/sft/profiles.json
discord-sft eval-heuristics \
--references refs.txt --generated gen.txt \
--profile out/sft/profiles.json --persona <snowflake_id_from_profiles.json>uv sync --extra ui (or pip install -e ".[ui]"), then discord-sft ui. Press Ctrl+C once to stop; the launcher terminates the Streamlit child. Sections: Home, Data (ingest / curate / build SFT), Analyze (fingerprint, tokens, browsers), Train, Evaluate.
{
"system": "You are user1 chatting with user2.",
"conversations": [
{"from": "user", "value": "i can trade on US hours"},
{"from": "assistant", "value": "go set alarm for 6h\nthats when markets open"}
],
"meta": {
"persona_id": "123123123020202",
"persona_name": "user1",
"counterparty_ids": ["09123912039123"],
"session_id": "user2#2025-07-02T17:09:14Z",
"num_turns": 2
}
}Implementation lives under discord_sft.data_prep.*: drop non-chat message types and bad replies; strip emojis/mentions and optional URLs; PII scrub; burst-merge and reply-inline; gap-based sessions; session quality gates (min turns, author mix); exact and MinHash near-deduplication; optional langdetect filter with [lang].
After curate, read curate_report.json (or the Data → Curate UI): compare sessions_built vs sessions_kept and use dropped_* counts to choose which gate to adjust. Change one knob per run and a new output directory. Align --min-turns with desired ShareGPT lengths; for many num_turns: 2 rows, use curate … --min-turns 2, then check turn_length_report.json / balance_report.json after build-sft. Metrics-only grid: discord-sft curate-sweep (see --help; output JSONL has params + full report per combo).
build-sft emits at most one ShareGPT row per session per speaking author (window ends on that persona’s last turn). Persona balance (--balance median by default), ShareGPT turn caps (--min-sharegpt-turns / --max-sharegpt-turns), and optional --turn-mix are documented in --help. LoRA training, recipes, checkpoints, run.json, merge-for-vLLM: discord_sft/training/README.md.
uv sync --extra train # in a train-only venv; see training README
discord-sft train --config discord_sft/training/configs/qwen35_a3b_style_late.yamlCheckpoints live under out/lora/<run_name>/. For eval without MoE adapter quirks, merge adapters then eval the dense folder—see merge-peft in the training README. Shipped YAML reference: discord_sft/training/configs/README.md.
discord-sft eval runs lmms-eval benchmarks plus native persona evals; each run is out/evals/runs/<run_id>.json. Install, commands, tables, vLLM, --out layout (use the eval root, not …/runs), and what is out of scope for this harness: discord_sft/evals/README.md.
discord_sft.data_prep— ingest, normalize, curate, SFT build/split/balance.discord_sft.analysis—tokstats,fingerprint,heuristics.discord_sft.evals— runner, judge, benchmarks (runbook).