qwen-aligner-toolkit

Production toolkit around Qwen3-ForcedAligner: VAD pre-segmentation, multi-language word/char-level forced alignment, and speaker diarization.

Why

qwen-asr ships an excellent forced aligner (Qwen3-ForcedAligner-0.6B) that handles multiple languages with one model — no per-language wav2vec2 needed, no kanji vocab gaps. This toolkit packages it with the surrounding pieces you typically need in production:

VAD (pyannote/segmentation-3.0) to chunk long audio
Diarization (pyannote/speaker-diarization-community-1) with per-word speaker assignment and run smoothing
Audio utilities (path / URL / bytes / np.ndarray → 16 kHz mono)
AVX-less CPU compatibility via a nagisa char-level stub

The toolkit does not do ASR. Bring your own transcript (from Whisper via vLLM, faster-whisper, qwen-asr itself, or any other system) and the toolkit will time-align it and label speakers.

Install

pip install qwen-aligner-toolkit
pip install "qwen-aligner-toolkit[full]"  # with VAD + diarization (quoted so shells like zsh don't expand the brackets)

Usage

Simple alignment (short audio)

from qwen_aligner_toolkit import Aligner

aligner = Aligner.from_pretrained()
words = aligner.align(
    text="甚至出现交易几乎停滞的情况。",
    audio="https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen3-ASR-Repo/asr_zh.wav",
    language="Chinese",
)
for w in words:
    print(w.start_time, w.end_time, w.text)

ASR segments → word-level + speakers

from qwen_aligner_toolkit import Pipeline

pipeline = Pipeline.from_pretrained(hf_token="hf_...", device="cuda")

segments = [
    {"text": "おはようございます。", "start": 0.0, "end": 2.3},
    {"text": "今日は良い天気ですね。", "start": 2.3, "end": 5.1},
]

result = pipeline.align_segments(
    segments=segments,
    audio="audio.wav",
    language="Japanese",
    diarize=True,
)

for w in result.words:
    print(f"{w.start_time:.2f}-{w.end_time:.2f} [{w.speaker}] {w.text}")
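
Per-word speaker assignment can be pictured as an overlap vote: each word takes the label of the diarization turn it overlaps the longest. The sketch below is an illustration only, not the toolkit's internals. The Word class, assign_speakers, the (start, end, speaker) tuple layout, and the tie-breaking rule are all assumptions made for this example; only the attribute names come from the snippet above.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Word:
    # Hypothetical stand-in; field names mirror the attributes printed above.
    text: str
    start_time: float
    end_time: float
    speaker: Optional[str] = None


def assign_speakers(words: List[Word],
                    turns: List[Tuple[float, float, str]]) -> List[Word]:
    """Label each word with the speaker whose turn overlaps it the longest."""
    for w in words:
        best_overlap = 0.0
        for t_start, t_end, speaker in turns:
            overlap = min(w.end_time, t_end) - max(w.start_time, t_start)
            if overlap > best_overlap:  # strict comparison: earliest turn wins ties
                best_overlap, w.speaker = overlap, speaker
    return words


words = [Word("hello", 0.0, 0.6), Word("there", 0.6, 1.2), Word("hi", 1.5, 1.9)]
turns = [(0.0, 1.0, "SPEAKER_00"), (1.0, 2.0, "SPEAKER_01")]
for w in assign_speakers(words, turns):
    print(f"{w.start_time:.2f}-{w.end_time:.2f} [{w.speaker}] {w.text}")
```

In this sketch a word that overlaps no turn keeps speaker=None, so the caller can decide how to handle it.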

VAD only

from qwen_aligner_toolkit import VAD

vad = VAD.from_pretrained(hf_token="hf_...")
chunks = vad.detect("audio.wav")
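
Raw speech regions are typically merged into bounded chunks before alignment, which is what the max_gap_sec and max_chunk_sec knobs listed under Configuration knobs control. A minimal sketch of one greedy merge strategy follows. merge_regions is a hypothetical stand-in, not the toolkit's merge_segments, and the exact merge rule is an assumption.

```python
from typing import List, Tuple

Region = Tuple[float, float]  # (start_sec, end_sec)


def merge_regions(regions: List[Region], max_chunk_sec: float = 30.0,
                  max_gap_sec: float = 0.5) -> List[Region]:
    """Greedily join neighbouring speech regions into chunks.

    A region joins the current chunk when the silence before it is at most
    max_gap_sec and the joined chunk would not exceed max_chunk_sec.
    """
    chunks: List[Region] = []
    for start, end in sorted(regions):
        if chunks:
            cur_start, cur_end = chunks[-1]
            if start - cur_end <= max_gap_sec and end - cur_start <= max_chunk_sec:
                chunks[-1] = (cur_start, max(cur_end, end))
                continue
        chunks.append((start, end))
    return chunks


regions = [(0.0, 4.0), (4.3, 9.0), (12.0, 20.0), (20.2, 45.0)]
print(merge_regions(regions))
# [(0.0, 9.0), (12.0, 20.0), (20.2, 45.0)]
```

The last region stays separate because joining it would produce a 33-second chunk. This sketch never splits a single region that is already longer than max_chunk_sec.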

Sharing the segmentation model between VAD and Diarizer

pyannote/speaker-diarization-community-1 already loads a copy of pyannote/segmentation-3.0 internally. To avoid loading it twice when you also need standalone VAD, build the VAD on top of the diarizer's segmentation submodel:

from qwen_aligner_toolkit import Diarizer, VAD

diarizer = Diarizer.from_pretrained(hf_token="hf_...")
vad = VAD.from_segmentation_model(diarizer.segmentation_model)
chunks = vad.detect("audio.wav")

Per-speaker grouped output

If you want speaker turns with joined text (one entry per consecutive same-speaker run, with short-run smoothing), call Diarizer.split_words_by_speaker on word-level dicts:

from qwen_aligner_toolkit import Diarizer

words = [
    {"word": "hi",  "start": 0.0, "end": 1.0, "speaker": "A"},
    {"word": "yes", "start": 1.0, "end": 2.0, "speaker": "B"},
]
turns = Diarizer.split_words_by_speaker(words, min_duration_sec=0.3)
# [{"text": "hi", "start_time": 0.0, "end_time": 1.0, "speaker": "A"},
#  {"text": "yes", "start_time": 1.0, "end_time": 2.0, "speaker": "B"}]

For CJK languages where you don't want a space between concatenated tokens, pass join_separator="". The dict keys are configurable via word_key / start_key / end_key / speaker_key.
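
To make the grouping and smoothing concrete, here is a minimal sketch of one way such a step could work. It is not the toolkit's implementation: group_turns is a hypothetical function, and the smoothing rule (a run shorter than min_duration_sec is folded into the preceding turn) is an assumption. The input and output keys follow the example above.

```python
from typing import Dict, List


def group_turns(words: List[Dict], min_duration_sec: float = 0.3,
                join_separator: str = " ") -> List[Dict]:
    # 1. Collect runs of consecutive words that share a speaker.
    runs: List[List[Dict]] = []
    for w in words:
        if runs and runs[-1][-1]["speaker"] == w["speaker"]:
            runs[-1].append(w)
        else:
            runs.append([w])

    # 2. Fold short runs into the preceding turn (assumed rule) and merge
    #    neighbours that end up with the same speaker.
    turns: List[Dict] = []
    for run in runs:
        speaker = run[0]["speaker"]
        duration = run[-1]["end"] - run[0]["start"]
        text = join_separator.join(w["word"] for w in run)
        absorb = bool(turns) and (
            duration < min_duration_sec or turns[-1]["speaker"] == speaker
        )
        if absorb:
            turns[-1]["text"] += join_separator + text
            turns[-1]["end_time"] = run[-1]["end"]
        else:
            turns.append({"text": text, "start_time": run[0]["start"],
                          "end_time": run[-1]["end"], "speaker": speaker})
    return turns


words = [
    {"word": "so",     "start": 0.0, "end": 1.0, "speaker": "A"},
    {"word": "uh",     "start": 1.0, "end": 1.1, "speaker": "B"},  # 0.1 s blip
    {"word": "anyway", "start": 1.1, "end": 2.0, "speaker": "A"},
    {"word": "right",  "start": 2.0, "end": 3.0, "speaker": "B"},
]
print(group_turns(words, min_duration_sec=0.3))
# [{'text': 'so uh anyway', 'start_time': 0.0, 'end_time': 2.0, 'speaker': 'A'},
#  {'text': 'right', 'start_time': 2.0, 'end_time': 3.0, 'speaker': 'B'}]
```

A short first run has no preceding turn to fold into, so this sketch keeps it as its own turn.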

Configuration knobs

The toolkit has no global config: every knob is a function parameter. The tables below map common production env-var conventions to the corresponding toolkit arguments, so you can wire them up with one-line plumbing.

VAD

Env var	Default	Toolkit argument
`ASR_VAD_ONSET`	`0.5`	`VAD.detect(onset=...)`
`ASR_VAD_OFFSET`	`0.5`	`VAD.detect(offset=...)`
`ASR_VAD_MIN_DURATION_ON`	`0.25`	`VAD.detect(min_duration_on=...)`
`ASR_VAD_MIN_DURATION_OFF`	`0.5`	`VAD.detect(min_duration_off=...)`
`ASR_VAD_MAX_CHUNK_SEC`	`30.0`	`merge_segments(max_chunk_sec=...)` / `Pipeline.vad_chunks(max_chunk_sec=...)`
`ASR_VAD_MAX_GAP_SEC`	`0.5`	`merge_segments(max_gap_sec=...)`
`ASR_VAD_PADDING_SEC`	`0.2`	`Aligner.align_segments(padding_sec=...)`
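
The VAD.detect rows above could be wired up roughly as follows. vad_detect_kwargs is a hypothetical helper written for this example. The env-var names, keyword names, and defaults are copied from the table, and the resulting dict is meant to be unpacked into VAD.detect(...).

```python
import os
from typing import Dict, Mapping, Optional

# Env var -> (VAD.detect keyword, default), as listed in the table above.
_VAD_DETECT_ENV = {
    "ASR_VAD_ONSET": ("onset", 0.5),
    "ASR_VAD_OFFSET": ("offset", 0.5),
    "ASR_VAD_MIN_DURATION_ON": ("min_duration_on", 0.25),
    "ASR_VAD_MIN_DURATION_OFF": ("min_duration_off", 0.5),
}


def vad_detect_kwargs(env: Optional[Mapping[str, str]] = None) -> Dict[str, float]:
    """Build keyword arguments for VAD.detect from environment variables."""
    env = os.environ if env is None else env
    return {kwarg: float(env.get(name, default))
            for name, (kwarg, default) in _VAD_DETECT_ENV.items()}


kwargs = vad_detect_kwargs({"ASR_VAD_ONSET": "0.6"})
print(kwargs)
# {'onset': 0.6, 'offset': 0.5, 'min_duration_on': 0.25, 'min_duration_off': 0.5}
# chunks = vad.detect("audio.wav", **kwargs)
```

Passing a plain dict instead of os.environ keeps the helper easy to unit-test.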

Diarization

Env var	Default	Toolkit argument
`HF_TOKEN`	—	`Diarizer.from_pretrained(hf_token=...)`
`ASR_DIARIZATION_MODEL`	`pyannote/speaker-diarization-community-1`	`Diarizer.from_pretrained(model_id=...)`
`ASR_SPEAKER_MIN_DURATION_SEC`	`0.3`	`Diarizer.split_words_by_speaker(min_duration_sec=...)`

Device / sample rate

Env var	Default	Toolkit argument
`ASR_DEVICE`	`cuda` (or `cpu` if no GPU)	`Aligner.from_pretrained(device_map=...)`, `Diarizer.from_pretrained(device=...)`, `VAD.from_pretrained(device=...)`, `Pipeline.from_pretrained(device=...)`

All device= / device_map= arguments accept None to auto-detect (cuda if available, else cpu).
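
If you prefer to resolve the device once in your own code and pass the same value everywhere, the auto-detect rule can be sketched as below. resolve_device is a hypothetical helper, not part of the toolkit, and it assumes PyTorch is the backend that reports CUDA availability.

```python
from typing import Optional


def resolve_device(device: Optional[str] = None) -> str:
    """Return the given device, or pick "cuda" when available, else "cpu"."""
    if device is not None:
        return device
    try:
        import torch  # imported lazily so the sketch also runs without PyTorch
    except ImportError:
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"


device = resolve_device(None)
print(device)
```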

The toolkit operates internally at 16 kHz mono. Audio is auto-resampled on load; there is no target_sample_rate knob exposed at the public API.

Feature toggles

Flags like ASR_VAD_ENABLED or ASR_DIARIZATION_ENABLED belong to the orchestration layer of your application, not the toolkit. The toolkit exposes capabilities as separate classes (VAD, Diarizer), and Pipeline.from_pretrained takes explicit with_vad= / with_diarization= flags:

Pipeline.from_pretrained(
    hf_token=HF_TOKEN if ASR_DIARIZATION_ENABLED else None,
    with_vad=ASR_VAD_ENABLED,
    with_diarization=ASR_DIARIZATION_ENABLED,
)
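
The names HF_TOKEN, ASR_VAD_ENABLED and ASR_DIARIZATION_ENABLED in the snippet above are values your application supplies. One possible way to derive them from the environment is sketched here. env_flag, the set of accepted truthy strings, and the defaults chosen are assumptions for illustration.

```python
import os
from typing import Mapping, Optional

_TRUTHY = {"1", "true", "yes", "on"}


def env_flag(name: str, default: bool = False,
             env: Optional[Mapping[str, str]] = None) -> bool:
    """Interpret an environment variable as a boolean feature toggle."""
    env = os.environ if env is None else env
    raw = env.get(name)
    if raw is None:
        return default
    return raw.strip().lower() in _TRUTHY


# Hypothetical defaults; choose whatever suits your deployment.
ASR_VAD_ENABLED = env_flag("ASR_VAD_ENABLED", default=True)
ASR_DIARIZATION_ENABLED = env_flag("ASR_DIARIZATION_ENABLED", default=False)
HF_TOKEN = os.environ.get("HF_TOKEN")
```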

CPU compatibility (AVX requirement)

qwen-asr depends on nagisa, which ships DyNet38 compiled with AVX. On AVX-less CPUs (Intel Celeron G-series, some embedded SoCs), importing nagisa crashes with SIGILL.
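
To find out whether a machine is affected before anything crashes, you can inspect the CPU flags. The sketch below is Linux-only and assumes the x86 /proc/cpuinfo layout, where each core has a "flags" line. cpu_has_avx is a hypothetical helper, not part of the toolkit.

```python
from pathlib import Path


def cpu_has_avx(cpuinfo_text: str) -> bool:
    """Return True if any `flags` line in /proc/cpuinfo-style text lists avx."""
    for line in cpuinfo_text.splitlines():
        key, _, value = line.partition(":")
        if key.strip() == "flags" and "avx" in value.split():
            return True
    return False


cpuinfo = Path("/proc/cpuinfo")
if cpuinfo.exists():  # present on Linux only
    print("AVX supported:", cpu_has_avx(cpuinfo.read_text()))
```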

If that affects you, install the char-level stub:

qwen-aligner-toolkit install-nagisa-stub
# or
python -m qwen_aligner_toolkit.compat install-stub

This downgrades Japanese tokenization from morpheme-level to character-level (other languages are unaffected). For forced alignment the difference is mostly cosmetic; speaker boundaries may even be detected at finer granularity.

For Docker:

RUN pip install qwen-aligner-toolkit && \
    qwen-aligner-toolkit install-nagisa-stub

License

Apache 2.0. See LICENSE.
