Description
Tested versions
Reproducible: pyannote 4.04
System information
Linux Ubuntu 22.04, GPU: 4090, CUDA installer: cuda_12.8.1_570.124.06_linux
Issue description
Thanks for your great work on this project. I'm having trouble reproducing the reported performance on the AliMeeting dataset. When comparing my reproduced output with the official results shown at https://huggingface.co/pyannote/speaker-diarization-3.1, I observed a significant DER gap: for example, on audio R8002_M8002_MS802, my DER reaches 12% (and 8% when limited to the first 180 seconds, due to file size constraints). Could you help clarify whether there are any specific settings or preprocessing steps I might be missing?
The inference code:
# instantiate the pipeline
import os

import torch

from pyannote.audio import Pipeline

pipeline = Pipeline.from_pretrained(
    "pyannote/speaker-diarization-3.1",
    cache_dir="/data/data/pretrained_model",
).to(torch.device("cuda"))

audio_path = "/data/code/pyannote-audio/out/R8002_M8002_MS802_1ch_cut.wav"

# run the pipeline on the audio file
print(f"Running pipeline on {audio_path}...")
diarization = pipeline(audio_path)
print("end of pipeline")

# write the result to disk in RTTM format
# (the RTTM URI field should be the file stem, without the .wav extension,
#  so that scoring tools can match it against the reference)
print("writing output to disk...")
output_path = "/data/code/pyannote-audio/out/inference_reproduce_cut180s.rttm"
uri = os.path.splitext(os.path.basename(audio_path))[0]
with open(output_path, "w") as f:
    # iterate over (segment, track, label) triples of the diarization result
    for turn, _, speaker in diarization.speaker_diarization.itertracks(yield_label=True):
        f.write(
            f"SPEAKER {uri} 1 {turn.start:.3f} {turn.end - turn.start:.3f} "
            f"<NA> <NA> {speaker} <NA> <NA>\n"
        )
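For comparing the two RTTM files side by side, a small pure-Python helper (a hypothetical utility, not part of pyannote) can parse the SPEAKER lines back into (start, end, speaker) turns:

```python
def parse_rttm(lines):
    """Parse RTTM SPEAKER lines into (start, end, speaker) tuples."""
    turns = []
    for line in lines:
        fields = line.split()
        if not fields or fields[0] != "SPEAKER":
            continue
        start = float(fields[3])      # field 4: turn onset (seconds)
        duration = float(fields[4])   # field 5: turn duration (seconds)
        turns.append((start, start + duration, fields[7]))  # field 8: speaker label
    return turns

# example lines in the same shape as the script's output (made-up values)
example = [
    "SPEAKER R8002_M8002_MS802_1ch_cut 1 0.031 2.250 <NA> <NA> SPEAKER_00 <NA> <NA>",
    "SPEAKER R8002_M8002_MS802_1ch_cut 1 2.500 1.000 <NA> <NA> SPEAKER_01 <NA> <NA>",
]
print(parse_rttm(example))
# → [(0.031, 2.281, 'SPEAKER_00'), (2.5, 3.5, 'SPEAKER_01')]
```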
The audio file is 180 s long:
R8002_M8002_MS802_1ch_cut.wav
The uv environment:
uv.lock.txt
pyproject.txt
The output I get:
inference_reproduce_cut180s_rttm.txt
The result reported in the benchmark at https://huggingface.co/pyannote/speaker-diarization-3.1:
benchmark_cut180s_rttm.txt
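The model card's numbers are computed with pyannote.metrics' DiarizationErrorRate. As a quick library-free sanity check before scoring properly, a simplified sketch (my own, with made-up interval values) can compare only speech coverage between the two files: it ignores speaker confusion, optimal mapping, and collars, so it only bounds the missed-speech and false-alarm components of DER:

```python
def merge(intervals):
    """Merge overlapping (start, end) intervals into a sorted, disjoint list."""
    out = []
    for s, e in sorted(intervals):
        if out and s <= out[-1][1]:
            out[-1] = (out[-1][0], max(out[-1][1], e))
        else:
            out.append((s, e))
    return out

def total(intervals):
    """Total duration covered by a disjoint interval list."""
    return sum(e - s for s, e in intervals)

def overlap(a, b):
    """Total duration where two disjoint interval lists intersect."""
    return sum(
        max(0.0, min(e1, e2) - max(s1, s2))
        for s1, e1 in a for s2, e2 in b
    )

# made-up reference and hypothesis speech regions (seconds)
ref = merge([(0.0, 5.0), (6.0, 10.0)])
hyp = merge([(0.0, 4.0), (6.5, 10.5)])

inter = overlap(ref, hyp)
missed = total(ref) - inter        # reference speech not covered by hypothesis
false_alarm = total(hyp) - inter   # hypothesis speech outside the reference
der_lower_bound = (missed + false_alarm) / total(ref)
print(round(der_lower_bound, 3))
# → 0.222
```

If even this coverage-only bound is large, the gap is in voice activity detection rather than speaker assignment; otherwise the confusion term (which this sketch does not measure) dominates.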
Minimal reproduction example (MRE)
no link