
Can't reproduce performance on AliMeeting with pyannote version 4.04 #1989

@taqta

Description


Tested versions

Reproducible: pyannote 4.04

System information

Linux: Ubuntu 22.04, GPU: RTX 4090, CUDA: cuda_12.8.1_570.124.06_linux

Issue description

Thanks for your great work on this project. I'm having trouble reproducing the reported performance on the AliMeeting dataset. When comparing my output with the official results shown at https://huggingface.co/pyannote/speaker-diarization-3.1, I observe a significant DER gap: for example, on audio R8002_M8002_MS802 my DER reaches 12% (and 8% when limited to the first 180 seconds, which I use here due to file size constraints). Could you help clarify whether there are any specific settings or preprocessing steps I might be missing?

The inference code:

# instantiate the pipeline
import os

import torch
from pyannote.audio import Pipeline


pipeline = Pipeline.from_pretrained(
    "pyannote/speaker-diarization-3.1",
    cache_dir="/data/data/pretrained_model",
).to(torch.device("cuda"))

audio_path = "/data/code/pyannote-audio/out/R8002_M8002_MS802_1ch_cut.wav"

# run the pipeline on the audio file
print(f"Running pipeline on {audio_path}...")
diarization = pipeline(audio_path)
print("end of pipeline")

# write the result to disk in RTTM format
print("writing output to disk...")
output_path = "/data/code/pyannote-audio/out/inference_reproduce_cut180s.rttm"
with open(output_path, "w") as f:
    # iterate over (segment, track, label) triples of the diarization annotation
    for turn, _, speaker in diarization.speaker_diarization.itertracks(yield_label=True):
        f.write(
            f"SPEAKER {os.path.basename(audio_path)} 1 "
            f"{turn.start:.3f} {turn.end - turn.start:.3f} <NA> <NA> {speaker} <NA> <NA>\n"
        )
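To compare the written RTTM with the benchmark RTTM offline, I read both back with a minimal pure-Python parser (fields follow the SPEAKER line format written above; no pyannote dependency, and the sample line below is hypothetical):

```python
def read_rttm(lines):
    """Parse SPEAKER lines of an RTTM file into (start, end, speaker) tuples.

    RTTM SPEAKER fields: type, file id, channel, onset (s), duration (s),
    <NA>, <NA>, speaker label, <NA>, <NA>.
    """
    turns = []
    for line in lines:
        fields = line.split()
        if not fields or fields[0] != "SPEAKER":
            continue
        onset = float(fields[3])
        duration = float(fields[4])
        turns.append((onset, onset + duration, fields[7]))
    return turns

# hypothetical example line, same shape as the output written above
turns = read_rttm(["SPEAKER R8002_M8002_MS802_1ch_cut.wav 1 0.031 1.234 <NA> <NA> SPEAKER_00 <NA> <NA>"])
```
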

The audio file is 180 s long:
R8002_M8002_MS802_1ch_cut.wav

The uv environment:
uv.lock.txt
pyproject.txt

The output I get:
inference_reproduce_cut180s_rttm.txt

The result reported in the benchmark at https://huggingface.co/pyannote/speaker-diarization-3.1:
benchmark_cut180s_rttm.txt
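For a rough sanity check of the gap, DER = (missed speech + false alarm + speaker confusion) / total reference speech. This toy frame-level approximation (10 ms frames, no collar, and it assumes hypothesis labels are already mapped to reference labels, unlike the optimal mapping pyannote.metrics performs) uses hypothetical turns, not the actual AliMeeting annotations:

```python
def frame_der(ref, hyp, step=0.01):
    """Frame-level DER approximation over (start, end, speaker) turn lists."""
    end = max(t[1] for t in ref + hyp)
    n_frames = int(round(end / step))
    errors = 0.0
    total = 0.0
    for i in range(n_frames):
        t = (i + 0.5) * step  # frame midpoint
        ref_spk = {s for a, b, s in ref if a <= t < b}
        hyp_spk = {s for a, b, s in hyp if a <= t < b}
        correct = len(ref_spk & hyp_spk)
        # per frame, missed + false alarm + confusion collapses to this:
        errors += max(len(ref_spk), len(hyp_spk)) - correct
        total += len(ref_spk)
    return errors / total if total else 0.0

# hypothetical example: last 2 s of a 10 s turn assigned to the wrong speaker
der = frame_der([(0.0, 10.0, "A")], [(0.0, 8.0, "A"), (8.0, 10.0, "B")])
```

For real scoring I would still use pyannote.metrics' DiarizationErrorRate, which also handles the speaker mapping and collar.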

Minimal reproduction example (MRE)

No link.
