On the zero-shot and cross-lingual zero-shot test sets, the original paper (https://arxiv.org/pdf/2505.17589, Table 5, Table 6, Table 8) reports the following WER results:

| Test set | zh | en | hard_zh | hard_en | en2zh | zh2en |
| --- | --- | --- | --- | --- | --- | --- |
| WER (%) | 4.08 | 6.32 | 12.58 | 11.96 | 13.5 | 6.47 |

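WER results I obtained on CV3-EVAL with the open-source CV2 model:

| Test set | zh | en | hard_zh | hard_en | en2zh | zh2en |
| --- | --- | --- | --- | --- | --- | --- |
| WER (%) | 4.51 | 9.36 | 10.99 | 11.81 | 11.71 | 10.64 |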

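The en (6.32 → 9.36) and zh2en (6.47 → 10.64) results are noticeably worse than in the paper, while the hard_zh (12.58 → 10.99) and en2zh (13.5 → 11.71) results are noticeably better.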

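Does the test method used in the paper differ from the one in CV3-EVAL? Was any additional processing applied to the prompt audio when synthesizing with the CV2 model?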

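When computing WER, the script directly averages the per-sentence WER percentages. The standard average WER is computed over the whole test set: the total number of insertion, deletion, and substitution errors divided by the total number of characters in the reference text. Is the WER calculation here incorrect?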

```python
for line in open(infile, "r").readlines():
    # one utterance per line: path, per-utterance WER, texts, and error counts
    wav_path, wer, text_ref, text_res, inse, dele, subs = line.strip().split("\t")
    if float(wer) > 0.5:
        n_higher_than_50 += 1
    else:
        wers_below50.append(float(wer))

    wers.append(float(wer))
    wers_clip.append(min(float(wer), 1.0))
    inses.append(float(inse))
    deles.append(float(dele))
    subses.append(float(subs))
    fout.write(line)

# the reported WER is the plain mean of the per-utterance WER values
wer = round(np.mean(wers)*100,3)
```
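
To make the difference concrete, here is a minimal sketch that computes both numbers from lines in the same tab-separated layout. The function name `compare_wer` and the example rows are hypothetical. It also assumes that `text_ref` is whitespace-tokenized (words for English, space-separated characters for Chinese), which may not match the actual output files.

```python
def compare_wer(lines):
    """Return (mean of per-utterance WER, corpus-level WER), both in percent.

    Assumed line layout: wav_path, wer, text_ref, text_res, inse, dele, subs.
    """
    utt_wers = []
    total_errors = 0.0
    total_ref_tokens = 0
    for line in lines:
        _, wer, text_ref, _, inse, dele, subs = line.rstrip("\n").split("\t")
        utt_wers.append(float(wer))
        total_errors += float(inse) + float(dele) + float(subs)
        # Assumption: reference tokens are separated by whitespace.
        total_ref_tokens += len(text_ref.split())
    if not utt_wers or total_ref_tokens == 0:
        raise ValueError("no reference tokens found")
    mean_wer = 100.0 * sum(utt_wers) / len(utt_wers)
    corpus_wer = 100.0 * total_errors / total_ref_tokens
    return mean_wer, corpus_wer


# Hypothetical rows: a short utterance with one error and a long one with none.
example = [
    "a.wav\t0.5\thello world\thello word\t0\t0\t1",
    "b.wav\t0.0\tone two three four five six seven eight"
    "\tone two three four five six seven eight\t0\t0\t0",
]
mean_wer, corpus_wer = compare_wer(example)
print(mean_wer, corpus_wer)  # 25.0 10.0
```

In this toy case the per-utterance mean (25%) is far above the corpus-level value (10%) because the short utterance counts as much as the long one. The two metrics can diverge in either direction depending on how errors are distributed across utterance lengths.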

CosyVoice2 model cannot reproduce the results in the CosyVoice3 paper #4
