ストリーミングモードのdecodeを実装（precompute_renderとrender） by Yosshi999 · Pull Request #854 · VOICEVOX/voicevox_core

Yosshi999 · 2024-10-12T14:16:57Z

内容

ストリーミング処理をするための、中間表現の出力 seekable_synthesis と時間区間を指定して音声(16bit PCMバッファ)を出力する render の実装
元の synthesis はWAVバイナリを出力していたが、ストリーミング処理ではWAVの結合が面倒なのでPCMを出す形にした。

その他

qryxip

実装面でのレビューです。

あとヒホさんにも一度目を通して貰った方がよいかなと思いました。

example/python/run.py

crates/voicevox_core/src/synthesizer.rs

Hiroshiba

見ました！！！！　全体的に大きな問題はないと思います！！！！
計算も大丈夫そう！！

変数名の名称等に関してコメントしました！
あまり興味なかったら辛いと思うので、結構素直にあまり興味ないことを伝えていただければ･･･！
（ただまあ綺麗な API を提供したいという気持ちはおそらくみんなあるはずで、議論できるとこちら的にありがたいです！）

あ、推論コードのテストが書かれてないかも。
確かただ通るだけのテストが書かれてるので、もしよければ！

（これは生放送の時に @qryxip さんと話したのですが）

音声の結果が変わってないことをスナップショットで確認したいかも･･････！
ただ OS によって結果が変わるらしいので、やるならWindowsのみでテストするとかになりそう。
さすがにOSが一緒なら計算結果は一緒だろう、くらいの気持ちで。
まあ仮にPC によって計算結果が違うなら、Github Actions上でだけテストする感じにするとか。

ただこれを @Yosshi999 さんにお願いするのは、ちょっとテストのスナップショット周りが少し入り組んでいるので難しそうなのかなと感じています。
難しそうであれば @qryxip さんにお願いしたい気持ちがあります･････！

ちなみにe2eのスナップショットテストはpredict durationとintonationについては行われていそうでした。この辺りです。

voicevox_core/crates/voicevox_core_c_api/tests/e2e/testcases/compatible_engine.rs

Lines 88 to 91 in 683da55

    
           float_assert::close_l1(&phoneme_length, &EXAMPLE_DATA.duration.result, 0.01); 
        
           float_assert::close_l1(&intonation_list, &EXAMPLE_DATA.intonation.result, 0.01); 
        
           assert!(wave.iter().copied().all(f32::is_normal));

crates/voicevox_core_python_api/src/lib.rs

crates/voicevox_core/src/synthesizer.rs

crates/voicevox_core_python_api/src/lib.rs

crates/voicevox_core/src/synthesizer.rs

crates/voicevox_core_python_api/src/lib.rs

Yosshi999 · 2024-10-18T15:29:05Z

テストはどういったものを書きましょうか。デグレ防止という意味だと現状のtest_utilでは不十分だと思います（decode()のapiは変わっていないが実装をseekable版に書き換えているので）。
全体を一度で変換した場合とチャンクに区切って変換して最後に合わせた場合とを比較するというテストならいくらか簡単だと思います。ただ、現状decodeに対応するseekable版が存在しない(audio_queryを入力とするsynthesisに対してseekable_synthesis+renderはあるが、f0を直で入れるseekable版は無い)ので、testcase内でそういったものを実装することになり、APIのテストのカバレッジ向上という観点ではちょっと微妙になります

qryxip

今気付いた点として:

crates/voicevox_core/src/engine/audio_file.rs

crates/voicevox_core/src/synthesizer.rs

qryxip · 2024-10-19T13:22:35Z

旧decode版からデグレっていないことを確認するテストについては、mainブランチにスナップショットテスト的なものを…と思いましたが、 #851 はもうマージしちゃってますね。

いくつかのWAVの完全一致を手元で手動で…とか？（テストはテストで将来のために作るとして）

Hiroshiba · 2024-10-19T21:13:47Z

テスト

あっ @qryxip さんの いくつかのWAVの完全一致を手元で手動で…とか？で、一番気になる部分の確認は全く問題なさそうだと気づきました！！

@Yosshi999 さんの作ってくださったpython版サンプルが２つとも実装されているので、これでチェックするのが簡単＆安心できそうと思いました！
想像してる手順はこんな感じ：

変更前のpython版sampleで何個かTTS音声を作る
変更後のpython版sampleで同じ音声をsynthesisで作る
2の音声をrenderで作る
完全一致するか確認する（.zipにまとめるてGithubコメントにドラッグ・アンド・ドロップすれば共有できます）

みたいな･･･！WAVフォーマットがちょっと変わるとかは問題なさそう。

予感として、DirectMLだと何かしらの計算誤差が出るかも。
cpuだと少なくとも同じマシン上では計算誤差が出ないかも。もしかしたら出るかも。

Hiroshiba · 2024-10-21T16:39:53Z

@Yosshi999 比較ありがとうございます！！

別リポジトリに切り出すかどうか置いといて、「これが作られるべき」なものをファイル保存しといてそれと比較するテストのことをスナップショットテストと呼んだりします！
実際Python版のエンジンでは音声のスナップショットテストをしているので、もし参考になれば！
https://github.com/VOICEVOX/voicevox_engine/blob/dd7a7a7cd1ebe6004a57e6f7d5af2a37bcc4a33e/test/e2e/single_api/tts_pipeline/test_synthesis.py#L43
（たしかlinuxとmacとwindowsで、float32だかfloat64だかの差で計算結果が異なるので、その微差は許容するようなテストになってたと思います）

比較結果なんですが、完全一致しない感じなんですね！！！！！！
以前のものと今のものが合わない・今のものもrenderかどうかで合わない、という感じでしょうか。
これって理論上は完全一致するはず･････という理解あってそうでしょうか。
だとしたら何でずれてるんだろ。。。。。　計算するたびに値がずれるとか･････？（だとしたらもうしょうがなそうなので）

Yosshi999 · 2024-10-22T11:21:54Z

誤差については完全に謎です。トリミングしたことで（このPRの実装ではsynthesisで全体を生成したときもトリミングが入ります）計算順序がずれたのかなと思いますが、確実なことは言えないです
よく考えたらonnxを分割する前のmain branch 991fbc8とも比較すべきなので後でやります

Hiroshiba · 2024-10-22T17:51:49Z

なるほどです！！　僕もありえるとしたら計算順序の違いだろうなという気はしています！
何か確認できる方法があればいいけど、なければもうエイヤで決めてしまうしかなさそう！！

とりあえずこちらの数値誤差であれば、まあおそらく問題ないだろうとは感じています･･･！
もしかしたらマージ後にでも確認させていただくかもです！
仮に何かしら耳にわかるレベルのエラーがあったとしても、よほどなことがない限り利便性を加味するとそのまま実装した方が絶対いいと思ってます･･･！！

Yosshi999 · 2024-10-25T18:54:25Z

991fbc8 -> 7dd6738 (decodeモデル分割）では変化なし

compare main-991fbc8.wav and main-7dd6738.wav
diff max: 0.0
diff mean: 0.0

Yosshi999 · 2024-10-25T19:00:54Z

現在のPRコミットabdc696でPADDING_SIZEを0にしてデフォルトのテキストを生成したり「あ」を生成したりしてみましたがエラーにはなりませんでした（workaroundのときに言われていたように音声冒頭が切れてはいました）

Hiroshiba · 2024-10-25T23:50:54Z

何度考えてもこの機能楽しみですねぇ！！！

991fbc8 -> 7dd6738 (decodeモデル分割）では変化なし

･･･！！！！　あれ、なるほどです！
てっきりここでずれたのだと思っていたのですが、完全一致するんですね！！

これって一応確認すると、以前はdecode.onnxだけでdecode推論してたのを、sosoa.onnxとvocoder.onnxの部分に分けてdecode相当を推論しても、結果が完全一致したってことですよね。

もう１つ確認すると、このコメントの↓の結果って、どちらも「sosoaとvocoderでdecode相当を推論してる」という認識であってそうでしょうか？ 👀

compare main-7dd6738.wav and streaming_decoder-a5a08e8.wav

もしあってたら、原因箇所はonnxruntimeではないはずなので、原因はかなり絞れそうだなと！！！
（おそらくこのPR内の、decodeと保存の経路のどこか）

パッと思いつくのはwave化するときにfloat32にしてるかfloat64にしてるかとか、flameの丸め込み方向が変わったとかがありえそう！
ちょっと大変だけど１つ１つ数値をダンプしてったり関数を戻したりすればどこに原因があるかは特定できそう！！！

Yosshi999 · 2024-10-26T06:59:04Z

991fbc8 -> 7dd6738 (つまり#851 による変更)ではおっしゃる通りdecode.onnxをふたつに分割し、中間物(sosoaのoutput)はそのままvocoderに入力しています。この変更で誤差が生じなかったのはこちらの

この分割自体は出力の変化は発生していません

での言及を再現したものになってます。

mainとこのPR(streaming無し)で初めて誤差が発生していますが、おそらくこれはsynthesize時に中間物をそのまま入れるのではなく、「再生領域全体」を指定してrender()に投げているために、始端終端のパディング量がMARGINに切り揃えられたことが原因と考えられます。もちろんvocoderのreceptive fieldの幅から考えれば変化しないはずですが、、、計算順じゃないかな〜とは思います。
PRとPR(streamingあり)での変化も、本来は等価なので誤差が発生しないはずですが、謎です。

main->PR(streaming無し)での変化がどうしても気になるなら、実装を修正してsynthesize関数の挙動をmain branchのものと完全に一致させることはできます。ただrender_all関数を生やすみたいな感じで実行パスが一個増えるのを保守の観点でどう捉えるかになります

Hiroshiba · 2024-10-26T13:35:44Z

あーーーーーーーーーーーーーーーーーなるほどです！！！！！
おっしゃっていたことを完全に理解しました！！！！

確かにパディングの長さが違うというか、入力が異なるので何らかの何らかが発生してずれうる気がします！！！
切り揃えられるてるから全体入力が同じじゃない点が盲点でした、すごくよくわかりました！！！

おっしゃる通り実装を以前のに合わせれば、つまり余分に計算して切り揃えれば完全一致する予感がありますが、少なくとも実装を戻す必要はないと思います！
一度実装を戻してから完全一致するか確認できると一番安心ですが、まーーーーー別にいいかなと！！！
実際に製品版で精製してみた音声がなんか違和感あったら確認する、みたいな感じで･･･！！！

いやーーすみません、認識ずれていてお手数おかけしてしまいました、ありがとうございました！！
あとはエラーのとこが良い感じにできればマージの方向が良さそう！！

qryxip

LGTM!

sevenc-nanashi

大丈夫だと思います。

crates/voicevox_core/src/synthesizer.rs

crates/voicevox_core/src/engine/audio_file.rs

Hiroshiba

LGTM！！

ちょっとドキュメントの部分部分だけ揃えさせていただきます！！
あとちょっとタイトルだけわかりやすく変えさせていただきます！

crates/voicevox_core/src/synthesizer.rs

example/python/run.py

Co-authored-by: Nanashi. <sevenc7c@sevenc7c.com>

qryxip · 2024-10-29T15:27:34Z

@Yosshi999 このPRのdescriptionの"Solve"ですが、おそらく正しいのは"Resolve"かと思います。"Solve"だとissueはlinkされず、PRをマージしても自動で閉じられません。なので今私が手動で #853 をlinkしてからマージしようと思います。

もし意図と外れているのであれば #853 をre-openして頂ければ！

Refs: #854 (comment)

Yosshi999 added 4 commits October 10, 2024 01:09

split decoder into spectrogram and vocoder without changing API

ff3d552

cargo fmt

c916c29

rename functions, formatting

4ec25c1

split decoding process into two phases

2897a8c

Yosshi999 mentioned this pull request Oct 12, 2024

ストリーミング処理の対応 #853

Closed

Yosshi999 added 2 commits October 13, 2024 01:13

implement python api

0f2babd

pcm writeout

ec8fac0

Yosshi999 marked this pull request as ready for review October 12, 2024 18:07

Yosshi999 added 4 commits October 15, 2024 22:57

Merge branch 'main' into streaming_decoder

ccf2bd9

cargo fmt

ba99e73

add comment

af51dd2

fix minor naming

5ae184f

Yosshi999 requested review from PickledChair and qryxip and removed request for qryxip October 15, 2024 14:17

qryxip reviewed Oct 15, 2024

View reviewed changes

example/python/run.py Outdated Show resolved Hide resolved

example/python/run.py Outdated Show resolved Hide resolved

crates/voicevox_core/src/synthesizer.rs Outdated Show resolved Hide resolved

crates/voicevox_core/src/synthesizer.rs Outdated Show resolved Hide resolved

qryxip requested a review from Hiroshiba October 15, 2024 16:32

better readability, impl pcm to wavfile

43cee57

Hiroshiba reviewed Oct 16, 2024

View reviewed changes

qryxip reviewed Oct 17, 2024

View reviewed changes

Yosshi999 added 3 commits October 18, 2024 22:24

follow suggestions

38be671

follow suggestion part2

6db7b87

move wav_from_s16le to top module

f2bcf65

Yosshi999 added 2 commits October 19, 2024 00:31

I forgot to add the new file

aa6ea80

fix rustlint

fd8fd63

qryxip reviewed Oct 19, 2024

View reviewed changes

crates/voicevox_core/src/engine/audio_file.rs Outdated Show resolved Hide resolved

crates/voicevox_core/src/engine/audio_file.rs Outdated Show resolved Hide resolved

crates/voicevox_core/src/synthesizer.rs Outdated Show resolved Hide resolved

Yosshi999 added 4 commits October 27, 2024 00:19

margin validation

6e41b94

remove unnecessary parentheses

6620df5

change ERR to panic in validation error

8cb6aac

consistent arguments name

b6fcc3b

qryxip self-requested a review October 29, 2024 12:58

qryxip approved these changes Oct 29, 2024

View reviewed changes

qryxip requested review from Hiroshiba and removed request for PickledChair October 29, 2024 13:00

sevenc-nanashi approved these changes Oct 29, 2024

View reviewed changes

crates/voicevox_core/src/synthesizer.rs Outdated Show resolved Hide resolved

crates/voicevox_core/src/engine/audio_file.rs Outdated Show resolved Hide resolved

Hiroshiba approved these changes Oct 29, 2024

View reviewed changes

crates/voicevox_core/src/synthesizer.rs Outdated Show resolved Hide resolved

crates/voicevox_core/src/synthesizer.rs Outdated Show resolved Hide resolved

example/python/run.py Outdated Show resolved Hide resolved

Apply suggestions from code review

9720cbc

Co-authored-by: Nanashi. <sevenc7c@sevenc7c.com>

Hiroshiba changed the title ~~Streaming decoder~~ ストリーミングモードのdecodeを実装（precompute_renderとrender） Oct 29, 2024

qryxip linked an issue Oct 29, 2024 that may be closed by this pull request

ストリーミング処理の対応 #853

Closed

qryxip merged commit 5641e37 into VOICEVOX:main Oct 29, 2024

qryxip mentioned this pull request Oct 30, 2024

fix: Python APIとexample/python/run.pyの型付けを直す #864

Merged

qryxip added a commit that referenced this pull request Oct 30, 2024

fix: Python APIとexample/python/run.pyの型付けを直す (#864)

e094b52

Refs: #854 (comment)

This was referenced Nov 1, 2024

refactor: Synthesizerの実装をInner<_, A: Async>の形にする #865

Merged

ストリーミング処理のC API実装 #866

Closed

Yosshi999 mentioned this pull request Nov 2, 2024

fix compat breaking: revive workaround padding in decode() #867

Merged

qryxip mentioned this pull request Nov 9, 2024

docs: "APIデザインガイドライン"を追加 #870

Merged

qryxip mentioned this pull request Dec 22, 2024

ベンチマークコード #901

Closed

3 tasks

	float_assert::close_l1(&phoneme_length, &EXAMPLE_DATA.duration.result, 0.01);
	float_assert::close_l1(&intonation_list, &EXAMPLE_DATA.intonation.result, 0.01);

	assert!(wave.iter().copied().all(f32::is_normal));

Conversation

Yosshi999 commented Oct 12, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

内容

関連 Issue

その他

Uh oh!

qryxip left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Hiroshiba left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Yosshi999 commented Oct 18, 2024

Uh oh!

qryxip left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

qryxip commented Oct 19, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Hiroshiba commented Oct 19, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Hiroshiba commented Oct 21, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Yosshi999 commented Oct 22, 2024

Uh oh!

Hiroshiba commented Oct 22, 2024

Uh oh!

Yosshi999 commented Oct 25, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Yosshi999 commented Oct 25, 2024

Uh oh!

Hiroshiba commented Oct 25, 2024

Uh oh!

Yosshi999 commented Oct 26, 2024

Uh oh!

Hiroshiba commented Oct 26, 2024

Uh oh!

qryxip left a comment

Choose a reason for hiding this comment

Uh oh!

sevenc-nanashi left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Hiroshiba left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

qryxip commented Oct 29, 2024

Uh oh!

Reviewers

Yosshi999 commented Oct 12, 2024 •

edited

Loading

qryxip commented Oct 19, 2024 •

edited

Loading

Hiroshiba commented Oct 19, 2024 •

edited

Loading

Hiroshiba commented Oct 21, 2024 •

edited

Loading

Yosshi999 commented Oct 25, 2024 •

edited

Loading