<html lang="en-US"><head>
<meta charset="utf-8">
<meta http-equiv="X-UA-Compatible" content="IE=edge">
<meta name="viewport" content="width=device-width,maximum-scale=2">
<link rel="stylesheet" type="text/css" media="screen" href="./assets/css/style.css">
<style>
li {
list-style-type: disc;
}
</style>
<!-- Begin Jekyll SEO tag v2.7.1 -->
<title>Demo for PerTTS</title>
<meta name="generator" content="Jekyll v3.9.0">
<meta property="og:title" content="Abstract">
<meta property="og:locale" content="en_US">
<meta name="description" content="">
<meta property="og:description" content="">
<link rel="canonical" href="https://thuhcsi.github.io/PerTTS/">
<meta property="og:url" content="https://thuhcsi.github.io/PerTTS/">
<meta property="og:site_name" content="PerTTS: Personalized and Controllable Zero-shot Spontaneous Style Text-to-Speech Synthesis">
<meta name="twitter:card" content="summary">
<meta property="twitter:title" content="Abstract">
<script type="application/ld+json">
{"description":"","url":"https://thuhcsi.github.io/PerTTS/","@type":"WebSite","headline":"Abstract","name":"PerTTS: Personalized and Controllable Zero-shot Spontaneous Style Text-to-Speech Synthesis","@context":"https://schema.org"}</script>
<!-- End Jekyll SEO tag -->
</head>
<body>
<!-- HEADER -->
<div id="header_wrap" class="outer">
<header class="inner">
<img id="lab_logo" src="./assets/images/logo.svg"/>
<div>
<div style="width: 70%;">
<h1 id="project_title">PerTTS: Personalized and Controllable Zero-shot Spontaneous Style Text-to-Speech Synthesis</h1>
</div>
</div>
</header>
</div>
<!-- MAIN CONTENT -->
<div id="main_content_wrap" class="outer">
<section id="main_content" class="inner">
<h1 id="abstract">Abstract</h1>
<p>In spoken scenarios, achieving personalized and controllable zero-shot spontaneous style speech synthesis is highly significant, particularly in generating natural and expressive speech for unseen speakers under data-limited conditions. Traditional methods typically achieve this by fine-tuning pre-trained multi-speaker speech synthesis models or adopting zero-shot adaptation techniques. However, these methods exhibit limitations in voice cloning and style modeling, struggling to capture fine-grained voice characteristics and complex speaking styles of target speakers. In this paper, we propose PerTTS, a personalized and controllable zero-shot spontaneous speech synthesis method. This approach introduces a personalized speaking style encoder that utilizes pre-trained models and a local prosody encoder to extract semantic, duration, timbre and prosody information from multiple reference utterances of the target speaker, thereby forming a comprehensive personalized representation of speaking style. Furthermore, we employ knowledge distillation to learn spontaneous behavior patterns and incorporate a multi-modal pseudo label detector to extract labels from unlabeled data, enabling modeling and control of spontaneous behaviors. This mechanism significantly enhances the naturalness and spontaneity of the synthesized speech. Experimental results demonstrate that PerTTS significantly outperforms existing models in terms of speaking style similarity and speech naturalness. The introduction of personalized speaking style representations effectively improves style similarity, and the incorporation of spontaneous behavior modeling further improves the naturalness and spontaneity of the synthesized speech, while enabling controllable generation of spontaneous behaviors.</p>
<div class="image-container">
<!-- Insert the PNG image -->
<img src="./assets/images/modelall0310.png" alt="model_all">
<!-- Add the caption -->
<figcaption>The architecture of our proposed model.</figcaption>
</div>
<style>
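/* Image container styles */
.image-container {
display: flex; /* use flexbox layout */
flex-direction: column; /* stack child elements vertically */
align-items: center; /* center horizontally */
justify-content: center; /* center vertically (if needed) */
text-align: center; /* keep the caption text centered too */
margin: 0 auto; /* center the container itself horizontally (applies when it has a fixed width) */
max-width: 600px; /* optional: cap the container's maximum width */
}
/* Image styles */
.image-container img {
max-width: 100%; /* make the image fit the container width */
height: auto; /* preserve the image aspect ratio */
}
/* Caption styles */
.image-container figcaption {
font-size: 16px; /* caption font size */
font-weight: bold; /* bold caption */
margin-top: 10px; /* spacing between the image and the caption */
}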
</style>
<h1 id="Audio samples for different models">Audio samples for different models</h1>
<p>
<li><strong>GT :</strong> Ground-truth audio.</li>
<li><strong>VALL-E :</strong> An open-source implementation of VALL-E. We first pre-train the model on large-scale Chinese datasets and then fine-tune it on HQ-conversations.</li>
<li><strong>BASE-LPE :</strong> VALL-E with the LPE extracted by the local prosody encoder.</li>
<li><strong>BASE-style :</strong> VALL-E with the style embedding extracted by the personalized speaking style encoder, which comprises semantic, duration, prosody, and timbre information.</li>
<li><strong>PerTTS :</strong> Our proposed personalized, controllable zero-shot spontaneous style speech synthesis model. It consists of the VALL-E backbone, a personalized speaking style encoder, and a label encoder. In this model, we assume that the spontaneous labels are given.</li>
<li><strong>PerTTS(w/o label) / w/ style emb input :</strong> The same architecture as PerTTS, but the pseudo labels for spontaneous behaviors are obtained from the output of the NAR predictor.</li>
<li><strong>PerTTS(w/o label) / w/o style emb input :</strong> The pseudo labels for spontaneous behaviors are obtained from the output of an NAR predictor that was trained without the style embedding.</li>
</p>
<h3 id="Group1">Group1</h3>
<table>
<thead>
<tr>
<th style="text-align: left">Target Chinese Text</th>
<th style="text-align: left">Prompt Speech</th>
<th style="text-align: left">GT</th>
<th style="text-align: left">VALL-E</th>
<th style="text-align: left">BASE-LPE</th>
<th style="text-align: left">BASE-style(Proposed)</th>
</tr>
</thead>
<tbody>
<tr>
<td style="text-align: left">地势还是啥反正就是,各种各样的环境都非常多样化,所以它的景色也非常的丰富。</td>
<td style="text-align: left"><audio controls=""><source src="./wavs/Group1/prompt/pG0012_S0020_val_targe_G0012_S0040.wav" type="audio/wav">Your browser does not support the audio element.</audio></td>
<td style="text-align: left"><audio controls=""><source src="./wavs/Group1/GT/pG0012_S0020_val_targe_G0012_S0040.wav" type="audio/wav">Your browser does not support the audio element.</audio></td>
<td style="text-align: left"><audio controls=""><source src="./wavs/Group1/bs/pG0012_S0020_val_targe_G0012_S0040.wav" type="audio/wav">Your browser does not support the audio element.</audio></td>
<td style="text-align: left"><audio controls=""><source src="./wavs/Group1/LPE/pG0012_S0020_val_targe_G0012_S0040.wav" type="audio/wav">Your browser does not support the audio element.</audio></td>
<td style="text-align: left"><audio controls=""><source src="./wavs/Group1/Ours/pG0012_S0020_val_targe_G0012_S0040.wav" type="audio/wav">Your browser does not support the audio element.</audio></td>
</tr>
<tr>
<td style="text-align: left">就是对一些基础的问题,但是真的可以回答的很好。</td>
<td style="text-align: left"><audio controls=""><source src="./wavs/Group1/prompt/pG0012_S0020_val_targe_G0012_S0173.wav" type="audio/wav">Your browser does not support the audio element.</audio></td>
<td style="text-align: left"><audio controls=""><source src="./wavs/Group1/GT/pG0012_S0020_val_targe_G0012_S0173.wav" type="audio/wav">Your browser does not support the audio element.</audio></td>
<td style="text-align: left"><audio controls=""><source src="./wavs/Group1/bs/pG0012_S0020_val_targe_G0012_S0173.wav" type="audio/wav">Your browser does not support the audio element.</audio></td>
<td style="text-align: left"><audio controls=""><source src="./wavs/Group1/LPE/pG0012_S0020_val_targe_G0012_S0173.wav" type="audio/wav">Your browser does not support the audio element.</audio></td>
<td style="text-align: left"><audio controls=""><source src="./wavs/Group1/Ours/pG0012_S0020_val_targe_G0012_S0173.wav" type="audio/wav">Your browser does not support the audio element.</audio></td>
</tr>
<tr>
<td style="text-align: left">它就是通过人工智能哎给咱们推荐,你喜欢哪个视频呀你不喜欢哪个视频是吧,通过这个推送。</td>
<td style="text-align: left"><audio controls=""><source src="./wavs/Group1/prompt/pG0209_S0095_val_targe_G0209_S0113.wav" type="audio/wav">Your browser does not support the audio element.</audio></td>
<td style="text-align: left"><audio controls=""><source src="./wavs/Group1/GT/pG0209_S0095_val_targe_G0209_S0113.wav" type="audio/wav">Your browser does not support the audio element.</audio></td>
<td style="text-align: left"><audio controls=""><source src="./wavs/Group1/bs/pG0209_S0095_val_targe_G0209_S0113.wav" type="audio/wav">Your browser does not support the audio element.</audio></td>
<td style="text-align: left"><audio controls=""><source src="./wavs/Group1/LPE/pG0209_S0095_val_targe_G0209_S0113.wav" type="audio/wav">Your browser does not support the audio element.</audio></td>
<td style="text-align: left"><audio controls=""><source src="./wavs/Group1/Ours/pG0209_S0095_val_targe_G0209_S0113.wav" type="audio/wav">Your browser does not support the audio element.</audio></td>
</tr>
<tr>
<td style="text-align: left">对,确实因为人工智能到现在还没有被这个这个普及我感觉没有被普及。</td>
<td style="text-align: left"><audio controls=""><source src="./wavs/Group1/prompt/pG0234_S0156_val_targe_G0234_S0024.wav" type="audio/wav">Your browser does not support the audio element.</audio></td>
<td style="text-align: left"><audio controls=""><source src="./wavs/Group1/GT/pG0234_S0156_val_targe_G0234_S0024.wav" type="audio/wav">Your browser does not support the audio element.</audio></td>
<td style="text-align: left"><audio controls=""><source src="./wavs/Group1/bs/pG0234_S0156_val_targe_G0234_S0024.wav" type="audio/wav">Your browser does not support the audio element.</audio></td>
<td style="text-align: left"><audio controls=""><source src="./wavs/Group1/LPE/pG0234_S0156_val_targe_G0234_S0024.wav" type="audio/wav">Your browser does not support the audio element.</audio></td>
<td style="text-align: left"><audio controls=""><source src="./wavs/Group1/Ours/pG0234_S0156_val_targe_G0234_S0024.wav" type="audio/wav">Your browser does not support the audio element.</audio></td>
</tr>
<tr>
<td style="text-align: left">像像我妈那种工工薪阶层基本都是。</td>
<td style="text-align: left"><audio controls=""><source src="./wavs/Group1/prompt/pG0293_S0447_val_targe_G0293_S0404.wav" type="audio/wav">Your browser does not support the audio element.</audio></td>
<td style="text-align: left"><audio controls=""><source src="./wavs/Group1/GT/pG0293_S0447_val_targe_G0293_S0404.wav" type="audio/wav">Your browser does not support the audio element.</audio></td>
<td style="text-align: left"><audio controls=""><source src="./wavs/Group1/bs/pG0293_S0447_val_targe_G0293_S0404.wav" type="audio/wav">Your browser does not support the audio element.</audio></td>
<td style="text-align: left"><audio controls=""><source src="./wavs/Group1/LPE/pG0293_S0447_val_targe_G0293_S0404.wav" type="audio/wav">Your browser does not support the audio element.</audio></td>
<td style="text-align: left"><audio controls=""><source src="./wavs/Group1/Ours/pG0293_S0447_val_targe_G0293_S0404.wav" type="audio/wav">Your browser does not support the audio element.</audio></td>
</tr>
<tr>
<td style="text-align: left">呃,喜欢泡的还是,直接就是喝水的。</td>
<td style="text-align: left"><audio controls=""><source src="./wavs/Group1/prompt/pG0515_S0317_val_targe_G0515_S0242.wav" type="audio/wav">Your browser does not support the audio element.</audio></td>
<td style="text-align: left"><audio controls=""><source src="./wavs/Group1/GT/pG0515_S0317_val_targe_G0515_S0242.wav" type="audio/wav">Your browser does not support the audio element.</audio></td>
<td style="text-align: left"><audio controls=""><source src="./wavs/Group1/bs/pG0515_S0317_val_targe_G0515_S0242.wav" type="audio/wav">Your browser does not support the audio element.</audio></td>
<td style="text-align: left"><audio controls=""><source src="./wavs/Group1/LPE/pG0515_S0317_val_targe_G0515_S0242.wav" type="audio/wav">Your browser does not support the audio element.</audio></td>
<td style="text-align: left"><audio controls=""><source src="./wavs/Group1/Ours/pG0515_S0317_val_targe_G0515_S0242.wav" type="audio/wav">Your browser does not support the audio element.</audio></td>
</tr>
</tbody>
</table>
<hr>
<h3 id="Group2">Group2</h3>
<table>
<thead>
<tr>
<th style="text-align: left">Target Chinese Text</th>
<th style="text-align: left">GT</th>
<th style="text-align: left">BASE-style</th>
<th style="text-align: left">PerTTS(Proposed)</th>
<th style="text-align: left">PerTTS(w/o label) / w/ style emb input</th>
<th style="text-align: left">PerTTS(w/o label) / w/o style emb input</th>
</tr>
</thead>
<tbody>
<tr>
<td style="text-align: left">也也给别的地方拉一下旅游,拉拉动一下旅游产业的发展。</td>
<td style="text-align: left"><audio controls=""><source src="./wavs/Group2/prompt/pG0012_S0020_val_targe_G0012_S0116.wav" type="audio/wav">Your browser does not support the audio element.</audio></td>
<td style="text-align: left"><audio controls=""><source src="./wavs/Group2/Premodel/pG0012_S0020_val_targe_G0012_S0116.wav" type="audio/wav">Your browser does not support the audio element.</audio></td>
<td style="text-align: left"><audio controls=""><source src="./wavs/Group2/OursWlabel/pG0012_S0020_val_targe_G0012_S0116.wav" type="audio/wav">Your browser does not support the audio element.</audio></td>
<td style="text-align: left"><audio controls=""><source src="./wavs/Group2/OursWoLabel/pG0012_S0020_val_targe_G0012_S0116.wav" type="audio/wav">Your browser does not support the audio element.</audio></td>
<td style="text-align: left"><audio controls=""><source src="./wavs/Group2/bsWoLabel/pG0012_S0020_val_targe_G0012_S0116.wav" type="audio/wav">Your browser does not support the audio element.</audio></td>
</tr>
<tr>
<td style="text-align: left">工地好像并不是技术,工地是靠蛮力啊,靠力气。</td>
<td style="text-align: left"><audio controls=""><source src="./wavs/Group2/prompt/pG0515_S0317_val_targe_G0515_S0090.wav" type="audio/wav">Your browser does not support the audio element.</audio></td>
<td style="text-align: left"><audio controls=""><source src="./wavs/Group2/Premodel/pG0515_S0317_val_targe_G0515_S0090.wav" type="audio/wav">Your browser does not support the audio element.</audio></td>
<td style="text-align: left"><audio controls=""><source src="./wavs/Group2/OursWlabel/pG0515_S0317_val_targe_G0515_S0090.wav" type="audio/wav">Your browser does not support the audio element.</audio></td>
<td style="text-align: left"><audio controls=""><source src="./wavs/Group2/OursWoLabel/pG0515_S0317_val_targe_G0515_S0090.wav" type="audio/wav">Your browser does not support the audio element.</audio></td>
<td style="text-align: left"><audio controls=""><source src="./wavs/Group2/bsWoLabel/pG0515_S0317_val_targe_G0515_S0090.wav" type="audio/wav">Your browser does not support the audio element.</audio></td>
</tr>
<tr>
<td style="text-align: left">那么你一个人管得来台球馆吗?</td>
<td style="text-align: left"><audio controls=""><source src="./wavs/Group2/prompt/pG0515_S0317_val_targe_G0515_S0009.wav" type="audio/wav">Your browser does not support the audio element.</audio></td>
<td style="text-align: left"><audio controls=""><source src="./wavs/Group2/Premodel/pG0515_S0317_val_targe_G0515_S0009.wav" type="audio/wav">Your browser does not support the audio element.</audio></td>
<td style="text-align: left"><audio controls=""><source src="./wavs/Group2/OursWlabel/pG0515_S0317_val_targe_G0515_S0009.wav" type="audio/wav">Your browser does not support the audio element.</audio></td>
<td style="text-align: left"><audio controls=""><source src="./wavs/Group2/OursWoLabel/pG0515_S0317_val_targe_G0515_S0009.wav" type="audio/wav">Your browser does not support the audio element.</audio></td>
<td style="text-align: left"><audio controls=""><source src="./wavs/Group2/bsWoLabel/pG0515_S0317_val_targe_G0515_S0009.wav" type="audio/wav">Your browser does not support the audio element.</audio></td>
</tr>
<tr>
<td style="text-align: left">让他去带动另一批人,然后就让另一批人去调动下一批人,循环往复嘛对吧,形成一个良性循环。</td>
<td style="text-align: left"><audio controls=""><source src="./wavs/Group2/prompt/pG0209_S0095_val_targe_G0209_S0247.wav" type="audio/wav">Your browser does not support the audio element.</audio></td>
<td style="text-align: left"><audio controls=""><source src="./wavs/Group2/Premodel/pG0209_S0095_val_targe_G0209_S0247.wav" type="audio/wav">Your browser does not support the audio element.</audio></td>
<td style="text-align: left"><audio controls=""><source src="./wavs/Group2/OursWlabel/pG0209_S0095_val_targe_G0209_S0247.wav" type="audio/wav">Your browser does not support the audio element.</audio></td>
<td style="text-align: left"><audio controls=""><source src="./wavs/Group2/OursWoLabel/pG0209_S0095_val_targe_G0209_S0247.wav" type="audio/wav">Your browser does not support the audio element.</audio></td>
<td style="text-align: left"><audio controls=""><source src="./wavs/Group2/bsWoLabel/pG0209_S0095_val_targe_G0209_S0247.wav" type="audio/wav">Your browser does not support the audio element.</audio></td>
</tr>
<tr>
<td style="text-align: left">然后,他那个就说是预防,预防你的就是预防女生的宫颈癌呀还有这个胸胸什么。</td>
<td style="text-align: left"><audio controls=""><source src="./wavs/Group2/prompt/pG0072_S0477_val_targe_G0072_S0514.wav" type="audio/wav">Your browser does not support the audio element.</audio></td>
<td style="text-align: left"><audio controls=""><source src="./wavs/Group2/Premodel/pG0072_S0477_val_targe_G0072_S0514.wav" type="audio/wav">Your browser does not support the audio element.</audio></td>
<td style="text-align: left"><audio controls=""><source src="./wavs/Group2/OursWlabel/pG0072_S0477_val_targe_G0072_S0514.wav" type="audio/wav">Your browser does not support the audio element.</audio></td>
<td style="text-align: left"><audio controls=""><source src="./wavs/Group2/OursWoLabel/pG0072_S0477_val_targe_G0072_S0514.wav" type="audio/wav">Your browser does not support the audio element.</audio></td>
<td style="text-align: left"><audio controls=""><source src="./wavs/Group2/bsWoLabel/pG0072_S0477_val_targe_G0072_S0514.wav" type="audio/wav">Your browser does not support the audio element.</audio></td>
</tr>
<tr>
<td style="text-align: left">大多数都很少有冲劲了,我最近不是在看那个电视剧嘛,看那个觉醒年代你有看过吗?</td>
<td style="text-align: left"><audio controls=""><source src="./wavs/Group2/prompt/pG0012_S0020_val_targe_G0012_S0454.wav" type="audio/wav">Your browser does not support the audio element.</audio></td>
<td style="text-align: left"><audio controls=""><source src="./wavs/Group2/Premodel/pG0012_S0020_val_targe_G0012_S0454.wav" type="audio/wav">Your browser does not support the audio element.</audio></td>
<td style="text-align: left"><audio controls=""><source src="./wavs/Group2/OursWlabel/pG0012_S0020_val_targe_G0012_S0454.wav" type="audio/wav">Your browser does not support the audio element.</audio></td>
<td style="text-align: left"><audio controls=""><source src="./wavs/Group2/OursWoLabel/pG0012_S0020_val_targe_G0012_S0454.wav" type="audio/wav">Your browser does not support the audio element.</audio></td>
<td style="text-align: left"><audio controls=""><source src="./wavs/Group2/bsWoLabel/pG0012_S0020_val_targe_G0012_S0454.wav" type="audio/wav">Your browser does not support the audio element.</audio></td>
</tr>
</tbody>
</table>
<hr>
<h3 id="ABX">ABX</h3>
<p>
Comparison of BASE-style and PerTTS in spontaneity and naturalness.
</p>
<table>
<thead>
<tr>
<th style="text-align: left">Target Chinese Text</th>
<th style="text-align: left">BASE-style</th>
<th style="text-align: left">PerTTS(Proposed)</th>
</tr>
</thead>
<tbody>
<tr>
<td style="text-align: left">也也给别的地方拉一下旅游,拉拉动一下旅游产业的发展。</td>
<td style="text-align: left"><audio controls=""><source src="./wavs/ABX-abx_Spon/Premodel/pG0012_S0020_val_targe_G0012_S0116.wav" type="audio/wav">Your browser does not support the audio element.</audio></td>
<td style="text-align: left"><audio controls=""><source src="./wavs/ABX-abx_Spon/OursWlabel/pG0012_S0020_val_targe_G0012_S0116.wav" type="audio/wav">Your browser does not support the audio element.</audio></td>
</tr>
<tr>
<td style="text-align: left">对我我这个人也是我现在这个男朋友他老是喜欢打游戏他一打游戏我就感觉。</td>
<td style="text-align: left"><audio controls=""><source src="./wavs/ABX-abx_Spon/Premodel/pG0072_S0477_val_targe_G0072_S0185.wav" type="audio/wav">Your browser does not support the audio element.</audio></td>
<td style="text-align: left"><audio controls=""><source src="./wavs/ABX-abx_Spon/OursWlabel/pG0072_S0477_val_targe_G0072_S0185.wav" type="audio/wav">Your browser does not support the audio element.</audio></td>
</tr>
<tr>
<td style="text-align: left">那么你一个人管得来台球馆吗?</td>
<td style="text-align: left"><audio controls=""><source src="./wavs/ABX-abx_Spon/Premodel/pG0515_S0317_val_targe_G0515_S0009.wav" type="audio/wav">Your browser does not support the audio element.</audio></td>
<td style="text-align: left"><audio controls=""><source src="./wavs/ABX-abx_Spon/OursWlabel/pG0515_S0317_val_targe_G0515_S0009.wav" type="audio/wav">Your browser does not support the audio element.</audio></td>
</tr>
<tr>
<td style="text-align: left">上几百万的粉丝他一次广告可能就要十几万几十万。</td>
<td style="text-align: left"><audio controls=""><source src="./wavs/ABX-abx_Spon/Premodel/pG0293_S0447_val_targe_G0293_S0408.wav" type="audio/wav">Your browser does not support the audio element.</audio></td>
<td style="text-align: left"><audio controls=""><source src="./wavs/ABX-abx_Spon/OursWlabel/pG0293_S0447_val_targe_G0293_S0408.wav" type="audio/wav">Your browser does not support the audio element.</audio></td>
</tr>
</tbody>
</table>
<h1 id="ablation-study">Ablation Study</h1>
<h3 id="investigation on speaker embedding">Investigation on speaker embedding</h3>
<p>
Compare timbre similarity.
</p>
<table>
<thead>
<tr>
<th style="text-align: left">GT</th>
<th style="text-align: left">BASE-style</th>
<th style="text-align: left">without speaker embedding</th>
</tr>
</thead>
<tbody>
<tr>
<td style="text-align: left"><audio controls=""><source src="./wavs/CM-cmos2_timbre_sim/GT/pG0012_S0020_val_targe_G0012_S0276.wav" type="audio/wav">Your browser does not support the audio element.</audio></td>
<td style="text-align: left"><audio controls=""><source src="./wavs/CM-cmos2_timbre_sim/Ours/pG0012_S0020_val_targe_G0012_S0276.wav" type="audio/wav">Your browser does not support the audio element.</audio></td>
<td style="text-align: left"><audio controls=""><source src="./wavs/CM-cmos2_timbre_sim/woSPK/pG0012_S0020_val_targe_G0012_S0276.wav" type="audio/wav">Your browser does not support the audio element.</audio></td>
</tr>
<tr>
<td style="text-align: left"><audio controls=""><source src="./wavs/CM-cmos2_timbre_sim/GT/pG0072_S0477_val_targe_G0072_S0563.wav" type="audio/wav">Your browser does not support the audio element.</audio></td>
<td style="text-align: left"><audio controls=""><source src="./wavs/CM-cmos2_timbre_sim/Ours/pG0072_S0477_val_targe_G0072_S0563.wav" type="audio/wav">Your browser does not support the audio element.</audio></td>
<td style="text-align: left"><audio controls=""><source src="./wavs/CM-cmos2_timbre_sim/woSPK/pG0072_S0477_val_targe_G0072_S0563.wav" type="audio/wav">Your browser does not support the audio element.</audio></td>
</tr>
<tr>
<td style="text-align: left"><audio controls=""><source src="./wavs/CM-cmos2_timbre_sim/GT/pG0333_S0028_val_targe_G0333_S0016.wav" type="audio/wav">Your browser does not support the audio element.</audio></td>
<td style="text-align: left"><audio controls=""><source src="./wavs/CM-cmos2_timbre_sim/Ours/pG0333_S0028_val_targe_G0333_S0016.wav" type="audio/wav">Your browser does not support the audio element.</audio></td>
<td style="text-align: left"><audio controls=""><source src="./wavs/CM-cmos2_timbre_sim/woSPK/pG0333_S0028_val_targe_G0333_S0016.wav" type="audio/wav">Your browser does not support the audio element.</audio></td>
</tr>
<tr>
<td style="text-align: left"><audio controls=""><source src="./wavs/CM-cmos2_timbre_sim/GT/pG0515_S0317_val_targe_G0515_S0350.wav" type="audio/wav">Your browser does not support the audio element.</audio></td>
<td style="text-align: left"><audio controls=""><source src="./wavs/CM-cmos2_timbre_sim/Ours/pG0515_S0317_val_targe_G0515_S0350.wav" type="audio/wav">Your browser does not support the audio element.</audio></td>
<td style="text-align: left"><audio controls=""><source src="./wavs/CM-cmos2_timbre_sim/woSPK/pG0515_S0317_val_targe_G0515_S0350.wav" type="audio/wav">Your browser does not support the audio element.</audio></td>
</tr>
</tbody>
</table>
<h3 id="investigation on bert embedding and duration embedding">Investigation on BERT embedding and duration embedding</h3>
<p>
Compare style similarity.
</p>
<table>
<thead>
<tr>
<th style="text-align: left">Target Chinese Text</th>
<th style="text-align: left">GT</th>
<th style="text-align: left">BASE-style</th>
<th style="text-align: left">without BERT embedding and duration embedding</th>
</tr>
</thead>
<tbody>
<tr>
<td style="text-align: left">地势还是啥反正就是,各种各样的环境都非常多样化,所以它的景色也非常的丰富。</td>
<td style="text-align: left"><audio controls=""><source src="./wavs/CM-cmos1_style_sim/GT/pG0012_S0020_val_targe_G0012_S0040.wav" type="audio/wav">Your browser does not support the audio element.</audio></td>
<td style="text-align: left"><audio controls=""><source src="./wavs/CM-cmos1_style_sim/Ours/pG0012_S0020_val_targe_G0012_S0040.wav" type="audio/wav">Your browser does not support the audio element.</audio></td>
<td style="text-align: left"><audio controls=""><source src="./wavs/CM-cmos1_style_sim/LPE/pG0012_S0020_val_targe_G0012_S0040.wav" type="audio/wav">Your browser does not support the audio element.</audio></td>
</tr>
<tr>
<td style="text-align: left">你说这之后这个人工智能,是不是会越来越便利,就是说体会咱们这个生活的。</td>
<td style="text-align: left"><audio controls=""><source src="./wavs/CM-cmos1_style_sim/GT/pG0209_S0095_val_targe_G0209_S0012.wav" type="audio/wav">Your browser does not support the audio element.</audio></td>
<td style="text-align: left"><audio controls=""><source src="./wavs/CM-cmos1_style_sim/Ours/pG0209_S0095_val_targe_G0209_S0012.wav" type="audio/wav">Your browser does not support the audio element.</audio></td>
<td style="text-align: left"><audio controls=""><source src="./wavs/CM-cmos1_style_sim/LPE/pG0209_S0095_val_targe_G0209_S0012.wav" type="audio/wav">Your browser does not support the audio element.</audio></td>
</tr>
<tr>
<td style="text-align: left">像像我妈那种工工薪阶层基本都是。</td>
<td style="text-align: left"><audio controls=""><source src="./wavs/CM-cmos1_style_sim/GT/pG0293_S0447_val_targe_G0293_S0404.wav" type="audio/wav">Your browser does not support the audio element.</audio></td>
<td style="text-align: left"><audio controls=""><source src="./wavs/CM-cmos1_style_sim/Ours/pG0293_S0447_val_targe_G0293_S0404.wav" type="audio/wav">Your browser does not support the audio element.</audio></td>
<td style="text-align: left"><audio controls=""><source src="./wavs/CM-cmos1_style_sim/LPE/pG0293_S0447_val_targe_G0293_S0404.wav" type="audio/wav">Your browser does not support the audio element.</audio></td>
</tr>
<tr>
<td style="text-align: left">最近看你天天吃泡面啊,然后你那个。</td>
<td style="text-align: left"><audio controls=""><source src="./wavs/CM-cmos1_style_sim/GT/pG0515_S0317_val_targe_G0515_S0373.wav" type="audio/wav">Your browser does not support the audio element.</audio></td>
<td style="text-align: left"><audio controls=""><source src="./wavs/CM-cmos1_style_sim/Ours/pG0515_S0317_val_targe_G0515_S0373.wav" type="audio/wav">Your browser does not support the audio element.</audio></td>
<td style="text-align: left"><audio controls=""><source src="./wavs/CM-cmos1_style_sim/LPE/pG0515_S0317_val_targe_G0515_S0373.wav" type="audio/wav">Your browser does not support the audio element.</audio></td>
</tr>
</tbody>
</table>
<h1 id="case study">Controllability of spontaneous behaviors</h1>
<p><strong>NOTE:</strong> Characters that carry a spontaneous label (in GT) are shown in bold.</p>
<table>
<thead>
<tr>
<th style="text-align: left">Target Chinese Text</th>
<th style="text-align: left">GT</th>
<th style="text-align: left">PerTTS(Proposed)</th>
<th style="text-align: left">PerTTS(w/o label)</th>
<th style="text-align: left">No Label</th>
</tr>
</thead>
<tbody>
<tr>
<td style="text-align: left"><strong>呃</strong>,喜欢泡的还<strong>是</strong>,直接就<strong>是</strong>喝水的。</td>
<td style="text-align: left"><audio controls=""><source src="./wavs/CaseStudy/pG0515_S0317_val_targe_G0515_S0242-gt.wav" type="audio/wav">Your browser does not support the audio element.</audio></td>
<td style="text-align: left"><audio controls=""><source src="./wavs/CaseStudy/pG0515_S0317_val_targe_G0515_S0242-P12pseudoWlabel-ft-woFreeze_vocos.wav" type="audio/wav">Your browser does not support the audio element.</audio></td>
<td style="text-align: left"><audio controls=""><source src="./wavs/CaseStudy/pG0515_S0317_val_targe_G0515_S0242-P12pseudoPred-ft-woFreeze_vocos.wav" type="audio/wav">Your browser does not support the audio element.</audio></td>
<td style="text-align: left"><audio controls=""><source src="./wavs/CaseStudy/pG0515_S0317_val_targe_G0515_S0242-P12pseudo-ft-woFreeze_vocos.wav" type="audio/wav">Your browser does not support the audio element.</audio></td>
</tr>
<tr>
<td style="text-align: left">就就只能干那<strong>种</strong>体力活的,那他肯定就往那些放那那些岗位上面去了。</td>
<td style="text-align: left"><audio controls=""><source src="./wavs/CaseStudy/pG0012_S0020_val_targe_G0012_S0276-gt.wav" type="audio/wav">Your browser does not support the audio element.</audio></td>
<td style="text-align: left"><audio controls=""><source src="./wavs/CaseStudy/pG0012_S0020_val_targe_G0012_S0276-P12pseudoWlabel-ft-woFreeze_vocos.wav" type="audio/wav">Your browser does not support the audio element.</audio></td>
<td style="text-align: left"><audio controls=""><source src="./wavs/CaseStudy/pG0012_S0020_val_targe_G0012_S0276-P12pseudoPred-ft-woFreeze_vocos.wav" type="audio/wav">Your browser does not support the audio element.</audio></td>
<td style="text-align: left"><audio controls=""><source src="./wavs/CaseStudy/pG0012_S0020_val_targe_G0012_S0276-P12pseudo-ft-woFreeze_vocos.wav" type="audio/wav">Your browser does not support the audio element.</audio></td>
</tr>
<tr>
<td style="text-align: left">怎么说呢我觉得这个服务员他这<strong>个</strong>行业啊,<strong>嗯</strong>反正竞争也不是特别大吧,但<strong>是</strong>就很<strong>累</strong>。</td>
<td style="text-align: left"><audio controls=""><source src="./wavs/CaseStudy/pG0234_S0156_val_targe_G0234_S0581-gt.wav" type="audio/wav">Your browser does not support the audio element.</audio></td>
<td style="text-align: left"><audio controls=""><source src="./wavs/CaseStudy/pG0234_S0156_val_targe_G0234_S0581-P12pseudoWlabel-ft-woFreeze_vocos.wav" type="audio/wav">Your browser does not support the audio element.</audio></td>
<td style="text-align: left"><audio controls=""><source src="./wavs/CaseStudy/pG0234_S0156_val_targe_G0234_S0581-P12pseudoPred-ft-woFreeze_vocos.wav" type="audio/wav">Your browser does not support the audio element.</audio></td>
<td style="text-align: left"><audio controls=""><source src="./wavs/CaseStudy/pG0234_S0156_val_targe_G0234_S0581-P12pseudo-ft-woFreeze_vocos.wav" type="audio/wav">Your browser does not support the audio element.</audio></td>
</tr>
<tr>
<td style="text-align: left">啧,我觉得这<strong>个</strong>互联<strong>网上</strong>去混饭呢都<strong>是</strong>看缘分看<strong>天</strong>赏饭吃。</td>
<td style="text-align: left"><audio controls=""><source src="./wavs/CaseStudy/pG0293_S0447_val_targe_G0293_S0413-gt.wav" type="audio/wav">Your browser does not support the audio element.</audio></td>
<td style="text-align: left"><audio controls=""><source src="./wavs/CaseStudy/pG0293_S0447_val_targe_G0293_S0413-P12pseudoWlabel-ft-woFreeze_vocos.wav" type="audio/wav">Your browser does not support the audio element.</audio></td>
<td style="text-align: left"><audio controls=""><source src="./wavs/CaseStudy/pG0293_S0447_val_targe_G0293_S0413-P12pseudoPred-ft-woFreeze_vocos.wav" type="audio/wav">Your browser does not support the audio element.</audio></td>
<td style="text-align: left"><audio controls=""><source src="./wavs/CaseStudy/pG0293_S0447_val_targe_G0293_S0413-P12pseudo-ft-woFreeze_vocos.wav" type="audio/wav">Your browser does not support the audio element.</audio></td>
</tr>
</tbody>
</table>
</section>
</div>
<!-- FOOTER -->
<div id="footer_wrap" class="outer">
<footer class="inner">
<p class="copyright">PerTTS: Personalized and Controllable Zero-shot Spontaneous Style Text-to-Speech Synthesis maintained by <a href="https://github.com/kangkangready">kangkangready</a></p>
<p>Published with <a href="https://pages.github.com">GitHub Pages</a></p>
</footer>
</div>
</body></html>