Hi, thank you for releasing this valuable speech-text dataset. I have a question regarding the speech unit representation.
According to the paper, the speech units are derived from HuBERT features and clustered into 1000 discrete units ("To extract discrete units corresponding to the target speech, we use a pre-trained K-means quantizer, which has learned 1000 clusters from the HuBERT features.").
However, many unit values in the released dataset are far larger than 1000 clusters would allow (the sample below contains values as high as 6537). For example:
[
{
"from": "human",
"speech": "data/multiturn_instruction/en/wav/instruct_en_3/instruct_en_3-1-user.wav",
"text": "Hey, can you tell me how we can, like, reduce air pollution?",
"unit": null
},
{
"from": "gpt",
"speech": "data/multiturn_instruction/en/wav/instruct_en_3/instruct_en_3-1-assistant.wav",
"text": "To reduce air pollution, use public transport, walk or bike, avoid burning fossil fuels, and recycle. Also, use energy-efficient appliances and turn them off when not in use.",
"unit": "<2957><4299><4299><4299><4299><2058><60><3651><5835><5835><5835><3648><3645><4056><4299><2112><4677><2382><6071><2152><3884><2672><546><403><6317><5580><4441><5726><1524><1978><359><4787><6029><6049><2160><2166><3483><4879><4207><3476><3314><1940><3712><6048><6051><3054><116><2204><3670><1970><4157><5642><6074><6311><5179><4671><2112><2160><5346><5356><2810><2184><2826><4691><231><4299><60><6051><4050><6032><4787><4382><4393><4182><5648><1164><2166><3567><692><1406><1676><1688><1762><4299><1959><4287><4372><2906><5030><6401><6401><3402><489><1336><2059><1457><6536><3863><2178><4299><1431><5753><5834><2195><305><638><2167><1707><165><1869><3648><5832><5835><3648><1701><1459><4299><1950><1244><2810><2097><6537><3640><3644><2897><221><4859><1941><2166><2160><4106><6050><6031><6015><4534><5074><6424><4176><6534><6537><2837><3491><6323><4700><4644><4996><4995><2831><5093><6381><6510><6456><3111><5020><6070><4289><1457><719><626><2822><3475><5725><5643><1762><1951><2924><5111><3667><4340><1685><5173><4753><6374><2727><5752><3649><2627><2625><975><423><548><629><2915><4308><6486><4299><4299><4218><4299><4218><4299><3975><3984><4299><4299><1761><1488><4299><4299><6405><1949><6404><4949><5030><5088><5643><4916><4914><308><299><644><1454><1772><5668><3480><6073><4043><4437><2493><1763><1736><5192><4382><3666><6032><6032><1500><6051><5587><5100><2006><2420><3672><6534><6537><4023><4595><4525><6048><6051><3051><4478><3831><4431><389><2163><2166><3510><2850><5048><5675><6329><5279><4796><2447><749><4200><5645><5644><1532><803><2990><4672><5400><1788><1736><2924><2209><1887><2490><3108><3735><6266><6050><4557><6099><2224><1152><4314><662><5755><5752><4133><2027><4239><6534><6510><1430><677><187><3752><3669><3728><6401><6401><1463><2005><62><2211><4156><4156><4913><5588><6074><3071><1775><1856><5211><4752><4672><4671><5643><1488><4299><4299><4299>"
},
...]
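For reference, here is a minimal sketch of how the range of unit values in a `unit` string can be checked. It assumes the units are serialized as `<N>` tokens, as in the sample above, and that a 1000-cluster quantizer would emit IDs from 0 to 999. The helper names `parse_units` and `summarize_units` are my own and are not part of the dataset tooling.

```python
import re


def parse_units(unit_str):
    """Extract integer unit IDs from a string such as '<2957><4299><60>'."""
    return [int(u) for u in re.findall(r"<(\d+)>", unit_str)]


def summarize_units(unit_str, n_clusters=1000):
    """Report the ID range and how many IDs fall outside [0, n_clusters)."""
    ids = parse_units(unit_str)
    if not ids:
        return None  # nothing to summarize, e.g. the "unit" field was empty
    return {
        "count": len(ids),
        "min": min(ids),
        "max": max(ids),
        "out_of_range": sum(1 for u in ids if u >= n_clusters),
    }


# First few units from the assistant turn shown above.
sample = "<2957><4299><4299><4299><4299><2058><60><3651><5835>"
print(summarize_units(sample))
```

On this short prefix, 8 of the 9 unit IDs are at or above 1000, so the out-of-range values are not isolated outliers.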
Could you please clarify:
- Which model was used to generate these speech units?
- Were more than 1000 clusters used during quantization? If so, how does the quantizer used for the released data differ from the one described in the paper?
Looking forward to your clarification. Thanks again for your great work!
Best regards,