Clarification on Speech Units Exceeding 1000 in the Released Dataset #66

@soaring0616


Hi, thank you for releasing this valuable speech-text dataset. I have a question regarding the speech unit representation.

According to the paper, the speech units are derived from HuBERT features and clustered into 1000 discrete units ("To extract discrete units corresponding to the target speech, we use a pre-trained K-means quantizer, which has learned 1000 clusters from the HuBERT features.").

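However, in the released dataset, many unit values exceed 1000 (the sample below goes as high as 6537), whereas a 1000-cluster quantizer would be expected to emit at most 1000 distinct indices. For example: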

[
{
"from": "human",
"speech": "data/multiturn_instruction/en/wav/instruct_en_3/instruct_en_3-1-user.wav",
"text": "Hey, can you tell me how we can, like, reduce air pollution?",
"unit": null
},
{
"from": "gpt",
"speech": "data/multiturn_instruction/en/wav/instruct_en_3/instruct_en_3-1-assistant.wav",
"text": "To reduce air pollution, use public transport, walk or bike, avoid burning fossil fuels, and recycle. Also, use energy-efficient appliances and turn them off when not in use.",
"unit": "<2957><4299><4299><4299><4299><2058><60><3651><5835><5835><5835><3648><3645><4056><4299><2112><4677><2382><6071><2152><3884><2672><546><403><6317><5580><4441><5726><1524><1978><359><4787><6029><6049><2160><2166><3483><4879><4207><3476><3314><1940><3712><6048><6051><3054><116><2204><3670><1970><4157><5642><6074><6311><5179><4671><2112><2160><5346><5356><2810><2184><2826><4691><231><4299><60><6051><4050><6032><4787><4382><4393><4182><5648><1164><2166><3567><692><1406><1676><1688><1762><4299><1959><4287><4372><2906><5030><6401><6401><3402><489><1336><2059><1457><6536><3863><2178><4299><1431><5753><5834><2195><305><638><2167><1707><165><1869><3648><5832><5835><3648><1701><1459><4299><1950><1244><2810><2097><6537><3640><3644><2897><221><4859><1941><2166><2160><4106><6050><6031><6015><4534><5074><6424><4176><6534><6537><2837><3491><6323><4700><4644><4996><4995><2831><5093><6381><6510><6456><3111><5020><6070><4289><1457><719><626><2822><3475><5725><5643><1762><1951><2924><5111><3667><4340><1685><5173><4753><6374><2727><5752><3649><2627><2625><975><423><548><629><2915><4308><6486><4299><4299><4218><4299><4218><4299><3975><3984><4299><4299><1761><1488><4299><4299><6405><1949><6404><4949><5030><5088><5643><4916><4914><308><299><644><1454><1772><5668><3480><6073><4043><4437><2493><1763><1736><5192><4382><3666><6032><6032><1500><6051><5587><5100><2006><2420><3672><6534><6537><4023><4595><4525><6048><6051><3051><4478><3831><4431><389><2163><2166><3510><2850><5048><5675><6329><5279><4796><2447><749><4200><5645><5644><1532><803><2990><4672><5400><1788><1736><2924><2209><1887><2490><3108><3735><6266><6050><4557><6099><2224><1152><4314><662><5755><5752><4133><2027><4239><6534><6510><1430><677><187><3752><3669><3728><6401><6401><1463><2005><62><2211><4156><4156><4913><5588><6074><3071><1775><1856><5211><4752><4672><4671><5643><1488><4299><4299><4299>"
},
...]
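
For reference, a check along the following lines reproduces the observation. It is only a minimal sketch: `parse_units`, `summarize_units`, and the `codebook_size` parameter are hypothetical names chosen for illustration, and it assumes the `unit` field is either null or a string of `<N>` tokens as in the sample above. The `sample` string is just the first eight units of that assistant turn.

```python
import re


def parse_units(unit_string):
    """Turn a string like "<2957><4299>" into [2957, 4299]; null/empty gives []."""
    if not unit_string:
        return []
    return [int(value) for value in re.findall(r"<(\d+)>", unit_string)]


def summarize_units(unit_string, codebook_size=1000):
    """Report basic statistics and how many units fall outside 0..codebook_size-1."""
    units = parse_units(unit_string)
    if not units:
        return {"count": 0, "min": None, "max": None, "distinct": 0, "out_of_range": 0}
    return {
        "count": len(units),
        "min": min(units),
        "max": max(units),
        "distinct": len(set(units)),
        # Assumes zero-based cluster indices, so valid ids would be 0..999.
        "out_of_range": sum(1 for u in units if u >= codebook_size),
    }


# First eight units of the assistant turn quoted above.
sample = "<2957><4299><4299><4299><4299><2058><60><3651>"
print(summarize_units(sample))
```

Running the same summary over every `unit` field in the dataset would show the actual index range and the number of distinct units, which may help pin down the codebook size that was really used.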

Could you please clarify:

  • Which model was used to generate these speech units?
  • Were more than 1000 clusters used during quantization? If so, how does that setup differ from the one described in the paper?

Looking forward to your clarification. Thanks again for your great work!

Best regards,
