Clarification on Speech Units Exceeding 1000 in the Released Dataset

Hi, thank you for releasing this valuable speech-text [dataset](https://huggingface.co/datasets/ICTNLP/InstructS2S-200K). I have a question regarding the speech unit representation.

According to the paper, the speech units are derived from HuBERT features and clustered into 1000 discrete units ("To extract discrete units corresponding to the target speech, we use a pre-trained K-means quantizer, which has learned 1000 clusters from the HuBERT features."). 

However, in the released dataset, many unit values exceed 1000, with values as high as 6537 in the sample below. With 1000 clusters I would expect indices in the range 0 to 999. For example:

```
[
{
"from": "human",
"speech": "data/multiturn_instruction/en/wav/instruct_en_3/instruct_en_3-1-user.wav",
"text": "Hey, can you tell me how we can, like, reduce air pollution?",
"unit": null
},
{
"from": "gpt",
"speech": "data/multiturn_instruction/en/wav/instruct_en_3/instruct_en_3-1-assistant.wav",
"text": "To reduce air pollution, use public transport, walk or bike, avoid burning fossil fuels, and recycle. Also, use energy-efficient appliances and turn them off when not in use.",
"unit": "<2957><4299><4299><4299><4299><2058><60><3651><5835><5835><5835><3648><3645><4056><4299><2112><4677><2382><6071><2152><3884><2672><546><403><6317><5580><4441><5726><1524><1978><359><4787><6029><6049><2160><2166><3483><4879><4207><3476><3314><1940><3712><6048><6051><3054><116><2204><3670><1970><4157><5642><6074><6311><5179><4671><2112><2160><5346><5356><2810><2184><2826><4691><231><4299><60><6051><4050><6032><4787><4382><4393><4182><5648><1164><2166><3567><692><1406><1676><1688><1762><4299><1959><4287><4372><2906><5030><6401><6401><3402><489><1336><2059><1457><6536><3863><2178><4299><1431><5753><5834><2195><305><638><2167><1707><165><1869><3648><5832><5835><3648><1701><1459><4299><1950><1244><2810><2097><6537><3640><3644><2897><221><4859><1941><2166><2160><4106><6050><6031><6015><4534><5074><6424><4176><6534><6537><2837><3491><6323><4700><4644><4996><4995><2831><5093><6381><6510><6456><3111><5020><6070><4289><1457><719><626><2822><3475><5725><5643><1762><1951><2924><5111><3667><4340><1685><5173><4753><6374><2727><5752><3649><2627><2625><975><423><548><629><2915><4308><6486><4299><4299><4218><4299><4218><4299><3975><3984><4299><4299><1761><1488><4299><4299><6405><1949><6404><4949><5030><5088><5643><4916><4914><308><299><644><1454><1772><5668><3480><6073><4043><4437><2493><1763><1736><5192><4382><3666><6032><6032><1500><6051><5587><5100><2006><2420><3672><6534><6537><4023><4595><4525><6048><6051><3051><4478><3831><4431><389><2163><2166><3510><2850><5048><5675><6329><5279><4796><2447><749><4200><5645><5644><1532><803><2990><4672><5400><1788><1736><2924><2209><1887><2490><3108><3735><6266><6050><4557><6099><2224><1152><4314><662><5755><5752><4133><2027><4239><6534><6510><1430><677><187><3752><3669><3728><6401><6401><1463><2005><62><2211><4156><4156><4913><5588><6074><3071><1775><1856><5211><4752><4672><4671><5643><1488><4299><4299><4299>"
},
...]
```
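For reference, here is a minimal sketch of how the unit range can be checked. It assumes the `unit` field is always a string of `<N>` tokens as in the sample above. The helper name `parse_units` and the shortened `sample` string are my own, not part of the dataset tooling.

```python
import re


def parse_units(unit_str):
    """Parse a unit string such as "<12><345>" into a list of integers."""
    return [int(u) for u in re.findall(r"<(\d+)>", unit_str)]


# Shortened excerpt of the "unit" field from the assistant turn above.
sample = "<2957><4299><60><6537>"

units = parse_units(sample)

# With a 1000-cluster quantizer, valid indices would be 0..999.
out_of_range = [u for u in units if u >= 1000]

print("max unit:", max(units))
print("units outside 0..999:", len(out_of_range), "of", len(units))
```

Running the same check over the full `unit` strings would show the actual vocabulary size used in the release.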

Could you please clarify:

- Which model was used to generate these speech units?
- Were more than 1000 clusters used during quantization? If so, how does the setup differ from the one described in the paper?

Looking forward to your clarification. Thanks again for your great work!

Best regards,



Clarification on Speech Units Exceeding 1000 in the Released Dataset #66
