Clarification on Upsampling Strategy for Aggregated Long Texts #7

Hello, and thank you for releasing the LongWanjuan dataset and the accompanying paper. It is a very valuable contribution to the community.

I have a question regarding the data mixture recipe described in the paper. The paper mentions that chaotic long texts are removed and aggregated long texts are upsampled to mitigate imbalance between holistic and aggregated texts. However, I could not find details on the exact upsampling procedure:

By what ratio (or method) are aggregated texts upsampled relative to holistic texts?

Is the upsampling done by simple duplication, weighted sampling during training, or another approach?
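To make the two options I am asking about concrete, here is a minimal Python sketch of what I mean by "simple duplication" versus "weighted sampling". This is only my own illustration, not the procedure from the paper: the function names, the `factor` and `weight` parameters, and the toy document lists are all hypothetical.

```python
import random


def upsample_by_duplication(holistic, aggregated, factor):
    """Static upsampling: repeat every aggregated document `factor` times.

    `factor` is a hypothetical integer ratio; the paper does not state one.
    """
    return list(holistic) + list(aggregated) * factor


def sample_weighted(holistic, aggregated, weight, n, seed=0):
    """Dynamic upsampling: draw `n` documents with replacement, where each
    aggregated document is `weight` times as likely to be drawn as a
    holistic one. `weight` is again a hypothetical value.
    """
    rng = random.Random(seed)
    docs = list(holistic) + list(aggregated)
    weights = [1.0] * len(holistic) + [float(weight)] * len(aggregated)
    return rng.choices(docs, weights=weights, k=n)


# Toy corpus, for illustration only.
holistic_docs = ["h1", "h2"]
aggregated_docs = ["a1"]

duplicated = upsample_by_duplication(holistic_docs, aggregated_docs, factor=3)
drawn = sample_weighted(holistic_docs, aggregated_docs, weight=3, n=10)
```

In the first variant the corpus itself grows and every epoch sees the same duplicates; in the second the corpus is unchanged and only the sampling probabilities differ. Knowing which of these (if either) matches your setup, and with what ratio, is what I am hoping to learn.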

It would be great if you could clarify this point, as it would help in better understanding and reproducing your results.

Thank you very much for your time and for sharing this work with the community!
