Description
Hello, thank you for releasing the LongWanjuan dataset and the accompanying paper — it is a very valuable contribution to the community.
I have a question regarding the data mixture recipe described in the paper. The paper mentions that chaotic long texts are removed and aggregated long texts are upsampled to mitigate imbalance between holistic and aggregated texts. However, I could not find details on the exact upsampling procedure:
- By what ratio (or method) are aggregated texts upsampled relative to holistic texts?
- Is the upsampling done by simple duplication, weighted sampling during training, or another approach?
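To make the question concrete, here is a rough sketch of the two options I have in mind. The function names and the `factor` and `aggregated_weight` parameters are my own placeholders for illustration; none of them come from the paper or the repository, and the actual procedure may differ from both.

```python
import random


def upsample_by_duplication(holistic, aggregated, factor):
    """Option 1 (hypothetical): repeat each aggregated document `factor` times
    when building the training pool, so the pool itself is rebalanced."""
    return list(holistic) + list(aggregated) * factor


def sample_with_weights(holistic, aggregated, aggregated_weight, num_samples, seed=0):
    """Option 2 (hypothetical): keep one copy of each document, but draw
    aggregated documents with a higher per-document weight during training."""
    rng = random.Random(seed)
    docs = list(holistic) + list(aggregated)
    # Holistic documents get weight 1.0; aggregated ones get `aggregated_weight`.
    weights = [1.0] * len(holistic) + [aggregated_weight] * len(aggregated)
    return rng.choices(docs, weights=weights, k=num_samples)
```

With duplication, the ratio is fixed once when the dataset is built. With weighted sampling, the ratio is a training-time setting and the stored dataset stays unchanged. Knowing which of these (if either) was used, and with what value, is what I am hoping to learn.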
It would be great if you could clarify this point, as it would help in better understanding and reproducing your results.
Thank you very much for your time and for sharing this work with the community!