Hello!
Question:
In data_prep, if I use --concat_tokens k, all of my data is concatenated and split into chunks of k tokens. What if I instead want to take each sample from my data on its own and either truncate it to max_tokens or pad it up to max_tokens? How can this be done in llm-foundry?
Current behavior with --concat_tokens 2 (each character stands in for one token):
["some", "text"] -> ["so", "me", "te", "xt"]
What I want, with max_len=3:
["some", "text", "h"] -> ["som", "tex", "h"]
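To make the request concrete, here is a minimal plain-Python sketch of the per-sample behavior I mean. It is not llm-foundry code: `truncate_or_pad`, `max_len`, and `pad_id` are hypothetical names I made up for illustration, and the token IDs are placeholders rather than real tokenizer output.

```python
from typing import List


def truncate_or_pad(tokens: List[int], max_len: int, pad_id: int = 0) -> List[int]:
    """Return exactly max_len token IDs for one sample (hypothetical helper).

    Samples longer than max_len are cut off at max_len; shorter samples
    are right-padded with pad_id. Samples are never merged with each other.
    """
    clipped = tokens[:max_len]
    return clipped + [pad_id] * (max_len - len(clipped))


# Three already-tokenized samples (placeholder IDs).
samples = [[11, 12, 13, 14], [21, 22, 23, 24], [31]]

# Each sample is handled independently, unlike concatenate-and-chunk.
prepared = [truncate_or_pad(s, max_len=3) for s in samples]
print(prepared)
```

With these inputs, the first two samples lose their last token and the third is padded with two `pad_id` values, so every output row has length 3.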
I know this is not useful for pretraining LLMs, but for SFT I also can't find this kind of data_prep in llm-foundry.