data_prep format

Hello!
Question:
In data_prep, if I use --concat_tokens k, all of my data is concatenated and split into chunks of k tokens. But what if I want to take each sample from my data separately and either truncate it to max_tokens or pad it up to max_tokens? How can this be done in llm-foundry?

Current behavior with --concat_tokens 2:
["some", "text"] -> ["so", "me", "te", "xt"]
What I want, with max_len=3:
["some", "text", "h"] -> ["som", "tex", "h<pad><pad>"]
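
To make the two behaviors concrete, here is a minimal plain-Python sketch that treats each character as one token, as in the example above. The function names `concat_chunks` and `truncate_or_pad` and the `PAD` constant are hypothetical and only illustrate the idea; they are not part of llm-foundry, and keeping the trailing partial chunk in `concat_chunks` is a choice made for this sketch only.

```python
from typing import List

PAD = "<pad>"  # hypothetical pad token, for illustration only


def concat_chunks(samples: List[str], k: int) -> List[str]:
    """Join all samples together, then cut the result into chunks of k tokens."""
    joined = "".join(samples)
    # A trailing chunk shorter than k is kept here (assumption of this sketch).
    return [joined[i:i + k] for i in range(0, len(joined), k)]


def truncate_or_pad(samples: List[str], max_len: int) -> List[str]:
    """Handle each sample on its own: truncate to max_len, or pad up to max_len."""
    result = []
    for sample in samples:
        kept = sample[:max_len]  # truncate samples that are too long
        missing = max_len - len(kept)  # number of pad tokens still needed
        result.append(kept + PAD * missing)
    return result


print(concat_chunks(["some", "text"], 2))
print(truncate_or_pad(["some", "text", "h"], 3))
```

The second function is the behavior I am asking about: sample boundaries are preserved, so no output row mixes tokens from two different samples.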

I know this is not useful for pretraining LLMs, but I also can't find this kind of data_prep for SFT in llm-foundry.


data_prep format #1785
