data_prep format #1785

Open
@tsebaka

Hello!
Question:
In data_prep, if I use --concat_tokens k, all of my data is concatenated and split into chunks of k tokens. What if I instead want to take each sample from my data separately and either truncate it to max_tokens or pad it up to max_tokens? How can this be done in llm-foundry?

--concat_tokens 2
["some", "text"] -> ["so", "me", "te", "xt"]
I want:
max_len=3
["some", "text", "h"] -> ["som", "tex", "h"]

I know this is not useful for pretraining LLMs, but I also can't find this kind of data_prep for SFT in llm-foundry.
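
To make the difference concrete, here is a minimal plain-Python sketch of the two behaviours. It is not llm-foundry code: `concat_and_chunk`, `truncate_or_pad`, and `pad_id` are hypothetical names, characters stand in for tokens as in the example above, and dropping a leftover partial chunk is an assumption.

```python
from typing import List, TypeVar

T = TypeVar("T")


def concat_and_chunk(samples: List[List[T]], k: int) -> List[List[T]]:
    """Join all samples into one token stream, then cut it into chunks of k tokens."""
    stream = [tok for sample in samples for tok in sample]
    # Assumption: a trailing remainder shorter than k is dropped.
    return [stream[i:i + k] for i in range(0, len(stream) - k + 1, k)]


def truncate_or_pad(samples: List[List[T]], max_len: int, pad_id: T) -> List[List[T]]:
    """Handle each sample on its own: cut it to max_len, or pad it up to max_len."""
    out = []
    for sample in samples:
        clipped = sample[:max_len]
        # Append pad tokens only when the sample is shorter than max_len.
        out.append(clipped + [pad_id] * (max_len - len(clipped)))
    return out


# Characters play the role of tokens here.
chunks = concat_and_chunk([list("some"), list("text")], 2)
print(["".join(c) for c in chunks])   # ['so', 'me', 'te', 'xt']

fixed = truncate_or_pad([list("some"), list("text"), list("h")], 3, "_")
print(["".join(s) for s in fixed])    # ['som', 'tex', 'h__']
```

The second function is the behaviour I am asking about: sample boundaries are kept, and every output row has exactly max_len tokens.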
