Identifying specific documents used up to a specific training checkpoint #647

@mario-sanz

Hi there!

Thank you so much for all your work; it is incredibly helpful for the community.

I have a question: during the pre-training phase of OLMo-3, is there any way to identify which documents the model was trained on up to a specific checkpoint?

More specifically, I’m interested in the Wikipedia articles used during training. If I download a specific checkpoint (for example, checkpoint 3000), is there a way to know exactly which documents from the Wikipedia subset were seen by the model up to that point?

Thanks in advance for your help!

Best regards,

Mario
