Identifying specific documents used up to a specific training checkpoint

Hi there!

Thank you so much for all your work; it is incredibly helpful for the community.

I have a question about the pre-training phase of OLMo-3: is there any way to identify which documents the model had been trained on up to a specific checkpoint?

More specifically, I’m interested in the Wikipedia articles used during training. If I download a specific checkpoint (for example, checkpoint 3000), is there a way to know exactly which documents from the [Wikipedia](https://github.com/allenai/OLMo-core/blob/c757b7c3c15197154c753d883330afbfa4869dcc/src/olmo_core/data/mixes/OLMo-mix-0625-official.txt#L1009-L1016) subset were seen by the model up to that point?

Thanks in advance for your help!

Best regards,

Mario


Identifying specific documents used up to a specific training checkpoint #647
