Data filtering on reasoning requirement #2

@tmabraham

Papers like BioMed-R1 discuss how many training examples are knowledge-heavy rather than reasoning-heavy, and how filtering out such examples can help training.

Write a lightweight script that, given a HuggingFace dataset like https://huggingface.co/datasets/open-thoughts/OpenThoughts2-1M or https://huggingface.co/datasets/GeneralReasoning/GeneralThought-430K, filters or tags the samples that require reasoning rather than just knowledge.
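A rough starting point for such a script might look like the sketch below. Everything in it is an assumption for illustration: the cue lists, the `needs_reasoning` heuristic, the `requires_reasoning` tag name, and the `question` column name are hypothetical and are not taken from BioMed-R1 or from either dataset's schema. The keyword heuristic is only a placeholder judge; a small classifier or an LLM-based judge could be passed in through the `judge` argument instead.

```python
import re
from typing import Callable, Dict, Iterable, Iterator

# Hypothetical cue lists (placeholders, not taken from any paper).
RECALL_CUES = re.compile(
    r"^\s*(what is|who (is|was)|when (did|was)|where is|define|name the)\b",
    re.IGNORECASE,
)
REASONING_CUES = re.compile(
    r"\b(why|how many|how much|calculate|compute|prove|derive|explain|most likely)\b",
    re.IGNORECASE,
)


def needs_reasoning(question: str) -> bool:
    """Crude placeholder judge: True if the question looks reasoning-heavy."""
    if REASONING_CUES.search(question):
        return True
    if RECALL_CUES.search(question):
        return False
    # Assumption: multi-sentence prompts usually describe a scenario to work through.
    return len(re.findall(r"[.?!]", question)) >= 3


def tag_examples(
    examples: Iterable[Dict],
    question_key: str = "question",  # assumed column name; adjust per dataset
    judge: Callable[[str], bool] = needs_reasoning,
) -> Iterator[Dict]:
    """Yield each example with an added boolean 'requires_reasoning' tag."""
    for ex in examples:
        yield {**ex, "requires_reasoning": judge(str(ex.get(question_key, "")))}


def filter_hf_dataset(name: str, split: str = "train", question_key: str = "question"):
    """Stream a Hugging Face dataset and keep only examples tagged as reasoning-heavy."""
    from datasets import load_dataset  # imported lazily so the tagging code runs without it

    ds = load_dataset(name, split=split, streaming=True)
    ds = ds.map(
        lambda ex: {"requires_reasoning": needs_reasoning(str(ex.get(question_key, "")))}
    )
    return ds.filter(lambda ex: ex["requires_reasoning"])
```

Tagging first and filtering second keeps both options from the issue open: the tag can be stored as a column for later analysis, or used immediately to drop knowledge-only samples. Streaming avoids downloading a million-row dataset just to try out a heuristic.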
