Bloom annotator implementation for GneissWeb data #981

Open
@shahrokhDaijavad

Description

Search before asking

  • I searched the issues and found no similar issues.

Component

  • We would like to add a Bloom annotator transform that maps a non-empty input table to an output table with an added is_in_GneissWeb column. Each row in the table holds a document and its UUID. The transform checks whether each document's UUID is present in the GneissWeb Bloom filter.

Feature

  • The Bloom annotator transform assigns a label of 1 if the document's UUID is found in the GneissWeb Bloom filter and 0 otherwise. The labels show which FineWeb documents are also present in GneissWeb and which are not. The GneissWeb Bloom filter is only one use case; the transform can work with any Bloom filter.

  • Please refer to the README file submitted in the PR for examples.
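
For readers without the PR at hand, the labelling behaviour described above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the code from the PR: the toy `SimpleBloomFilter` class, the `annotate` helper, and the `id` column name are all hypothetical, and the table is modelled as a plain list of dicts rather than whatever table type the transform actually uses. Only the `is_in_GneissWeb` column name and the 1/0 labels come from the issue text.

```python
import hashlib
from typing import Dict, Iterator, List


class SimpleBloomFilter:
    """Toy Bloom filter (hypothetical stand-in for the real GneissWeb filter)."""

    def __init__(self, num_bits: int = 1 << 16, num_hashes: int = 4) -> None:
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8 + 1)

    def _positions(self, key: str) -> Iterator[int]:
        # Derive num_hashes bit positions by salting SHA-256 with an index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key: str) -> None:
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, key: str) -> bool:
        # True only if every derived bit is set; false positives are possible,
        # false negatives are not.
        return all(
            self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(key)
        )


def annotate(
    rows: List[Dict[str, str]],
    bloom: SimpleBloomFilter,
    id_column: str = "id",  # assumed name of the UUID column
    out_column: str = "is_in_GneissWeb",
) -> List[Dict[str, object]]:
    """Return a copy of rows with a 1/0 membership label added to each row."""
    if not rows:
        raise ValueError("input table must be non-empty")
    return [{**row, out_column: 1 if row[id_column] in bloom else 0} for row in rows]


# Usage: two UUIDs are registered in the filter, a third is not.
bloom = SimpleBloomFilter()
bloom.add("uuid-a")
bloom.add("uuid-b")

table = [
    {"id": "uuid-a", "text": "first document"},
    {"id": "uuid-z", "text": "unseen document"},
]
annotated = annotate(table, bloom)
```

Because a Bloom filter can report false positives, a label of 1 in a sketch like this means "probably present", while a label of 0 means "definitely absent". Taking the filter as a parameter reflects the point that the transform is not tied to the GneissWeb filter.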

Are you willing to submit a PR?

  • Yes, I am willing to submit a PR!
