Description
Search before asking
- I searched the issues and found no similar issues.
Component
- We would like to add a Bloom annotator transform that maps a non-empty input table to an output table with an added is_in_GneissWeb column. Each row in the table corresponds to a document and its UUID. The transform checks whether each document's UUID exists in the GneissWeb Bloom filter.
Feature
- The Bloom Annotator transform assigns a label of 1 if the document's UUID is found in the GneissWeb Bloom filter and 0 otherwise. The resulting column shows which documents in FineWeb are also present in GneissWeb and which are not. The GneissWeb Bloom filter is just one use case; the Bloom Annotator transform can work with any Bloom filter.
- Please refer to the README file submitted in the PR for examples.
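
As a rough illustration of the labeling behavior described above, here is a minimal, self-contained Python sketch. It is not the code from the PR: `ToyBloomFilter`, `annotate`, the `id` and `contents` column names, and the list-of-dicts table are all hypothetical stand-ins, and the real transform presumably loads a prebuilt GneissWeb Bloom filter and operates on a proper table type.

```python
import hashlib


class ToyBloomFilter:
    """Illustrative Bloom filter; NOT the GneissWeb filter or its on-disk format."""

    def __init__(self, num_bits=1 << 16, num_hashes=4):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8 + 1)

    def _positions(self, key):
        # Derive num_hashes bit positions by salting SHA-256 with the hash index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, key):
        # A key is reported present only if every one of its bits is set.
        return all(
            self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(key)
        )


def annotate(rows, bloom, id_column="id", label_column="is_in_GneissWeb"):
    """Return copies of the rows with a 1/0 label for Bloom filter membership."""
    # Column names are assumptions made for this sketch.
    return [{**row, label_column: int(row[id_column] in bloom)} for row in rows]


# Hypothetical filter contents and input table.
bloom = ToyBloomFilter()
bloom.add("uuid-a")
bloom.add("uuid-c")

table = [
    {"id": "uuid-a", "contents": "first document"},
    {"id": "uuid-b", "contents": "second document"},
]
annotated = annotate(table, bloom)
for row in annotated:
    print(row["id"], row["is_in_GneissWeb"])
```

Because a Bloom filter can return false positives but never false negatives, a label of 1 is best read as "very likely present", while a label of 0 means the UUID was definitely not added to the filter.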
Are you willing to submit a PR?
- Yes, I am willing to submit a PR!