Skip to content

Rep_removal for large data files crashes on 16GB memory #1035

Open
@shahrokhDaijavad

Description

@shahrokhDaijavad

Search before asking

  • I searched the issues and found no similar issues.

Component

Other, Transforms/Other

What happened + What you expected to happen

from dpk_rep_removal.runtime import RepRemoval
RepRemoval(input_folder= os.path.dirname(file1),
            output_folder= "files-rep_removal",
            rep_removal_contents_column_name='text', 
            rep_removal_num_threads=1,
            ).transform()

12:11:53 INFO - pipeline id pipeline_id
12:11:53 INFO - code location None
12:11:53 INFO - data factory data_ is using local data access: input_folder - [/Users/shahrokhdaijavad/.cache/huggingface/hub/datasets--HuggingFaceFW--fineweb/snapshots/0f039043b23fe1d4eed300b504aa4b4a68f1c7ba/data/CC-MAIN-2013-20](http://localhost:8888/Users/shahrokhdaijavad/.cache/huggingface/hub/datasets--HuggingFaceFW--fineweb/snapshots/0f039043b23fe1d4eed300b504aa4b4a68f1c7ba/data/CC-MAIN-2013-20) output_folder - files-rep_removal
12:11:53 INFO - data factory data_ max_files -1, n_sample -1
12:11:53 INFO - data factory data_ Not using data sets, checkpointing False, max files -1, random samples -1, files to use ['.parquet'], files to checkpoint ['.parquet']
12:11:53 INFO - orchestrator rep_removal started at 2025-02-10 12:11:53
12:11:53 INFO - Number of files is 1, source profile {'max_file_size': 2048.0454998016357, 'min_file_size': 2048.0454998016357, 'total_file_size': 2048.0454998016357}
12:12:16 INFO - encoding parquet
12:51:53 INFO - making suffix array
12:51:53 INFO - Starting the deduplication process for file: [/var/folders/7f/dcj_kvt1153fqpphsj6jj8w40000gn/T/tmp2_wwama8/save_dir/parquet](http://localhost:8888/var/folders/7f/dcj_kvt1153fqpphsj6jj8w40000gn/T/tmp2_wwama8/save_dir/parquet)

cpu speed: 3228 MHz, Cores: 10

12:51:53 INFO - timeout is: 45743.31654275093
12:51:53 INFO - Scheduling 96 jobs to create dataset parts.

gpu_usage: 0.00%, GPU speed: 0 MHz

Reproduction script

Run the following on a Mac M1 with 16GB memory

REPO_ID = "HuggingFaceFW/fineweb"
FILENAME = "data/CC-MAIN-2013-20/000_00000.parquet"
file1=hf_hub_download(repo_id=REPO_ID, filename=FILENAME, repo_type="dataset")
from dpk_rep_removal.runtime import RepRemoval
RepRemoval(input_folder= os.path.dirname(file1),
            output_folder= "files-rep_removal",
            rep_removal_contents_column_name='text', 
            rep_removal_num_threads=1,
            ).transform()

Anything else

No response

OS

MacOS (limited support)

Python

3.10.x

Are you willing to submit a PR?

  • Yes I am willing to submit a PR!

Activity

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Metadata

Metadata

Assignees

No one assigned

    Labels

    bugSomething isn't working

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions