Open
Description
Search before asking
- I searched the issues and found no similar issues.
Component
Other, Transforms/Other
What happened + What you expected to happen
from dpk_rep_removal.runtime import RepRemoval
RepRemoval(
    input_folder=os.path.dirname(file1),
    output_folder="files-rep_removal",
    rep_removal_contents_column_name="text",
    rep_removal_num_threads=1,
).transform()
12:11:53 INFO - pipeline id pipeline_id
12:11:53 INFO - code location None
12:11:53 INFO - data factory data_ is using local data access: input_folder - /Users/shahrokhdaijavad/.cache/huggingface/hub/datasets--HuggingFaceFW--fineweb/snapshots/0f039043b23fe1d4eed300b504aa4b4a68f1c7ba/data/CC-MAIN-2013-20 output_folder - files-rep_removal
12:11:53 INFO - data factory data_ max_files -1, n_sample -1
12:11:53 INFO - data factory data_ Not using data sets, checkpointing False, max files -1, random samples -1, files to use ['.parquet'], files to checkpoint ['.parquet']
12:11:53 INFO - orchestrator rep_removal started at 2025-02-10 12:11:53
12:11:53 INFO - Number of files is 1, source profile {'max_file_size': 2048.0454998016357, 'min_file_size': 2048.0454998016357, 'total_file_size': 2048.0454998016357}
12:12:16 INFO - encoding parquet
12:51:53 INFO - making suffix array
12:51:53 INFO - Starting the deduplication process for file: /var/folders/7f/dcj_kvt1153fqpphsj6jj8w40000gn/T/tmp2_wwama8/save_dir/parquet
cpu speed: 3228 MHz, Cores: 10
12:51:53 INFO - timeout is: 45743.31654275093
12:51:53 INFO - Scheduling 96 jobs to create dataset parts.
gpu_usage: 0.00%, GPU speed: 0 MHz
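To make the timing in the log above easier to read, here is a minimal sketch that computes the gaps between the logged steps and converts the reported timeout to hours and minutes. It assumes all timestamps fall on the same day, and the `LOG_EVENTS` list and `elapsed_between` helper are illustrative names, not part of `dpk_rep_removal`. Each gap is the time until the next log entry, which is only a rough proxy for how long a step took.

```python
from datetime import datetime, timedelta

# Timestamps copied from the log above (time of day only; same day assumed).
LOG_EVENTS = [
    ("12:11:53", "orchestrator rep_removal started"),
    ("12:12:16", "encoding parquet"),
    ("12:51:53", "making suffix array"),
]


def elapsed_between(events):
    """Return (label, timedelta) pairs: time from each event to the next one."""
    parsed = [(datetime.strptime(ts, "%H:%M:%S"), label) for ts, label in events]
    return [
        (label, nxt - cur)
        for (cur, label), (nxt, _) in zip(parsed, parsed[1:])
    ]


durations = elapsed_between(LOG_EVENTS)
for label, delta in durations:
    print(f"{label}: {delta} until next log entry")

# The timeout value reported in the log, interpreted as seconds (an assumption).
timeout = timedelta(seconds=45743.31654275093)
print(f"timeout: {timeout}")
```

By this reading, about 39 minutes and 37 seconds pass between "encoding parquet" and "making suffix array", and the timeout, if it is in seconds, is roughly 12.7 hours.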
Reproduction script
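Run the following on an M1 Mac with 16 GB of memory:
import os
from huggingface_hub import hf_hub_download
from dpk_rep_removal.runtime import RepRemoval

# Download a single FineWeb parquet file into the local Hugging Face cache
REPO_ID = "HuggingFaceFW/fineweb"
FILENAME = "data/CC-MAIN-2013-20/000_00000.parquet"
file1 = hf_hub_download(repo_id=REPO_ID, filename=FILENAME, repo_type="dataset")

# Run the rep_removal transform on the folder containing the downloaded file
RepRemoval(
    input_folder=os.path.dirname(file1),
    output_folder="files-rep_removal",
    rep_removal_contents_column_name="text",
    rep_removal_num_threads=1,
).transform()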
Anything else
No response
OS
MacOS (limited support)
Python
3.10.x
Are you willing to submit a PR?
- Yes I am willing to submit a PR!