Description
Prerequisites
Please answer the following questions for yourself before submitting an issue.
- I am using the latest TensorFlow Model Garden release and TensorFlow 2.
- I am reporting the issue to the correct repository. (Model Garden official or research directory)
- I checked to make sure that this issue has not been filed already.
1. The entire URL of the file you are using
2. Describe the bug
- The Apache Beam pipeline does not scale out to the maximum number of workers, so preprocessing on Google Cloud runs with little parallelism.
- The Cloud Storage bucket and the Compute Engine instance are in the same region.
- I ran "gsutil perfdiag -n 10 -s 100M -c 1 gs://my_storage" to test the throughput of Google Cloud Storage: 876 Mbit/s for writing and 1.56 Gbit/s for reading.
- When I generate the vocabulary by running "python criteo_preprocess.py --input_path "${STORAGE_BUCKET}/criteo_sharded/training/*" --output_path "${STORAGE_BUCKET}/criteo_out/" --temp_dir "${STORAGE_BUCKET}/criteo_vocab/" --vocab_gen_mode --runner DataflowRunner --max_vocab_size 5000 --project ${PROJECT} --region ${REGION}", it is very slow. It takes 30 minutes for an 11GB input dataset.
- In htop I see three processes for this python command. Utilization of all cores is nearly 0, and only 1 thread is actively running.
- I also used shard_rebalancer.py to repartition the input dataset into 64 or 1024 shards. There was no improvement.
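As a rough sanity check, the figures above can be turned into an effective throughput and compared with the measured read bandwidth. This is only a back-of-the-envelope sketch: it assumes decimal units (1 GB = 1e9 bytes) and treats the whole 30 minutes as time spent reading the input, which ignores job startup and compute time. The helper name `effective_mbit_per_s` is made up for illustration.

```python
# Figures taken from the report above.
# Assumption: decimal units (1 GB = 1e9 bytes, 1 Mbit = 1e6 bits).
DATASET_BYTES = 11e9            # ~11GB input dataset
ELAPSED_SECONDS = 30 * 60       # 30 minutes for vocabulary generation
MEASURED_READ_MBIT_S = 1560.0   # 1.56 Gbit/s read, from gsutil perfdiag


def effective_mbit_per_s(num_bytes, seconds):
    """Average throughput in Mbit/s when num_bytes are processed in seconds."""
    return num_bytes * 8 / seconds / 1e6


effective = effective_mbit_per_s(DATASET_BYTES, ELAPSED_SECONDS)
ratio = MEASURED_READ_MBIT_S / effective
print(f"effective: {effective:.1f} Mbit/s, measured read is {ratio:.0f}x higher")
```

Under these assumptions the job processes the input at roughly 49 Mbit/s, about 32 times below the measured read throughput, which suggests that storage bandwidth is not the limiting factor.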
3. Steps to reproduce
- Input dataset: the training text file of the Criteo Kaggle dataset, about 11GB. I uploaded it to Google Cloud Storage in europe-west1. https://www.kaggle.com/datasets/mrkmakr/criteo-dataset?resource=download
- Compute Engine c2d-highcpu-32 in europe-west1-b
- Specify STORAGE_BUCKET, PROJECT, REGION
- Run the python command above.
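For convenience, the steps above can be sketched as a small script that assembles the command from the three environment variables. This is only an illustration: the flags are copied from the command in this report, `build_command` is a hypothetical helper, the fallback values are placeholders, and it assumes the command is run from the directory that contains criteo_preprocess.py.

```python
import os
import shlex


def build_command(bucket, project, region, max_vocab_size=5000):
    """Return the argument list for the vocabulary-generation run in this report."""
    return [
        "python", "criteo_preprocess.py",
        "--input_path", f"{bucket}/criteo_sharded/training/*",
        "--output_path", f"{bucket}/criteo_out/",
        "--temp_dir", f"{bucket}/criteo_vocab/",
        "--vocab_gen_mode",
        "--runner", "DataflowRunner",
        "--max_vocab_size", str(max_vocab_size),
        "--project", project,
        "--region", region,
    ]


# The fallback values are placeholders, not real resources.
cmd = build_command(
    os.environ.get("STORAGE_BUCKET", "gs://my_storage"),
    os.environ.get("PROJECT", "my-project"),
    os.environ.get("REGION", "europe-west1"),
)

# Print a shell-quoted command line; shlex.join quotes the glob so the
# shell does not expand it locally.
print(shlex.join(cmd))
```

The printed line can be pasted into a shell, or the list can be passed to `subprocess.run(cmd, check=True)` to launch the job directly.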
4. Expected behavior
- The Apache Beam pipeline scales up to the maximum number of running workers.
6. System information
- OS Platform and Distribution : Linux 6.1.0-18-cloud-amd64 x86_64
- TensorFlow installed from (source or binary): setup.py
- TensorFlow version: 2.15.0
- Python version: 3.9.2