Apache Beam Pipeline cannot maximize the number of workers for criteo_preprocess.py in Google Cloud #11166

Open
@Arith2

Description

Prerequisites

Please answer the following questions for yourself before submitting an issue.

  • I am using the latest TensorFlow Model Garden release and TensorFlow 2.
  • I am reporting the issue to the correct repository. (Model Garden official or research directory)
  • I checked to make sure that this issue has not been filed already.

1. The entire URL of the file you are using

https://github.com/tensorflow/models/blob/master/official/recommendation/ranking/preprocessing/criteo_preprocess.py

2. Describe the bug

  1. The Apache Beam pipeline does not scale up the number of workers, so preprocessing on Google Cloud gains no parallelism.
  2. The Cloud Storage bucket and the Compute Engine instance are in the same region.
  3. I measured Cloud Storage throughput with "gsutil perfdiag -n 10 -s 100M -c 1 gs://my_storage": 876 Mbit/s for writes and 1.56 Gbit/s for reads.
  4. Generating the vocabulary with "python criteo_preprocess.py --input_path "${STORAGE_BUCKET}/criteo_sharded/training/*" --output_path "${STORAGE_BUCKET}/criteo_out/" --temp_dir "${STORAGE_BUCKET}/criteo_vocab/" --vocab_gen_mode --runner DataflowRunner --max_vocab_size 5000 --project ${PROJECT} --region ${REGION}" is very slow: it takes 30 minutes for an 11 GB input dataset.
  5. In htop I see three processes for this python command. Utilization of all cores is close to 0, and only one thread is actively running.
  6. I also used shard_rebalancer.py to repartition the input dataset into 64 or 1024 shards. There was no improvement.
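
To put the numbers above side by side, here is a rough back-of-the-envelope calculation. It is only a sketch: it assumes 11 GB means decimal gigabytes and that the whole 30 minutes was spent streaming the input once, neither of which is confirmed. The helper names are made up for illustration.

```python
# Rough comparison of the observed pipeline rate with the measured
# Cloud Storage read throughput. All inputs come from the report above;
# the decimal-gigabyte interpretation is an assumption.
DATASET_BYTES = 11e9        # 11 GB input (assumed decimal GB)
RUNTIME_SECONDS = 30 * 60   # reported run time: 30 minutes
READ_MBIT_PER_S = 1560.0    # 1.56 Gbit/s read, from gsutil perfdiag


def effective_mbit_per_s(num_bytes, seconds):
    """Average rate in Mbit/s if num_bytes were processed in `seconds`."""
    return num_bytes * 8 / seconds / 1e6


def seconds_at_rate(num_bytes, mbit_per_s):
    """Time in seconds to move num_bytes at a sustained rate in Mbit/s."""
    return num_bytes * 8 / (mbit_per_s * 1e6)


observed_rate = effective_mbit_per_s(DATASET_BYTES, RUNTIME_SECONDS)
ideal_read_seconds = seconds_at_rate(DATASET_BYTES, READ_MBIT_PER_S)
fraction_of_read_bandwidth = observed_rate / READ_MBIT_PER_S

print(f"observed rate: {observed_rate:.1f} Mbit/s")
print(f"time to read input at measured bandwidth: {ideal_read_seconds:.0f} s")
print(f"fraction of read bandwidth used: {fraction_of_read_bandwidth:.1%}")
```

Under these assumptions the job averages roughly 49 Mbit/s, a few percent of the measured read bandwidth, which suggests that storage throughput is unlikely to be the limiting factor.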

3. Steps to reproduce

  1. Input dataset: the Criteo Kaggle training text file, about 11 GB, uploaded to a Google Cloud Storage bucket in europe-west1. https://www.kaggle.com/datasets/mrkmakr/criteo-dataset?resource=download
  2. Compute Engine instance: c2d-highcpu-32 in europe-west1-b
  3. Specify STORAGE_BUCKET, PROJECT, REGION
  4. Run the python command above.

4. Expected behavior

  • The Apache Beam pipeline scales up to the maximum number of running workers, so that preprocessing runs in parallel.
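
For reference, Beam's Python SDK defines worker-scaling pipeline options such as `--num_workers` and `--max_num_workers` for Dataflow jobs. The sketch below only assembles the command from this report with those two flags appended. Whether criteo_preprocess.py forwards extra flags to Beam's pipeline options is an assumption that has not been verified here, and `build_preprocess_command` and `my-project` are hypothetical names. Note also that with DataflowRunner the transforms normally execute on Dataflow worker VMs, so htop on the submitting machine may not reflect the job's parallelism.

```python
import shlex


def build_preprocess_command(bucket, project, region,
                             num_workers=None, max_num_workers=None):
    """Build the argv list for the vocabulary-generation run in this report.

    The worker flags are standard Beam/Dataflow option names; passing them
    through criteo_preprocess.py is an unverified assumption.
    """
    cmd = [
        "python", "criteo_preprocess.py",
        "--input_path", f"{bucket}/criteo_sharded/training/*",
        "--output_path", f"{bucket}/criteo_out/",
        "--temp_dir", f"{bucket}/criteo_vocab/",
        "--vocab_gen_mode",
        "--runner", "DataflowRunner",
        "--max_vocab_size", "5000",
        "--project", project,
        "--region", region,
    ]
    if num_workers is not None:
        # Initial number of Dataflow workers.
        cmd += ["--num_workers", str(num_workers)]
    if max_num_workers is not None:
        # Upper bound for Dataflow autoscaling.
        cmd += ["--max_num_workers", str(max_num_workers)]
    return cmd


# "my-project" is a placeholder project ID, not taken from the report.
command = build_preprocess_command(
    "gs://my_storage", "my-project", "europe-west1",
    num_workers=8, max_num_workers=32,
)
print(shlex.join(command))
```

If the script rejects the extra flags, the same options would have to be set where the script constructs its pipeline options instead.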

6. System information

  • OS Platform and Distribution : Linux 6.1.0-18-cloud-amd64 x86_64
  • TensorFlow installed from (source or binary): setup.py
  • TensorFlow version: 2.15.0
  • Python version: 3.9.2

Labels

models:official (models that come under official repository), type:bug (Bug in the code)
