Apache Beam Pipeline cannot maximize the number of workers for criteo_preprocess.py in Google Cloud #11166

# Prerequisites

Please answer the following questions for yourself before submitting an issue.

- [x] I am using the latest TensorFlow Model Garden release and TensorFlow 2.
- [x] I am reporting the issue to the correct repository. (Model Garden official or research directory)
- [x] I checked to make sure that this issue has not been filed already.

## 1. The entire URL of the file you are using

https://github.com/tensorflow/models/blob/master/official/recommendation/ranking/preprocessing/criteo_preprocess.py

## 2. Describe the bug

1. The Apache Beam pipeline does not scale up the number of workers to increase preprocessing parallelism on Google Cloud.
2. The Cloud Storage bucket and the Compute Engine instance are in the same region.
3. I measured Cloud Storage throughput with `gsutil perfdiag -n 10 -s 100M -c 1 gs://my_storage`: 876 Mbit/s for writes and 1.56 Gbit/s for reads.
4. Generating the vocabulary with `python criteo_preprocess.py --input_path "${STORAGE_BUCKET}/criteo_sharded/training/*" --output_path "${STORAGE_BUCKET}/criteo_out/" --temp_dir "${STORAGE_BUCKET}/criteo_vocab/" --vocab_gen_mode --runner DataflowRunner --max_vocab_size 5000 --project ${PROJECT} --region ${REGION}` is very slow: it takes 30 minutes for an 11 GB input dataset.
5. In htop I see three processes for this Python command. Utilization is close to 0 on all cores, and only one thread is actively running.
6. I also used shard_rebalancer.py to repartition the input dataset into 64 or 1024 shards. This gave no improvement.
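
As a rough sanity check on the numbers above, the end-to-end rate implied by 11 GB in 30 minutes can be compared with the read throughput reported by `gsutil perfdiag`. This is only a back-of-the-envelope sketch: it assumes 1 GB = 10^9 bytes and treats the whole 30 minutes as data processing, although part of that time may be job startup or vocabulary computation. The function name is hypothetical.

```python
def effective_throughput_mbit_s(size_gb: float, minutes: float) -> float:
    """Average rate in Mbit/s for processing size_gb gigabytes in the given minutes."""
    bits = size_gb * 1e9 * 8          # assumption: 1 GB = 10^9 bytes
    return bits / (minutes * 60) / 1e6


observed = effective_throughput_mbit_s(11, 30)   # figures from the report
measured_read = 1560.0                           # 1.56 Gbit/s read, from gsutil perfdiag

print(f"observed: {observed:.1f} Mbit/s")                       # observed: 48.9 Mbit/s
print(f"fraction of measured read: {observed / measured_read:.1%}")  # 3.1%
```

At roughly 3% of the measured read throughput, storage bandwidth alone does not look like the limiting factor.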

## 3. Steps to reproduce

1. Input dataset: the Criteo Kaggle training text file, about 11 GB, uploaded to a Google Cloud Storage bucket in europe-west1. https://www.kaggle.com/datasets/mrkmakr/criteo-dataset?resource=download
2. Compute Engine instance: c2d-highcpu-32 in europe-west1-b.
3. Set STORAGE_BUCKET, PROJECT, and REGION.
4. Run the Python command above.

## 4. Expected behavior

- The Apache Beam pipeline should scale up to the maximum number of running workers.
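
For illustration, Beam's Dataflow worker options (`--num_workers`, `--max_num_workers`, `--autoscaling_algorithm`) are one way to request more workers explicitly. The sketch below only assembles such a command line. Whether criteo_preprocess.py forwards these extra flags to the Beam pipeline options is an assumption that has not been verified here, and `build_command`, the bucket and project names, and the worker counts are hypothetical.

```python
import shlex


def build_command(bucket, project, region, num_workers, max_num_workers):
    """Assemble the preprocessing command plus Dataflow worker flags as an argv list."""
    base = [
        "python", "criteo_preprocess.py",
        "--input_path", f"{bucket}/criteo_sharded/training/*",
        "--output_path", f"{bucket}/criteo_out/",
        "--temp_dir", f"{bucket}/criteo_vocab/",
        "--vocab_gen_mode",
        "--runner", "DataflowRunner",
        "--max_vocab_size", "5000",
        "--project", project,
        "--region", region,
    ]
    # Assumption: the script passes unrecognized flags through to Beam.
    worker_flags = [
        f"--num_workers={num_workers}",            # initial worker count
        f"--max_num_workers={max_num_workers}",    # autoscaling upper bound
        "--autoscaling_algorithm=THROUGHPUT_BASED",
    ]
    return base + worker_flags


# Hypothetical values, for illustration only.
cmd = build_command("gs://my_storage", "my-project", "europe-west1", 8, 32)
print(shlex.join(cmd))  # shell-quoted, so the glob is not expanded locally
```

If the script rejects these flags as unknown arguments, the same options would have to be set where the script builds its pipeline options instead.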


## 6. System information

- OS Platform and Distribution : Linux 6.1.0-18-cloud-amd64 x86_64
- TensorFlow installed from (source or binary): setup.py
- TensorFlow version: 2.15.0
- Python version: 3.9.2
