[Bug]: No parallelism using WriteToParquet in Apache Spark

### What happened?

When running Beam on Spark using `WriteToParquet` without `num_shards`, files seem to be written with no parallelism. In https://beam.apache.org/releases/pydoc/2.11.0/apache_beam.io.parquetio.html it says:

> num_shards – The number of files (shards) used for output. If not set, the service will decide on the optimal number of shards.


However, in Spark, my tasks looks like this:
![Screen Shot 2022-11-24 at 11 39 42 PM](https://user-images.githubusercontent.com/2646862/204147013-8010caf8-a4c8-49ee-baec-2477e521cf80.png)

I believe that this is happening because `iobase.WriteImpl` in [here](https://github.com/apache/beam/blob/master/sdks/python/apache_beam/io/iobase.py#L1156-L1157) is doing:

```
    ...
    | 'Pair' >> core.Map(lambda x: (None, x))
    | core.GroupByKey()
```

which was added in this PR: https://github.com/apache/beam/pull/958

If I understand correctly, the pcollection elements will all have the same key, `None`, and `GroupByKey` will group all those elements into a single "partition" (in Spark terms). This "None" partition is massively skewed and can only be written by 1 thread / task and will take forever.


### Issue Priority

Priority: 2

### Issue Component

Component: io-py-parquet

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

[Bug]: No parallelism using WriteToParquet in Apache Spark #24365

What happened?

Issue Priority

Issue Component

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

[Bug]: No parallelism using WriteToParquet in Apache Spark #24365

Description

What happened?

Issue Priority

Issue Component

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions