
[Bug]: No parallelism using WriteToParquet in Apache Spark #24365

Open
@cozos

Description


What happened?

When running Beam on Spark using WriteToParquet without num_shards, files seem to be written with no parallelism. The documentation at https://beam.apache.org/releases/pydoc/2.11.0/apache_beam.io.parquetio.html says:

num_shards – The number of files (shards) used for output. If not set, the service will decide on the optimal number of shards.
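For reference, a minimal pipeline that hits this code path looks like the following sketch (the schema, records, and output prefix are made up for illustration):

    import pyarrow as pa
    import apache_beam as beam

    # Hypothetical schema and output prefix, just to show the shape of the pipeline.
    schema = pa.schema([('id', pa.int64()), ('value', pa.string())])

    with beam.Pipeline() as p:
        (p
         | 'Create' >> beam.Create([{'id': i, 'value': str(i)} for i in range(1000)])
         # num_shards deliberately left unset, as in this report.
         | 'Write' >> beam.io.WriteToParquet('/tmp/output', schema))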

However, in Spark, my tasks look like this:
[Screenshot, 2022-11-24: Spark task view showing the write running as a single task]

I believe this is happening because iobase.WriteImpl (here) is doing:

    ...
    | 'Pair' >> core.Map(lambda x: (None, x))
    | core.GroupByKey()

which was added in this PR: #958
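In isolation, that Pair/GroupByKey pattern collapses the entire PCollection into a single group. A toy DirectRunner sketch (my own illustration, not the actual WriteImpl code) shows this:

    import apache_beam as beam

    with beam.Pipeline() as p:
        (p
         | beam.Create(range(1000))
         # Same pattern as the WriteImpl snippet above: every element gets key None.
         | 'Pair' >> beam.Map(lambda x: (None, x))
         | beam.GroupByKey()
         # Prints a single line: key None with all 1000 elements in one group.
         | beam.Map(lambda kv: print(kv[0], sum(1 for _ in kv[1]))))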

If I understand correctly, every element of the PCollection gets the same key, None, so GroupByKey collects all of them into a single group, i.e. a single "partition" in Spark terms. This None partition is massively skewed: it can only be written by one thread/task, and it takes forever.
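If this analysis is right, setting num_shards explicitly should restore some parallelism, since WriteImpl would then spread elements across multiple shard keys instead of the single None key. A workaround sketch (shard count, path, and schema are illustrative, not a recommendation):

    import pyarrow as pa
    import apache_beam as beam

    schema = pa.schema([('id', pa.int64()), ('value', pa.string())])

    with beam.Pipeline() as p:
        (p
         | beam.Create([{'id': i, 'value': str(i)} for i in range(1000)])
         # Forcing a shard count gives the GroupByKey num_shards distinct keys,
         # so Spark can write the shards in parallel.
         | beam.io.WriteToParquet('/tmp/output', schema, num_shards=16))

The tradeoff is that the output file count is then fixed and no longer scales with the input size.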

Issue Priority

Priority: 2

Issue Component

Component: io-py-parquet
