Description
What happened?
When running Beam on Spark using `WriteToParquet` without `num_shards`, files seem to be written with no parallelism. The documentation at https://beam.apache.org/releases/pydoc/2.11.0/apache_beam.io.parquetio.html says:

> num_shards – The number of files (shards) used for output. If not set, the service will decide on the optimal number of shards.
However, in Spark, my tasks look like this:
I believe that this is happening because `iobase.WriteImpl` in here is doing:

```python
...
| 'Pair' >> core.Map(lambda x: (None, x))
| core.GroupByKey()
```
which was added in this PR: #958
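
As a standalone illustration of what this pattern does (a hedged sketch of the same Pair + GroupByKey shape, not the actual `iobase.WriteImpl` code):

```python
import apache_beam as beam

# Minimal sketch of the Pair + GroupByKey pattern: every element is keyed
# by None, so GroupByKey emits exactly one group, which a runner like
# Spark can only process in a single task.
with beam.Pipeline() as p:
    (
        p
        | 'Create' >> beam.Create(range(10))
        | 'Pair' >> beam.Map(lambda x: (None, x))
        | 'Group' >> beam.GroupByKey()
        | 'Print' >> beam.Map(print)  # prints one element: (None, <all 10 values>)
    )
```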
If I understand correctly, the PCollection elements will all have the same key, `None`, and `GroupByKey` will group all those elements into a single "partition" (in Spark terms). This `None` partition is massively skewed, can only be written by one thread/task, and will take forever.
Issue Priority
Priority: 2
Issue Component
Component: io-py-parquet