Description
What happened?
When running Beam on Spark using `WriteToParquet` without `num_shards`, files seem to be written with no parallelism. In https://beam.apache.org/releases/pydoc/2.11.0/apache_beam.io.parquetio.html it says:
num_shards – The number of files (shards) used for output. If not set, the service will decide on the optimal number of shards.
However, in Spark, my tasks look like this:
I believe this is happening because `iobase.WriteImpl` in here is doing:

```python
...
| 'Pair' >> core.Map(lambda x: (None, x))
| core.GroupByKey()
```
which was added in this PR: #958

If I understand correctly, the PCollection elements will all have the same key, `None`, and `GroupByKey` will group all those elements into a single "partition" (in Spark terms). This `None` partition is massively skewed: it can only be written by one thread/task and will take forever.
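To make the skew concrete, here is a minimal, self-contained pipeline (my own illustration, not the actual `iobase` code) showing that a constant key collapses `GroupByKey` into a single group, which a keyed runner like Spark then processes with one task:

```python
# Minimal illustration (not the actual iobase.WriteImpl code): pairing every
# element with the same constant key produces exactly one group out of
# GroupByKey, so Spark handles it in a single, massively skewed task.
import apache_beam as beam

with beam.Pipeline() as p:
    single_group = (
        p
        | 'Create' >> beam.Create(range(100_000))
        # Every element gets the identical key None ...
        | 'Pair' >> beam.Map(lambda x: (None, x))
        # ... so this GroupByKey yields one (None, iterable) pair,
        # i.e. one partition holding everything.
        | 'GBK' >> beam.GroupByKey())
```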
Issue Priority
Priority: 2
Issue Component
Component: io-py-parquet
Activity
Abacn commented on Nov 28, 2022
Thanks for reporting and triaging the issue. I'm surprised by the "adding a `None` key then GroupByKey" pattern, which makes no sense today, but how GBK works has probably changed since then. We should be able to replace the change from #958 with a `Reshuffle()`. Would you mind testing whether that resolves your issue? A PR would be appreciated.
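Roughly, and only as a hedged sketch (the helper name below is hypothetical, and this is not an actual Beam patch), the swap would look like:

```python
# Hedged sketch of the suggested change, not a real Beam patch: replace the
# constant-key pairing + GroupByKey with a fusion-breaking Reshuffle.
import apache_beam as beam

def _redistribute_write_results(write_results):  # hypothetical helper name
    # Current shape in iobase.WriteImpl (paraphrased from the snippet above):
    #   write_results | 'Pair' >> beam.Map(lambda x: (None, x))
    #                 | beam.GroupByKey()
    # Sketched replacement:
    return write_results | 'BreakFusion' >> beam.Reshuffle()
```

If the grouping there also serves as a barrier/collection step for finalization, the downstream consumption of the write results might need adjusting as well.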
key then GroupByKey" which makes no sense today, but probably how GBK works has since changed. We should be able to replace the change in #958 to a ReShuffle(). Would you mind testing if it resolves your issue and appreciate if opening a PR?cozos commentedon Nov 28, 2022
Hi @Abacn, thanks for your response.
Upon a closer reading of `iobase._WriteBundleDoFn`, I realized that it does not actually return a PCollection of all elements, but rather a PCollection of the file paths that the elements were written to. This makes the `None` `GroupByKey` a bit better, as the shuffle skew only applies to the number of files (several hundred or thousand), which is a much smaller magnitude than elements/rows (millions).

With this in mind, the poor performance from the GroupByKey is perplexing, especially since it seems to work fine in GCP Dataflow but not on Spark. Any ideas?
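For reference, the shape I'm describing is roughly this (a hypothetical illustration with made-up names, not the real `_WriteBundleDoFn`): the per-bundle/batch write emits only a file path, so the later `(None, x)` pairing and `GroupByKey` shuffle file paths rather than rows:

```python
# Hypothetical illustration (made-up names, not the real iobase code): each
# batch of records is written to its own file, and only the file path flows
# into the constant-key GroupByKey, so the skewed group holds file paths,
# not millions of rows.
import uuid
import apache_beam as beam

def write_batch_to_file(batch, prefix):
    """Writes one batch of records to its own file and returns the path."""
    path = '%s-%s.txt' % (prefix, uuid.uuid4().hex)
    with open(path, 'w') as f:
        f.write('\n'.join(str(record) for record in batch))
    return path

with beam.Pipeline() as p:
    grouped_paths = (
        p
        | 'Create' >> beam.Create(range(100_000))
        | 'Batch' >> beam.BatchElements(min_batch_size=1_000)
        | 'WriteBatch' >> beam.Map(write_batch_to_file, prefix='/tmp/records')
        | 'Pair' >> beam.Map(lambda path: (None, path))
        | 'GBK' >> beam.GroupByKey())
```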
Here is where my Spark job is stuck on the Beam-to-Spark translation:
I will give this a try.
Thanks
Abacn commented on Nov 28, 2022
@cozos This issue caught my attention partly because Python text IO write actually has noticeably worse performance than the Java SDK:
Java metrics: http://104.154.241.245/d/bnlHKP3Wz/java-io-it-tests-dataflow?orgId=1&viewPanel=4
Python metrics: http://104.154.241.245/d/gP7vMPqZz/python-io-it-tests-dataflow?orgId=1&viewPanel=5
Java Read ~20s; Java Write ~30s; Python Read ~100s; Python Write 350s
I recently implemented this Python performance test and found this, and I am trying to figure out the performance bottlenecks in Python file-based IOs.
That said, Dataflow may also be affected.
cozos commented on Nov 29, 2022
I see, interesting. What I am experiencing in Beam on Spark is not "Python is much slower than Java"; it's more like "WriteToParquet does not work at all for moderately sized data". Nevertheless, please keep me posted on your Python performance investigations.
By the way, I tried replacing `GroupByKey` with `Reshuffle` and it did not help for my Spark pipeline. I am now trying to remove all shuffles before finalization completely.

cozos commented on Nov 29, 2022
@Abacn Can you shed some light on why we want to trigger a reshuffle here in the first place?
Abacn commented on Nov 29, 2022
@cozos Thanks for the follow-up. I do not have much knowledge of the Spark runner, but improving Python file-based IO is ongoing work. I will keep you updated, of course.
mosche commented on Nov 30, 2022
Similarly, for the read side see #24422
kennknowles commented on Dec 1, 2022
Looking at this, is the assumption that there are very few elements coming out of the write fn?
cozos commented on Dec 2, 2022
Upon thinking about this further, I think the bottleneck came from the reader problem I described in #24422: basically, all Reads happen on a single partition on runners that don't support SDF. This issue was obscured by the Spark UI showing the stage stuck at the shuffle boundary coming from WriteToParquet, when in reality the bottleneck was the Read.
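For completeness, the mitigation I'm aware of for that (my own note, not something suggested in this thread, and the path is hypothetical) is to put a `Reshuffle` right after the read: the read itself still runs on one task, but everything downstream gets redistributed across workers:

```python
# Hedged sketch of a common workaround (hypothetical path): on a runner that
# does not split the source, the read still happens on a single task, but a
# Reshuffle right after it redistributes the downstream work across workers,
# at the cost of an extra shuffle.
import apache_beam as beam

with beam.Pipeline() as p:
    rows = (
        p
        | 'Read' >> beam.io.ReadFromParquet('/tmp/input/*.parquet')
        | 'Redistribute' >> beam.Reshuffle()
        | 'Process' >> beam.Map(lambda row: row))
```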
We can close this issue, but I don't know if the `GroupByKey` on `None` is still a problem we want to track (as it could also cause a bottleneck).