Description
What happens?
If you do not provide num_partitions_on_repartition to SparkAPI, the code that attempts to determine this setting automatically consults self.spark.conf.get("spark.default.parallelism") and self.spark.conf.get("spark.sql.shuffle.partitions"), see https://github.com/moj-analytical-services/splink/blob/master/splink/internals/spark/database_api.py#L212
However, this does not handle the case where either of these settings is present but non-numeric: the string is left in parallelism_value, which later causes an exception when the division by 2 happens below on line 217.
In particular, spark.sql.shuffle.partitions is often "auto" on Databricks clusters.
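A minimal, self-contained sketch of the failure mode and a defensive alternative. This is not the actual Splink code: the function names, the fallback default of 200, and the plain dict standing in for self.spark.conf are all assumptions for illustration.

```python
# Hypothetical mirror of the partition-count lookup in SparkAPI
# (splink/internals/spark/database_api.py). `conf` stands in for
# self.spark.conf; names and defaults here are illustrative only.

def buggy_num_partitions(conf: dict) -> int:
    parallelism_value = 200  # assumed fallback default
    for key in ("spark.default.parallelism", "spark.sql.shuffle.partitions"):
        value = conf.get(key)
        if value is not None:
            # Config values come back as strings and are never cast to int
            parallelism_value = value
    # TypeError here when the value is a non-numeric string such as "auto"
    return parallelism_value // 2

def safe_num_partitions(conf: dict, default: int = 200) -> int:
    parallelism_value = default
    for key in ("spark.default.parallelism", "spark.sql.shuffle.partitions"):
        value = conf.get(key)
        try:
            parallelism_value = int(value)
        except (TypeError, ValueError):
            # Ignore missing or non-numeric values such as "auto"
            continue
    return parallelism_value // 2

# buggy_num_partitions({"spark.sql.shuffle.partitions": "auto"}) raises TypeError
# safe_num_partitions({"spark.sql.shuffle.partitions": "auto"}) falls back to the default
```

A fix along these lines (wrapping the conf lookup in a try/except int cast) would let Splink fall back to its default rather than crashing on Databricks' "auto" setting.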
To Reproduce
Instantiate SparkAPI without num_partitions_on_repartition while spark.sql.shuffle.partitions is set to "auto".
OS:
Databricks 16.4
Splink version:
4.0.12
Have you tried this on the latest master branch?
- I agree
Have you tried the steps to reproduce? Do they include all relevant data and configuration? Does the issue you report still appear there?
- I agree