
Spark - "spark.sql.shuffle.partitions" set to "auto" is unsupported (crashes) #2898

@mbell697

Description

What happens?

If you do not provide num_partitions_on_repartition to SparkAPI, the code that attempts to determine this setting automatically consults self.spark.conf.get("spark.default.parallelism") and self.spark.conf.get("spark.sql.shuffle.partitions"); see https://github.com/moj-analytical-services/splink/blob/master/splink/internals/spark/database_api.py#L212

However, this doesn't handle the case where either of these settings is present but not numeric: a string is left in parallelism_value, which later causes an exception when the division by 2 happens below on line 217.
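The failure mode can be simulated outside Spark. Assuming parallelism_value ends up holding the raw conf string (the variable name here is illustrative), the division fails immediately:

```python
# Simulate the crash: a non-numeric conf value such as "auto" survives
# into parallelism_value, and dividing a str by an int raises TypeError.
parallelism_value = "auto"  # e.g. spark.sql.shuffle.partitions on Databricks

try:
    num_partitions = parallelism_value / 2  # mirrors the division by 2
except TypeError as exc:
    print(f"TypeError: {exc}")
```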

In particular, spark.sql.shuffle.partitions is often set to "auto" on Databricks clusters.
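One possible fix is to treat the conf values defensively: fall back to a sane default whenever a setting is missing or non-numeric. A minimal sketch, assuming a free-standing helper (the function name and default value are hypothetical, not Splink's actual code):

```python
def parse_parallelism(values, default=200):
    """Return the first value that parses as a positive int, else default.

    Tolerates settings like "auto" (Databricks) or missing confs (None).
    """
    for value in values:
        try:
            parsed = int(value)
        except (TypeError, ValueError):
            continue  # e.g. "auto" or None
        if parsed > 0:
            return parsed
    return default

# With shuffle partitions set to "auto", the default is used instead of crashing:
print(parse_parallelism(["auto", None]))  # → 200
```

Splink could apply something like this to the two conf lookups before dividing, so "auto" degrades gracefully rather than raising.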

To Reproduce

Instantiate SparkAPI without num_partitions_on_repartition while spark.sql.shuffle.partitions is set to "auto".

OS:

Databricks 16.4

Splink version:

4.0.12

Have you tried this on the latest master branch?

  • I agree

Have you tried the steps to reproduce? Do they include all relevant data and configuration? Does the issue you report still appear there?

  • I agree

Metadata


    Labels

    bug (Something isn't working)
