DBSCAN performance regression in 1.8.1 #2589

@nullbutt

Hi, flagging a significant performance drop in DBSCAN after upgrading to 1.8.1. My jobs now take ~4x as long as they did on 1.8.0, and the number of Spark tasks has roughly doubled.

Perhaps related to the graphframes upgrade and changes to connected components?
graphframes/graphframes#758

Perf Comparison

| Metric | Before (GF 0.9.2) | After (GF 0.10.0) |
| --- | --- | --- |
| Duration | 4m 1s | 15m 53s |
| Spark tasks | 25,310 | 53,188 |
| Shuffle write | 12.3 MB | 20.2 MB |
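
For reference, a small sketch of how the "~4x" and "doubled" figures fall out of the numbers above. The `to_seconds` helper and the dictionary keys are only illustrative names, not part of any library.

```python
def to_seconds(duration: str) -> int:
    """Convert a Spark-UI-style duration such as '15m 53s' to seconds."""
    units = {"h": 3600, "m": 60, "s": 1}
    return sum(int(part[:-1]) * units[part[-1]] for part in duration.split())


# Values copied from the comparison above.
before = {"duration_s": to_seconds("4m 1s"), "tasks": 25_310, "shuffle_mb": 12.3}
after = {"duration_s": to_seconds("15m 53s"), "tasks": 53_188, "shuffle_mb": 20.2}

# After/before ratio for each metric, rounded to two decimals.
ratios = {key: round(after[key] / before[key], 2) for key in before}
print(ratios)
```

This works out to roughly 3.95x the duration, 2.1x the tasks, and 1.64x the shuffle write.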

What I've Tried

None of these resolved the issue:

  1. Disabling AQE
   spark.conf.set("spark.sql.adaptive.enabled", "false")

Result: Duration improved slightly to ~14 min, but tasks increased to 75K

  2. Setting broadcastThreshold to -1 with AQE enabled (as recommended in the GraphFrames issue)
   spark.conf.set("spark.sql.adaptive.enabled", "true")
   spark.conf.set("graphframes.connected.components.broadcastThreshold", "-1")

Result: No significant improvement
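
In case it helps anyone reproduce the comparison, here is a rough sketch of a helper for running the same job under different settings without leaking config between runs. `spark_conf` and `timed` are hypothetical helper names; the only Spark calls assumed are `spark.conf.get`, `spark.conf.set`, and `spark.conf.unset` on an existing session.

```python
import time
from contextlib import contextmanager


@contextmanager
def spark_conf(spark, overrides):
    """Temporarily apply runtime conf overrides, restoring previous values on exit."""
    # Remember what each key was set to (None means it was not set at all).
    previous = {key: spark.conf.get(key, None) for key in overrides}
    try:
        for key, value in overrides.items():
            spark.conf.set(key, value)
        yield
    finally:
        for key, old in previous.items():
            if old is None:
                spark.conf.unset(key)
            else:
                spark.conf.set(key, old)


def timed(spark, overrides, job):
    """Run job() under the given overrides and return wall-clock seconds."""
    with spark_conf(spark, overrides):
        start = time.perf_counter()
        job()
        return time.perf_counter() - start
```

Usage would look something like `timed(spark, {"spark.sql.adaptive.enabled": "false"}, run_dbscan)`, where `run_dbscan` is a placeholder for whatever callable triggers the clustering job and forces evaluation.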
