Hi, I've noticed a significant performance drop in DBSCAN after upgrading from 1.8.0 to 1.8.1. My jobs now take roughly 4x as long as they did on 1.8.0, and the number of Spark tasks has roughly doubled.
Could this be related to the GraphFrames upgrade and its changes to connected components?
graphframes/graphframes#758
Perf Comparison
| Metric | Before (GF 0.9.2) | After (GF 0.10.0) |
|---|---|---|
| Duration | 4m 1s | 15m 53s |
| Spark tasks | 25,310 | 53,188 |
| Shuffle write | 12.3MB | 20.2MB |
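
For reference, the ratios implied by the table can be checked with a few lines of Python. This is only a sketch: the numbers are copied from the table above, and the helper name `to_seconds` is made up for illustration.

```python
def to_seconds(text):
    """Convert a duration such as '15m 53s' to whole seconds."""
    total = 0
    for part in text.split():
        if part.endswith("m"):
            total += int(part[:-1]) * 60
        elif part.endswith("s"):
            total += int(part[:-1])
    return total


# Values copied from the table (GF 0.9.2 vs GF 0.10.0).
before = {"duration_s": to_seconds("4m 1s"), "tasks": 25310, "shuffle_mb": 12.3}
after = {"duration_s": to_seconds("15m 53s"), "tasks": 53188, "shuffle_mb": 20.2}

# After/before ratio per metric, rounded to two decimals.
ratios = {key: round(after[key] / before[key], 2) for key in before}
print(ratios)  # {'duration_s': 3.95, 'tasks': 2.1, 'shuffle_mb': 1.64}
```

So duration is up about 3.95x, task count about 2.1x, and shuffle write about 1.64x.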
What I've Tried
None of these resolved the issue:
- Disabling AQE
  spark.conf.set("spark.sql.adaptive.enabled", "false")
  Result: Duration improved slightly to ~14 min, but tasks increased to 75K
- Setting broadcastThreshold to -1 with AQE enabled (as recommended in the GraphFrames issue)
  spark.conf.set("spark.sql.adaptive.enabled", "true")
  spark.conf.set("graphframes.connected.components.broadcastThreshold", "-1")
  Result: No significant improvement
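
In case it helps with reproducing this, the settings above could be compared side by side with a small timing harness along these lines. This is a rough sketch, not what I actually ran: `temporary_conf`, `time_variants`, and `VARIANTS` are hypothetical names, `job` stands for whatever zero-argument callable triggers the DBSCAN run, and the only Spark calls assumed are `spark.conf.get`, `spark.conf.set`, and `spark.conf.unset`.

```python
import time
from contextlib import contextmanager


@contextmanager
def temporary_conf(spark, overrides):
    """Apply Spark conf overrides, then restore the previous values."""
    # Remember what each key was set to (None means it was not set).
    previous = {key: spark.conf.get(key, None) for key in overrides}
    for key, value in overrides.items():
        spark.conf.set(key, value)
    try:
        yield
    finally:
        for key, old in previous.items():
            if old is None:
                spark.conf.unset(key)
            else:
                spark.conf.set(key, old)


def time_variants(spark, variants, job):
    """Run job() once per named conf variant; return wall-clock seconds."""
    timings = {}
    for name, overrides in variants.items():
        with temporary_conf(spark, overrides):
            start = time.perf_counter()
            job()
            timings[name] = time.perf_counter() - start
    return timings


# The settings tried in this issue, plus an unmodified baseline.
VARIANTS = {
    "default": {},
    "aqe_off": {"spark.sql.adaptive.enabled": "false"},
    "no_broadcast": {
        "spark.sql.adaptive.enabled": "true",
        "graphframes.connected.components.broadcastThreshold": "-1",
    },
}
```

Usage would be something like `time_variants(spark, VARIANTS, run_dbscan)`, where `run_dbscan` is a placeholder for the actual job. Wall-clock time from a single run per variant is noisy, so the Spark UI numbers (tasks, shuffle write) are still worth recording alongside it.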