Open
Description
At the moment we recommend and pre-register our custom jar, but this has caused problems.
I've run some benchmarks that suggest registering the rapidfuzz implementation from Python is around 4 to 5 times slower than using our custom UDF from the jar (roughly 4.8x in the results below).
For many users, this slowdown may be an acceptable price for the simplicity of not needing the custom jar.
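For illustration, registering a rapidfuzz-based function from Python might look roughly like the sketch below. This is not the code from the gist: the helper name `register_jaro_winkler`, the default UDF name, and the explicit null handling are assumptions, and it presumes `pyspark`, `pandas`, `pyarrow` and `rapidfuzz` are installed on the cluster.

```python
def register_jaro_winkler(spark, name="jaro_winkler_sim"):
    """Register a rapidfuzz-backed Jaro-Winkler UDF on an existing SparkSession.

    Hypothetical helper for illustration only; `name` is an assumed UDF name.
    """
    # Imports are local so this module can be loaded without Spark installed.
    import pandas as pd
    from pyspark.sql.functions import pandas_udf
    from rapidfuzz.distance import JaroWinkler

    @pandas_udf("double")
    def jaro_winkler(left: pd.Series, right: pd.Series) -> pd.Series:
        # Plain Python loop over each Arrow batch; a null on either side
        # yields a null score instead of a similarity value.
        scores = [
            None if a is None or b is None else JaroWinkler.similarity(a, b)
            for a, b in zip(left, right)
        ]
        return pd.Series(scores, dtype="float64")

    # Make the function callable from Spark SQL under `name`.
    spark.udf.register(name, jaro_winkler)
```

Once registered, the function could be called from SQL in the same way as the jar-based UDF, for example `SELECT jaro_winkler_sim(first_name_l, first_name_r)`, where the column names are placeholders.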
Here's a gist that runs the benchmarking:
https://gist.github.com/RobinL/c3302934984fb510264328ff163c8233
============================================================
BENCHMARK RESULTS (ARROW UDF - VECTORIZED CPDIST)
============================================================
Comparisons performed: 100,000,000
Total similarity sum: 38927666.92
Total time: 9.72 seconds
Throughput: 10,291,236 comparisons/second
============================================================
============================================================
BENCHMARK RESULTS (ARROW UDF - PYTHON LOOP)
============================================================
Comparisons performed: 100,000,000
Total similarity sum: 38927666.61
Total time: 9.78 seconds
Throughput: 10,223,212 comparisons/second
============================================================
============================================================
BENCHMARK RESULTS (STANDARD PYTHON UDF - ROW-WISE)
============================================================
Comparisons performed: 100,000,000
Total similarity sum: 38927666.61
Total time: 9.86 seconds
Throughput: 10,137,041 comparisons/second
============================================================
============================================================
BENCHMARK RESULTS (SCALA/Java UDF - JAR)
============================================================
Comparisons performed: 100,000,000
Total similarity sum: 38859504.33
Total time: 2.04 seconds
Throughput: 48,920,942 comparisons/second
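To make the comparison explicit, the short sketch below computes how much slower each Python variant was than the jar, using only the total times reported above. The dictionary keys and the helper name `slowdown_vs_jar` are made up for this example.

```python
# Total times in seconds, copied from the benchmark output above
# (100,000,000 comparisons in each run).
total_seconds = {
    "arrow_vectorized_cpdist": 9.72,
    "arrow_python_loop": 9.78,
    "standard_python_udf": 9.86,
    "scala_java_jar": 2.04,
}


def slowdown_vs_jar(timings, baseline="scala_java_jar"):
    """Return each variant's total time divided by the baseline's total time."""
    base = timings[baseline]
    return {
        name: round(seconds / base, 2)
        for name, seconds in timings.items()
        if name != baseline
    }


slowdowns = slowdown_vs_jar(total_seconds)
for name, factor in slowdowns.items():
    print(f"{name}: {factor}x slower than the jar")
```

On these figures all three Python variants come out between 4.7x and 4.9x slower than the jar, so the choice between Arrow and row-wise Python UDFs makes little difference here.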
On Databricks in particular, users may be able to register the function themselves as follows. We should get someone to test this:
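import org.apache.commons.text.similarity.JaroWinklerSimilarity

// Untested sketch: assumes commons-text is available on the cluster classpath.
val jw = new JaroWinklerSimilarity()

// Return Option[Double] so that null inputs map to SQL NULL; mixing `null`
// with a primitive Double would make the lambda's result type Any.
spark.udf.register("jaro_winkler_sim", (s1: String, s2: String) =>
  if (s1 == null || s2 == null) None else Some(jw.apply(s1, s2).doubleValue())
)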