Jaro winkler in Spark and Databricks #2875

@RobinL

At the moment we recommend and pre-register our custom jar, but this has caused problems.

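I've run some benchmarking that suggests that registering the rapidfuzz implementation from Python is roughly 4 to 5 times slower than using our custom UDF (about 9.7 to 9.9 seconds versus 2.0 seconds for 100 million comparisons in the results below).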

For many users, this slowdown may be an acceptable price for the simplicity of not needing the custom UDF jar.

Here's a gist that runs the benchmarking:
https://gist.github.com/RobinL/c3302934984fb510264328ff163c8233

============================================================
BENCHMARK RESULTS (ARROW UDF - VECTORIZED CPDIST)
============================================================
Comparisons performed: 100,000,000
Total similarity sum: 38927666.92
Total time: 9.72 seconds
Throughput: 10,291,236 comparisons/second
============================================================
                                                 
============================================================
BENCHMARK RESULTS (ARROW UDF - PYTHON LOOP)
============================================================
Comparisons performed: 100,000,000
Total similarity sum: 38927666.61
Total time: 9.78 seconds
Throughput: 10,223,212 comparisons/second
============================================================
                                                                                   
============================================================
BENCHMARK RESULTS (STANDARD PYTHON UDF - ROW-WISE)
============================================================
Comparisons performed: 100,000,000
Total similarity sum: 38927666.61
Total time: 9.86 seconds
Throughput: 10,137,041 comparisons/second
============================================================
                                              
============================================================
BENCHMARK RESULTS (SCALA/Java UDF - JAR)
============================================================
Comparisons performed: 100,000,000
Total similarity sum: 38859504.33
Total time: 2.04 seconds
Throughput: 48,920,942 comparisons/second
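
To put the totals above side by side, the slowdown of each Python variant relative to the jar can be worked out directly from the reported times. The dictionary keys are just labels chosen for this sketch; the numbers are copied from the output above.

```python
# Total times in seconds, copied from the benchmark output above.
times = {
    "arrow_udf_vectorized_cpdist": 9.72,
    "arrow_udf_python_loop": 9.78,
    "standard_python_udf": 9.86,
    "scala_java_udf_jar": 2.04,
}

baseline = times["scala_java_udf_jar"]

# Ratio of each variant's runtime to the jar's runtime.
slowdown = {name: t / baseline for name, t in times.items()}

for name, ratio in slowdown.items():
    print(f"{name}: {ratio:.2f}x the jar's runtime")
```

All three Python variants land within a few percent of each other, so in this benchmark the choice between Arrow and row-wise UDFs matters far less than the choice between Python and the jar.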

In Databricks in particular, users may be able to register the function themselves as follows. We should get someone to test this:

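import org.apache.commons.text.similarity.JaroWinklerSimilarity

// Return Option[Double] so Spark sees a nullable double: `if (...) null else <Double>`
// infers Any, and Spark cannot derive a return schema for Any.
// The scorer is created inside the function so the closure does not capture an
// outer object (JaroWinklerSimilarity is likely not Serializable). Still untested.
spark.udf.register("jaro_winkler_sim", (s1: String, s2: String) =>
  if (s1 == null || s2 == null) None
  else Some(new JaroWinklerSimilarity().apply(s1, s2).doubleValue())
)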
