Jaro-Winkler in Spark and Databricks #2875

At the moment we recommend and pre-register our custom jar, but this has caused problems.

I've run some benchmarking that suggests that registering the rapidfuzz implementation from Python is around 4-5x slower than using our custom UDF (roughly 4.8x in the run below).

For many users, this slowdown may be an acceptable price for the simplicity of not needing the custom jar.
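For reference, here is a minimal sketch of what the pure-Python route could look like. It assumes `rapidfuzz` is installed on the driver and workers and that `spark` is an active `SparkSession`. The names `jaro_winkler_sim` and `register_jaro_winkler` are just illustrative, and this is not necessarily how the gist does it:

```python
def jaro_winkler_sim(s1, s2):
    """Null-safe Jaro-Winkler similarity; returns None if either input is None."""
    if s1 is None or s2 is None:
        return None
    # Assumption: rapidfuzz is available wherever the Python workers run.
    # Imported lazily so the module is only needed where the UDF executes;
    # after the first call this is just a cached module lookup.
    from rapidfuzz.distance import JaroWinkler

    return float(JaroWinkler.similarity(s1, s2))


def register_jaro_winkler(spark, name="jaro_winkler_sim"):
    """Register the function as a row-wise Python UDF usable from Spark SQL."""
    # "double" is a DDL type string, which spark.udf.register accepts
    # as the return type.
    return spark.udf.register(name, jaro_winkler_sim, "double")
```

After calling `register_jaro_winkler(spark)`, something like `spark.sql("SELECT jaro_winkler_sim('martha', 'marhta')")` should work. This corresponds to the row-wise variant in the benchmark, not the Arrow ones.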

Here's a gist that runs the benchmarking:
https://gist.github.com/RobinL/c3302934984fb510264328ff163c8233


```
============================================================
BENCHMARK RESULTS (ARROW UDF - VECTORIZED CPDIST)
============================================================
Comparisons performed: 100,000,000
Total similarity sum: 38927666.92
Total time: 9.72 seconds
Throughput: 10,291,236 comparisons/second
============================================================
                                                 
============================================================
BENCHMARK RESULTS (ARROW UDF - PYTHON LOOP)
============================================================
Comparisons performed: 100,000,000
Total similarity sum: 38927666.61
Total time: 9.78 seconds
Throughput: 10,223,212 comparisons/second
============================================================
                                                                                   
============================================================
BENCHMARK RESULTS (STANDARD PYTHON UDF - ROW-WISE)
============================================================
Comparisons performed: 100,000,000
Total similarity sum: 38927666.61
Total time: 9.86 seconds
Throughput: 10,137,041 comparisons/second
============================================================
                                              
============================================================
BENCHMARK RESULTS (SCALA/Java UDF - JAR)
============================================================
Comparisons performed: 100,000,000
Total similarity sum: 38859504.33
Total time: 2.04 seconds
Throughput: 48,920,942 comparisons/second
```
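To make the comparison explicit, this small snippet derives the slowdown factors and the gap between the similarity sums from the figures printed above. The dictionary keys are my own labels; the numbers are copied from the output as-is:

```python
# Figures copied from the benchmark output above.
throughput = {
    "arrow_vectorized": 10_291_236,
    "arrow_python_loop": 10_223_212,
    "python_row_wise": 10_137_041,
    "jar": 48_920_942,
}
similarity_sum = {"python_row_wise": 38_927_666.61, "jar": 38_859_504.33}

# How many times slower each Python variant is than the JAR UDF.
slowdown = {
    name: throughput["jar"] / value
    for name, value in throughput.items()
    if name != "jar"
}

# Relative gap between the Python and JAR similarity sums.
rel_diff = (
    abs(similarity_sum["python_row_wise"] - similarity_sum["jar"])
    / similarity_sum["jar"]
)

for name, factor in slowdown.items():
    print(f"{name}: {factor:.2f}x slower than the JAR UDF")
print(f"Relative difference in similarity sums: {rel_diff:.3%}")
```

All three Python variants come out between about 4.75x and 4.83x slower than the JAR. The similarity sums differ by a little under 0.2%, which may mean the two implementations don't score every pair identically. That is probably worth checking before swapping one for the other.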



In Databricks in particular, users may be able to register the function themselves, as follows. We should get someone to test this:

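```scala
import org.apache.commons.text.similarity.JaroWinklerSimilarity

// Untested sketch: assumes commons-text is already on the cluster classpath.
spark.udf.register("jaro_winkler_sim", (s1: String, s2: String) => {
  // Return a boxed java.lang.Double so null inputs map to SQL NULL.
  // Mixing null with a primitive Double would make the lambda's result type Any,
  // which Spark cannot map to a SQL type.
  val score: java.lang.Double =
    if (s1 == null || s2 == null) null
    // Constructed inside the lambda so the closure does not capture an
    // instance that may not be serializable.
    else new JaroWinklerSimilarity().apply(s1, s2)
  score
})
```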
