-
Notifications
You must be signed in to change notification settings - Fork 272
[TEXT-235] Add Damerau-Levenshtein distance #687
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[TEXT-235] Add Damerau-Levenshtein distance #687
Conversation
src/main/java/org/apache/commons/text/similarity/DamerauLevenshteinDistance.java
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hello @LorgeN
Thank you for your PR.
It looks like this PR decreases the code coverage. You can run mvn clean package site and look at the JaCoCo report in target\site and go from there to increase the coverage.
You can also run mvn (solo) to run all build checks.
|
I don't know if we can accept this PR due to possible provenance incompatibility with "Claude 4.0 Sonnet" and the Apache License. If this were a clean room PR, there would be no issue of course. |
Understood! I suppose we will see what they say. The usage here is minimal, but appreciate that the lines are somewhat blurred. If that is not acceptable, what would be the best path forward? Rewrite manually? Shouldn't really take that long, the implementation is straightforward enough. |
|
Hi @LorgeN Yes, I think the best path forward is a clean room implementation. Beyond that, I think it would help to have test data from known sources documented, even if it's just from Wikipedia. Maybe there is an original paper that has examples? Ty! |
Ack, makes sense. Will see what I can find! |
f6578ff to
f742a54
Compare
|
Hey again @garydgregory! Thanks for all your help here. I've rewritten the implementation without using any LLMs, and increased the test coverage. I was however not able to find any good examples of values anywhere. Wikipedia has nothing, and the papers all quote aggregate statistics (or use generated samples where they insert "mistakes" using a dictionary of words). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for your updates. Looks reasonable to me aside from the builds failing 😉
You MUST run mvn by itself (no args) before pushing to catch all build failures
TY.
@ppkarwasz Any thoughts?
src/main/java/org/apache/commons/text/similarity/DamerauLevenshteinDistance.java
Outdated
Show resolved
Hide resolved
|
Hey again @garydgregory! Apologies for the delay here. Everything runs successfully locally now :) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM.
@ppkarwasz Any thoughts?
|
Hey! Just checking in, @garydgregory. Anything you need from me to get this merged? Appreciate the help here :) |
|
Hello @LorgeN I am looking for advice in https://issues.apache.org/jira/browse/LEGAL-709 |
|
Hey @garydgregory, just wanted to check in again to see if there is anything I can do to help speed this up? |
- Update changes.xml - Use final - Add Javadoc - Sort members - Reduce vertical whitespace - Remove extra parentheses
|
Hello @LorgeN Merged 🚀 |
|
Too late to ask, but were there any consideration to have a SIMD implementation of the (Damerau-)Levenshtein distance? (thinking of https://github.com/Turnerj/Quickenshtein/blob/main/src/Quickenshtein/Levenshtein.Intrinsics.cs for example) |
Adds Damerau-Levenshtein distance as a measure. Implementation based on original Levenshtein distance implementation, and Wikipedia. Unit tests written by comparing various strings manually, and by comparing to the value outputted by RapidFuzz.
Relevant Jira ticket; https://issues.apache.org/jira/browse/TEXT-235