@LorgeN
Contributor

@LorgeN LorgeN commented Aug 5, 2025

Adds Damerau-Levenshtein distance as a measure. The implementation is based on the original Levenshtein distance implementation and on Wikipedia. Unit tests were written by comparing various strings manually and by comparing against the values output by RapidFuzz.

Relevant Jira ticket: https://issues.apache.org/jira/browse/TEXT-235
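
For readers unfamiliar with the measure, here is a minimal sketch of one common formulation, the restricted Damerau-Levenshtein distance (also called optimal string alignment). It is not the code from this PR. The class name `OsaDistanceSketch` and the method `distance` are hypothetical, and the PR may implement the unrestricted variant instead. The sketch extends the classic Levenshtein table with one extra case for swapping two adjacent characters.

```java
/** Illustrative sketch only: restricted Damerau-Levenshtein (optimal string alignment). */
final class OsaDistanceSketch {

    private OsaDistanceSketch() {
        // static helper, no instances
    }

    /** Returns the number of insertions, deletions, substitutions and adjacent swaps. */
    static int distance(final CharSequence left, final CharSequence right) {
        if (left == null || right == null) {
            throw new IllegalArgumentException("CharSequences must not be null");
        }
        final int n = left.length();
        final int m = right.length();
        // d[i][j] = distance between the first i chars of left and the first j chars of right
        final int[][] d = new int[n + 1][m + 1];
        for (int i = 0; i <= n; i++) {
            d[i][0] = i; // delete all i characters
        }
        for (int j = 0; j <= m; j++) {
            d[0][j] = j; // insert all j characters
        }
        for (int i = 1; i <= n; i++) {
            for (int j = 1; j <= m; j++) {
                final int cost = left.charAt(i - 1) == right.charAt(j - 1) ? 0 : 1;
                int best = Math.min(d[i - 1][j] + 1, d[i][j - 1] + 1); // deletion, insertion
                best = Math.min(best, d[i - 1][j - 1] + cost); // match or substitution
                if (i > 1 && j > 1
                        && left.charAt(i - 1) == right.charAt(j - 2)
                        && left.charAt(i - 2) == right.charAt(j - 1)) {
                    best = Math.min(best, d[i - 2][j - 2] + 1); // adjacent transposition
                }
                d[i][j] = best;
            }
        }
        return d[n][m];
    }
}
```

With this sketch, `OsaDistanceSketch.distance("ab", "ba")` is 1, where plain Levenshtein gives 2. The restricted variant never edits a substring more than once, so `distance("ca", "abc")` is 3, whereas the unrestricted Damerau-Levenshtein distance for that pair is 2. Small hand-checkable pairs like these can serve as documented test data.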

Member

@garydgregory garydgregory left a comment

Hello @LorgeN

Thank you for your PR.

It looks like this PR decreases the code coverage. You can run mvn clean package site and look at the JaCoCo report in target\site and go from there to increase the coverage.

You can also run mvn (solo) to run all build checks.

@garydgregory
Member

garydgregory commented Aug 5, 2025

I don't know if we can accept this PR due to possible provenance incompatibility with "Claude 4.0 Sonnet" and the Apache License. If this were a clean room PR, there would be no issue of course.

FTR https://issues.apache.org/jira/browse/LEGAL-709

@LorgeN
Contributor Author

LorgeN commented Aug 5, 2025

> I don't know if we can accept this PR due to possible provenance incompatibility with "Claude 4.0 Sonnet" and the Apache License. If this were a clean room PR, there would be no issue of course.
>
> FTR https://issues.apache.org/jira/browse/LEGAL-709

Understood! I suppose we will see what they say. The usage here is minimal, but appreciate that the lines are somewhat blurred.

If that is not acceptable, what would be the best path forward? Rewrite manually? Shouldn't really take that long, the implementation is straightforward enough.

@garydgregory
Member

garydgregory commented Aug 5, 2025

Hi @LorgeN

Yes, I think the best path forward is a clean room implementation.

Beyond that, I think it would help to have test data from known sources documented, even if it's just from Wikipedia. Maybe there is an original paper that has examples?

Ty!

@LorgeN
Contributor Author

LorgeN commented Aug 5, 2025

> Hi @LorgeN
>
> Yes, I think the best path forward is a clean room implementation.
>
> Beyond that, I think it would help to have test data from known sources documented, even if it's just from Wikipedia. Maybe there is an original paper that has examples?
>
> Ty!

Ack, makes sense. Will see what I can find!

@LorgeN LorgeN force-pushed the eiriklt/damerau-levenshtein-distance branch from f6578ff to f742a54 Compare August 5, 2025 17:24
@LorgeN
Contributor Author

LorgeN commented Aug 5, 2025

Hey again @garydgregory! Thanks for all your help here.

I've rewritten the implementation without using any LLMs, and increased the test coverage. I was however not able to find any good examples of values anywhere. Wikipedia has nothing, and the papers all quote aggregate statistics (or use generated samples where they insert "mistakes" using a dictionary of words).

@LorgeN LorgeN requested a review from garydgregory August 7, 2025 09:22
Member

@garydgregory garydgregory left a comment

@LorgeN

Thank you for your updates. Looks reasonable to me aside from the builds failing 😉

You MUST run mvn by itself (no args) before pushing to catch all build failures.

TY.

@ppkarwasz Any thoughts?

@LorgeN
Contributor Author

LorgeN commented Sep 19, 2025

Hey again @garydgregory! Apologies for the delay here. Everything runs successfully locally now :)

@LorgeN LorgeN requested a review from garydgregory September 19, 2025 16:09
Member

@garydgregory garydgregory left a comment

LGTM.

@ppkarwasz Any thoughts?

@LorgeN
Contributor Author

LorgeN commented Oct 6, 2025

Hey! Just checking in, @garydgregory. Anything you need from me to get this merged?

Appreciate the help here :)

@garydgregory
Member

garydgregory commented Oct 6, 2025

Hello @LorgeN

I am looking for advice in https://issues.apache.org/jira/browse/LEGAL-709

@LorgeN
Contributor Author

LorgeN commented Oct 29, 2025

Hey @garydgregory, just wanted to check in again to see if there is anything I can do to help speed this up?

@garydgregory
Member

Hello @LorgeN

Thank you for your patience.

As you can see from this comment from a couple of weeks ago, I asked for another review but got zero feedback. I'll merge the PR very soon, when I sit down for my next FOSS session.

@garydgregory garydgregory merged commit 79f036a into apache:master Oct 30, 2025
9 checks passed
garydgregory added a commit that referenced this pull request Oct 30, 2025
- Update changes.xml
- Use final
- Add Javadoc
- Sort members
- Reduce vertical whitespace
- Remove extra parentheses
@garydgregory
Member

Hello @LorgeN

Merged 🚀

@Jiehong

Jiehong commented Nov 5, 2025

Too late to ask, but was there any consideration of a SIMD implementation of the (Damerau-)Levenshtein distance?

(thinking of https://github.com/Turnerj/Quickenshtein/blob/main/src/Quickenshtein/Levenshtein.Intrinsics.cs for example)

@garydgregory
Copy link
Member

Hello @Jiehong

That's a question for @LorgeN, but feel free to open a pull request.
