September 29, 2020, 13:50:03

How to normalize Levenshtein distance to a value between 0 and 1

Question

I have to normalize the Levenshtein distance to a value between 0 and 1. I see different variations floating around on SO.

I am thinking of adopting the following approach:

Given two strings, s1 and s2:
len = max(s1.length(), s2.length());
normalized_distance = float(len - levenshteinDistance(s1, s2)) / float(len);

Then the highest score, 1.0, means an exact match, and 0.0 means no match.

But I see variations here:
https://stackoverflow.com/questions/15612475/two-whole-texts-similarity-using-levenshtein-distance which uses 1 - distance(a, b) / max(a.length, b.length)

https://stackoverflow.com/questions/41066394/difference-in-normalization-of-levenshtein-edit-distance

https://stackoverflow.com/questions/30787098/explanation-of-normalized-edit-distance-formula

Is there a canonical implementation in Java? I know that org.apache.commons.text implements only LevenshteinDistance, not a normalized Levenshtein distance.

https://commons.apache.org/proper/commons-text/apidocs/org/apache/commons/text/similarity/LevenshteinDistance.html

Answer 1

Score: 2

Your first answer begins with "The effects of both variants should be nearly the same". The reason a normalized LevenshteinDistance doesn't exist is that nobody (you or anyone else) has seen fit to implement it. Besides, it seems rather trivial once you have the Levenshtein distance:

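private double normalizedLevenshteinDistance(double levenshtein, String s1, String s2) {
    // Normalize by the length of the longer string.
    int longest = Math.max(s1.length(), s2.length());
    // Two empty strings are identical; this also avoids dividing by zero (NaN).
    return longest == 0 ? 0.0 : levenshtein / longest;
}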

After 3 days, once this has been thoroughly ripped to shreds, I'll add it as a GitHub issue on commons-text.
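For readers who want the whole pipeline in one place, here is a minimal end-to-end sketch, not a canonical implementation. It computes the Levenshtein distance with the usual two-row dynamic-programming table, then scales it by the length of the longer string. The class name NormalizedLevenshtein and its method names are hypothetical, and treating two empty strings as identical is an assumption. Note that the question's (len - d) / len is algebraically the same as 1 - d / len.

```java
// A sketch under stated assumptions; class and method names are hypothetical.
final class NormalizedLevenshtein {

    private NormalizedLevenshtein() {}

    /** Two-row dynamic-programming Levenshtein distance with unit costs. */
    static int distance(CharSequence a, CharSequence b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // building the first j chars of b from "" takes j insertions
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // reducing the first i chars of a to "" takes i deletions
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1),
                                   prev[j - 1] + cost);
            }
            int[] tmp = prev; // swap rows so prev always holds the finished row
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    /** Distance scaled into [0, 1]: 0.0 means identical, 1.0 means maximally different. */
    static double normalizedDistance(CharSequence a, CharSequence b) {
        int longest = Math.max(a.length(), b.length());
        if (longest == 0) {
            return 0.0; // assumption: two empty strings count as identical
        }
        return (double) distance(a, b) / longest;
    }

    /** Similarity in [0, 1]: 1 - d / len, which equals (len - d) / len. */
    static double similarity(CharSequence a, CharSequence b) {
        return 1.0 - normalizedDistance(a, b);
    }
}
```

For example, "kitten" and "sitting" have distance 3 and the longer string has 7 characters, so the normalized distance is 3/7 and the similarity is 4/7.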

Answer 2

Score: 1

It seems you need a measure of similarity rather than an actual measure of distance.

A proper measure of distance should obey the rules of a metric, as the Javadoc of the EditDistance interface in Commons Text says. There is a reason Commons Text does not include an implementation of normalized Levenshtein distance. It can be done properly, but I doubt the results would be useful.

However, using Levenshtein distance to define a measure of similarity like you suggested will work.

Apache Commons Text already has some implementations for measuring similarity. Perhaps JaroWinklerSimilarity would fit the bill.

I'd consider writing an implementation of the SimilarityScore interface using Levenshtein distance, as you suggested. It will produce slightly different results from JaroWinklerSimilarity. Implementing the interface yourself would make it easy to swap in any implementation provided by Commons Text, so you could easily compare different algorithms.

Just make sure you don't divide by max(s1.length, s2.length) before checking that it's not zero!
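The interface-based approach described above might look roughly like the sketch below. To stay self-contained it declares a local stand-in interface with the same shape as Commons Text's SimilarityScore (a single apply(CharSequence, CharSequence) method); in a real project you would presumably implement the library's interface instead. The class name EditDistanceSimilarity is hypothetical, and the edit-distance function is passed in so that any implementation can be plugged in.

```java
import java.util.function.ToIntBiFunction;

// Local stand-in mirroring the shape of Commons Text's SimilarityScore<R>,
// declared here only so the sketch compiles without the library.
interface SimilarityScore<R> {
    R apply(CharSequence left, CharSequence right);
}

// Hypothetical class: turns any edit-distance function into a [0, 1] similarity.
final class EditDistanceSimilarity implements SimilarityScore<Double> {

    private final ToIntBiFunction<CharSequence, CharSequence> distance;

    EditDistanceSimilarity(ToIntBiFunction<CharSequence, CharSequence> distance) {
        this.distance = distance;
    }

    @Override
    public Double apply(CharSequence left, CharSequence right) {
        int longest = Math.max(left.length(), right.length());
        if (longest == 0) {
            return 1.0; // check for zero BEFORE dividing; two empty strings match
        }
        // Assumes the distance never exceeds the longer length (true for Levenshtein).
        return 1.0 - (double) distance.applyAsInt(left, right) / longest;
    }
}
```

With Commons Text on the classpath, the distance argument could plausibly be a method reference such as LevenshteinDistance.getDefaultInstance()::apply, although that wiring is not tested here. With a distance of 1 between two four-character strings, the score is 0.75.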

Answer 3

Score: 1

I have used a normalized edit distance or similarity (NES), which I find very useful. It is defined by Daniel Lopresti and Jiangyin Zhou in Equation (6) of their paper: http://www.cse.lehigh.edu/~lopresti/Publications/1996/sdair96.pdf.

The NES in Python is:

import math


def normalized_edit_similarity(m, d):
    # d: edit distance between the two strings
    # m: length of the shorter string
    # Requires d != m; when d == m the division raises ZeroDivisionError.
    return 1.0 / math.exp(d / (m - d))


print(normalized_edit_similarity(3, 0))
print(normalized_edit_similarity(3, 1))
print(normalized_edit_similarity(4, 1))
print(normalized_edit_similarity(5, 1))
print(normalized_edit_similarity(5, 2))

Output:

1.0
0.6065306597126334
0.7165313105737893
0.7788007830714049
0.513417119032592

More examples can be found in Table 2 in the above paper.

The variable m in the above function can be replaced with the length of the longer string to suit your needs.

See also: https://stackoverflow.com/a/71266201/8583170 (I am not yet familiar with how to answer similar questions with the same answer).
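Since the question asks for Java, the NES formula above could be ported roughly as follows. This is a sketch only: the class and method names are hypothetical, and the two guards are assumptions that are not part of the formula as quoted. One returns 1.0 when the distance is zero, and the other returns 0.0 when d >= m, where the exponent would otherwise be undefined or the score would leave the 0 to 1 range.

```java
// A sketch of the NES formula; names and edge-case handling are assumptions.
final class NormalizedEditSimilarity {

    private NormalizedEditSimilarity() {}

    /**
     * @param m length of the shorter string (or of the longer one, if preferred)
     * @param d edit distance between the two strings
     * @return a similarity score in [0, 1]
     */
    static double nes(int m, int d) {
        if (m < 0 || d < 0) {
            throw new IllegalArgumentException("m and d must be non-negative");
        }
        if (d == 0) {
            return 1.0; // no edits needed, including the case of two empty strings
        }
        if (d >= m) {
            return 0.0; // assumption: m - d <= 0 is treated as "no similarity"
        }
        // exp(-d / (m - d)) is the same value as 1 / exp(d / (m - d)).
        return Math.exp(-(double) d / (m - d));
    }
}
```

Calling nes(3, 1) evaluates exp(-0.5), which agrees with the second value printed by the Python version up to floating-point rounding.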
