How to normalize Levenshtein distance between 0 and 1

Question

I need to normalize the Levenshtein distance to a value between 0 and 1. I see different variations floating around on Stack Overflow.

I am thinking of adopting the following approach:

  • given two strings, s1 and s2
  • len = max(s1.length(), s2.length());
  • normalized_distance = float(len - levenshteinDistance(s1, s2)) / float(len);

Then the highest score, 1.0, means an exact match, and 0.0 means no match at all.

But I see variations here:
https://stackoverflow.com/questions/15612475/two-whole-texts-similarity-using-levenshtein-distance which uses 1 - distance(a, b) / max(a.length, b.length)

https://stackoverflow.com/questions/41066394/difference-in-normalization-of-levenshtein-edit-distance

https://stackoverflow.com/questions/30787098/explanation-of-normalized-edit-distance-formula

I am wondering whether there is a canonical implementation in Java. I know org.apache.commons.text only implements LevenshteinDistance and not a normalized Levenshtein distance.

https://commons.apache.org/proper/commons-text/apidocs/org/apache/commons/text/similarity/LevenshteinDistance.html

Answer 1

Score: 2

The first answer you link to begins with "The effects of both variants should be nearly the same". The reason a normalized LevenshteinDistance doesn't exist is that you (or somebody else) haven't seen fit to implement it. Besides, it seems rather trivial once you have the Levenshtein distance:

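// Scales a precomputed Levenshtein distance to [0, 1]:
// 0.0 means the strings are identical, 1.0 means the distance equals the longer length.
private double normalizedLevenshteinDistance(double levenshtein, String s1, String s2) {
    int maxLength = Math.max(s1.length(), s2.length());
    if (maxLength == 0) {
        // Both strings are empty, hence identical; this also avoids 0.0 / 0, which is NaN.
        return 0.0;
    }
    return levenshtein / maxLength;
}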

After 3 days, once this has been thoroughly ripped to shreds, I'll add it as a GitHub issue on commons-text.
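To connect this with the question: the method above returns a normalized distance, where 0.0 means identical strings, while the score proposed in the question is a similarity, where 1.0 means identical strings. Writing d for the Levenshtein distance and n for max(s1.length(), s2.length()), and assuming n > 0, the two are complements, and the question's formula is the same expression as the 1 - distance(a, b) / max(a.length, b.length) variant from the first link, just rearranged. The symbols d, n and sim are notation introduced here for illustration, not taken from the linked answers.

```latex
\[
\mathrm{sim}(s_1, s_2) \;=\; \frac{n - d}{n} \;=\; 1 - \frac{d}{n},
\qquad
0 \le d \le n \;\Longrightarrow\; 0 \le \mathrm{sim}(s_1, s_2) \le 1 .
\]
```

The bound d <= n holds because one string can always be turned into the other with at most n single-character edits.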

Answer 2

Score: 1

It seems you need a measure of similarity rather than an actual measure of distance.

A proper measure of distance should obey the rules of a metric, as the Javadoc of the EditDistance interface in Commons Text says. There is a reason Commons Text does not include an implementation of normalized Levenshtein distance. It can be done properly, but I doubt the results would be useful.

However, using Levenshtein distance to define a measure of similarity like you suggested will work.

Apache Commons Text already has some implementations for measuring similarity. Perhaps JaroWinklerSimilarity would fit the bill.

I'd consider writing an implementation of the SimilarityScore interface using Levenshtein distance, like you suggested. It will produce slightly different results than JaroWinklerSimilarity. Using the interface for your own implementation would let you swap it easily for any implementation provided by Commons Text, so you could compare different algorithms.

Just make sure you don't divide by max(s1.length, s2.length) before checking that it's not zero!
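A minimal sketch of that idea follows. It is not Commons Text code: the class name LevenshteinSimilarity and the helper distance are made up for illustration, and the distance is computed with a plain two-row dynamic-programming loop so that the example has no dependencies. The apply(CharSequence, CharSequence) method mirrors the shape of SimilarityScore<Double>, so with Commons Text on the classpath the class could presumably declare that interface and delegate to LevenshteinDistance instead of the local helper.

```java
// Hypothetical sketch: a Levenshtein-based similarity score in [0, 1].
final class LevenshteinSimilarity {

    /** Returns 1.0 for identical inputs and 0.0 when the distance equals the longer length. */
    public Double apply(CharSequence left, CharSequence right) {
        if (left == null || right == null) {
            throw new IllegalArgumentException("CharSequences must not be null");
        }
        int maxLength = Math.max(left.length(), right.length());
        if (maxLength == 0) {
            // Two empty strings are identical; returning early avoids dividing by zero.
            return 1.0;
        }
        return 1.0 - (double) distance(left, right) / maxLength;
    }

    /** Plain Levenshtein distance: insert, delete and substitute each cost 1. */
    static int distance(CharSequence a, CharSequence b) {
        int[] previous = new int[b.length() + 1];
        int[] current = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            previous[j] = j; // turning "" into the first j chars of b takes j insertions
        }
        for (int i = 1; i <= a.length(); i++) {
            current[0] = i; // turning the first i chars of a into "" takes i deletions
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                int substitution = previous[j - 1] + cost;
                int deletion = previous[j] + 1;
                int insertion = current[j - 1] + 1;
                current[j] = Math.min(substitution, Math.min(deletion, insertion));
            }
            // Reuse the two rows instead of allocating a full matrix.
            int[] swap = previous;
            previous = current;
            current = swap;
        }
        return previous[b.length()];
    }
}
```

For example, new LevenshteinSimilarity().apply("kitten", "sitting") has distance 3 over a longer length of 7, so the score is 1 - 3/7.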

Answer 3

Score: 1

I have used a normalized edit distance or similarity (NES), which I think is very useful. It is defined by Daniel Lopresti and Jiangyin Zhou in Equation (6) of their paper: http://www.cse.lehigh.edu/~lopresti/Publications/1996/sdair96.pdf.

The NES in Python is:

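import math

def normalized_edit_similarity(m, d):
    # d : edit distance between the two strings
    # m : length of the shorter string
    if d == 0:
        return 1.0  # identical strings, including two empty ones
    if d >= m:
        # Assumption: the formula is undefined at d == m (division by zero)
        # and exceeds 1 for d > m, so treat both cases as no similarity.
        return 0.0
    return 1.0 / math.exp(d / (m - d))

print(normalized_edit_similarity(3, 0))
print(normalized_edit_similarity(3, 1))
print(normalized_edit_similarity(4, 1))
print(normalized_edit_similarity(5, 1))
print(normalized_edit_similarity(5, 2))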

Output:

1.0
0.6065306597126334
0.7165313105737893
0.7788007830714049
0.513417119032592

More examples can be found in Table 2 in the above paper.

The variable m in the above function can be replaced with the length of the longer string to fit your needs.

See also: https://stackoverflow.com/a/71266201/8583170 (I am not yet familiar with how to answer similar questions with the same answer).
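Since the question asks about Java, here is a rough Java port of the same function. The class and method names are hypothetical, and the guard for d >= m is an assumption rather than something stated in the paper: the formula is undefined at d = m and would exceed 1 beyond it, so this sketch returns 0.0 in both cases.

```java
// Hypothetical Java port of the Python normalized_edit_similarity function.
final class NormalizedEditSimilarity {

    /**
     * @param m length of the shorter string (or of the longer one, if preferred)
     * @param d edit distance between the two strings
     * @return a score in [0, 1], where 1.0 means the strings are identical
     */
    static double similarity(int m, int d) {
        if (m < 0 || d < 0) {
            throw new IllegalArgumentException("m and d must be non-negative");
        }
        if (d == 0) {
            // Identical strings, including the case of two empty strings.
            return 1.0;
        }
        if (d >= m) {
            // Assumption: no similarity once the distance reaches m.
            // This also avoids the division by zero at d == m.
            return 0.0;
        }
        return 1.0 / Math.exp((double) d / (m - d));
    }
}
```

Calling NormalizedEditSimilarity.similarity(3, 1) evaluates 1 / e^(1/2), which should agree with the Python result up to floating-point rounding.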

huangapple
  • Posted on September 29, 2020 at 13:50:03
  • When reposting, please keep the link to this article: https://go.coder-hub.com/64113621.html