Finding which error(s) are detected by the Damerau-Levenshtein edit distance algorithm
Question
<details>
<summary>English:</summary>
I'm creating a spelling correction tool and want to implement a noisy channel model using Bayes' theorem. To do so, I need to calculate the probability P(X|W), where X is the given (misspelled) word and W is the possible correction. The probability is read from a confusion matrix, which requires knowing which type of error happened. For example, if X = "egh" and W = "egg", the edit distance is 1 and the error is a substitution at character number 2.
I'm trying to find a way to get the error "type" as well as the character it happened on, but can't seem to make it work.
I've tried creating a TreeMap and inserting the i/j values whenever an error is detected, but it didn't work.
I may assume that there's only one error, meaning that the edit distance is exactly 1.
Here's my code:

    public static int DLD(String s1, String s2) {
        if (s1 == null || s2 == null) { // Invalid input
            return -1;
        }
        if (s1.equals(s2)) { // No distance to compute
            return 0;
        }
        // The max possible distance
        int inf = s1.length() + s2.length();
        // Create and initialize the character array indices
        HashMap<Character, Integer> da = new HashMap<>();
        for (int i = 0; i < s1.length(); ++i) {
            da.put(s1.charAt(i), 0);
        }
        for (int j = 0; j < s2.length(); ++j) {
            da.put(s2.charAt(j), 0);
        }
        // Create the distance matrix H[0 .. s1.length+1][0 .. s2.length+1]
        int[][] distances = new int[s1.length() + 2][s2.length() + 2];
        // initialize the left and top edges of H
        for (int i = 0; i <= s1.length(); ++i) {
            distances[i + 1][0] = inf;
            distances[i + 1][1] = i;
        }
        for (int j = 0; j <= s2.length(); ++j) {
            distances[0][j + 1] = inf;
            distances[1][j + 1] = j;
        }
        // fill in the distance matrix H
        // look at each character in s1
        for (int i = 1; i <= s1.length(); ++i) {
            int db = 0;
            // look at each character in s2
            for (int j = 1; j <= s2.length(); ++j) {
                int i1 = da.get(s2.charAt(j - 1));
                int j1 = db;
                int cost = 1;
                if (s1.charAt(i - 1) == s2.charAt(j - 1)) {
                    cost = 0;
                    db = j;
                }
                distances[i + 1][j + 1] = min(
                        distances[i][j] + cost, // substitution
                        distances[i + 1][j] + 1, // insertion
                        distances[i][j + 1] + 1, // deletion
                        distances[i1][j1] + (i - i1 - 1) + 1 + (j - j1 - 1));
            }
            da.put(s1.charAt(i - 1), i);
        }
        return distances[s1.length() + 1][s2.length() + 1];
    }

Any hint/direction towards solving this would be much appreciated.
Thanks!
***Edit 1:***
I figured something out and it seems to be working, although I'm not 100% sure. I replaced the code segment where I use the min() method with this:

    int sub = distances[i][j] + cost;
    int ins = distances[i + 1][j] + 1;
    int del = distances[i][j + 1] + 1;
    int trans = distances[i1][j1] + (i - i1 - 1) + 1 + (j - j1 - 1);
    distances[i + 1][j + 1] = min(sub, ins, del, trans);
    if ((distances[i][j] == 0 || distances[i - 1][j] == 0 ||
            distances[i][j - 1] == 0 || distances[i + 1][j + 1] == trans) &&
            distances[i + 1][j + 1] == 1) {
        TreeMap<String, Integer> error = mappingTermAndError.getOrDefault(s2, null);
        if (error != null) {
            error.clear();
        } else {
            error = new TreeMap<>();
        }
        if (distances[i + 1][j + 1] == trans) {
            error.put("trans", i - 2);
        } else if (distances[i + 1][j + 1] == del) {
            error.put("del", i - 1);
        } else if (distances[i + 1][j + 1] == ins) {
            error.put("ins", i - 1);
        } else { // distances[i + 1][j + 1] == sub
            error.put("sub", i - 1);
        }
        mappingTermAndError.put(s2, error);
    }

What it basically does is get the value for each error type, then calculate the minimum.
If the new minimum is 1 (so this is the first error) and one of the previous cells in the distance matrix is 0 (meaning there's a path with no errors leading to that point), or if the error is a transposition (which we can only know about after we've already had an error), then I replace the previously registered error with the new one and record the 'i' corresponding to the character where the error occurred.
I'm aware that this solution is pretty ugly and probably not very efficient, so if someone has any thoughts on how to improve that it would be great.
</details>
# Answer 1
**Score**: 2
The error type and the characters involved have to be stored somewhere. You can keep them in separate data structures, or you can encapsulate them in objects.
Here's what it could look like using objects. For simplicity I'm implementing only Levenshtein distance, but I'm sure you can easily apply the technique to Damerau–Levenshtein.
First you need to define a class that encapsulates the information about an edit: cost, parent, and any extra information like the type (replace, insert, delete) or the characters involved. To keep things simple I'm using a single string called "type" for this extra info, but you would want to add separate fields for the type of error, the character indices, etc. You may even want to use inheritance to create different subtypes of edits with different behavior.

    class Edit implements Comparable<Edit> {
        int cost;
        Edit parent;
        String type;

        public Edit() {
            // create a "start" node with no parent and zero cost
        }

        public Edit(String type, Edit parent, int cost) {
            this.type = type;
            this.cost = parent.cost + cost; // accumulated cost along the path
            this.parent = parent;
        }

        @Override
        public int compareTo(Edit o) {
            return Integer.compare(this.cost, o.cost);
        }

        @Override
        public String toString() {
            return type;
        }
    }
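
The suggestion above about separate fields could be sketched roughly as follows. Everything here is hypothetical and not part of the answer's code: the names `EditType` and `TypedEdit`, the `from`/`to` fields, and the convention that `index` is a 0-based position in the first string are all assumptions made for illustration.

```java
// Hypothetical sketch: an edit node with structured fields instead of one string.
enum EditType { START, MATCH, SUBSTITUTION, INSERTION, DELETION }

class TypedEdit implements Comparable<TypedEdit> {
    final EditType type;
    final int index;        // assumed convention: 0-based position in the first string
    final char from;        // character in the first string, '\0' when not applicable
    final char to;          // character in the second string, '\0' when not applicable
    final int cost;         // accumulated cost of the path ending in this edit
    final TypedEdit parent; // previous edit on the path, null for the start node

    // Start node: no parent, zero cost.
    TypedEdit() {
        this(EditType.START, -1, '\0', '\0', null, 0);
    }

    TypedEdit(EditType type, int index, char from, char to, TypedEdit parent, int stepCost) {
        this.type = type;
        this.index = index;
        this.from = from;
        this.to = to;
        this.parent = parent;
        this.cost = (parent == null ? 0 : parent.cost) + stepCost;
    }

    @Override
    public int compareTo(TypedEdit o) {
        return Integer.compare(this.cost, o.cost);
    }

    @Override
    public String toString() {
        return type + "@" + index;
    }
}
```

With fields like these, a confusion-matrix lookup can read the error type and the two characters directly instead of parsing them back out of a string.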
Then you use this class instead of just int for the distance table. At 0,0 there is a special start node with no parent. At all other points you choose a node with one parent or another according to the minimum cost it takes to arrive at that node. To be more flexible, let's split the building of the matrix out of the editDistance method:

    // Requires java.util.Collections and java.util.List (List.of needs Java 9+)
    Edit[][] buildMatrix(String s1, String s2) {
        Edit[][] distance = new Edit[s1.length() + 1][s2.length() + 1];
        distance[0][0] = new Edit();
        // first column: delete every character of s1
        for (int i = 1; i <= s1.length(); i++) {
            distance[i][0] = new Edit("-" + s1.charAt(i - 1), distance[i - 1][0], 1);
        }
        // first row: insert every character of s2
        for (int j = 1; j <= s2.length(); j++) {
            distance[0][j] = new Edit("+" + s2.charAt(j - 1), distance[0][j - 1], 1);
        }
        for (int i = 1; i <= s1.length(); i++) {
            for (int j = 1; j <= s2.length(); j++) {
                int replaceCost = s1.charAt(i - 1) == s2.charAt(j - 1) ? 0 : 1;
                distance[i][j] = Collections.min(List.of(
                        // replace or same
                        new Edit(s1.charAt(i - 1) + "/" + s2.charAt(j - 1), distance[i - 1][j - 1], replaceCost),
                        // delete
                        new Edit("-" + s1.charAt(i - 1), distance[i - 1][j], 1),
                        // insert
                        new Edit("+" + s2.charAt(j - 1), distance[i][j - 1], 1)));
            }
        }
        return distance;
    }
Then the "edit distance" function only needs to take the cost of the last node:

    int editDistance(String s1, String s2) {
        Edit[][] distance = buildMatrix(s1, s2);
        return distance[s1.length()][s2.length()].cost;
    }
But thanks to the "parent" pointers, you can also easily construct the list of edits needed to change one string into the other, also known as a "diff":

    // Requires java.util.ArrayList, java.util.Collections and java.util.List
    List<Edit> diff(String s1, String s2) {
        Edit[][] distance = buildMatrix(s1, s2);
        List<Edit> diff = new ArrayList<>();
        Edit edit = distance[s1.length()][s2.length()];
        // follow the parent pointers back to the start node
        while (edit != distance[0][0]) {
            diff.add(edit);
            edit = edit.parent;
        }
        Collections.reverse(diff);
        return diff;
    }
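
As a rough illustration of how the same parent-pointer idea might carry over to transpositions, here is a self-contained sketch. Note that it implements the restricted variant (optimal string alignment, adjacent swaps only), not the unrestricted Damerau–Levenshtein algorithm from the question's code with its `da`/`db` bookkeeping. The names `OsaDiff`, `Step` and `errors`, and the convention that `index` is a 0-based position in `s1`, are assumptions made for this example.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Sketch only: optimal string alignment (adjacent transpositions) with back-pointers.
class OsaDiff {
    // Hypothetical node type: one table cell plus a pointer to the cell it came from.
    static final class Step {
        final String type;  // "start", "match", "sub", "ins", "del" or "trans"
        final int index;    // assumed convention: 0-based position in s1
        final int cost;     // accumulated cost up to and including this step
        final Step parent;

        Step(String type, int index, int stepCost, Step parent) {
            this.type = type;
            this.index = index;
            this.cost = (parent == null ? 0 : parent.cost) + stepCost;
            this.parent = parent;
        }

        @Override
        public String toString() {
            return type + "@" + index;
        }
    }

    static Step[][] buildMatrix(String s1, String s2) {
        Step[][] d = new Step[s1.length() + 1][s2.length() + 1];
        d[0][0] = new Step("start", -1, 0, null);
        for (int i = 1; i <= s1.length(); i++) {
            d[i][0] = new Step("del", i - 1, 1, d[i - 1][0]);
        }
        for (int j = 1; j <= s2.length(); j++) {
            d[0][j] = new Step("ins", 0, 1, d[0][j - 1]); // inserted before s1[0]
        }
        for (int i = 1; i <= s1.length(); i++) {
            for (int j = 1; j <= s2.length(); j++) {
                boolean same = s1.charAt(i - 1) == s2.charAt(j - 1);
                // Ties are resolved in favor of the diagonal (match/substitution).
                Step best = new Step(same ? "match" : "sub", i - 1, same ? 0 : 1, d[i - 1][j - 1]);
                Step del = new Step("del", i - 1, 1, d[i - 1][j]);
                if (del.cost < best.cost) best = del;
                Step ins = new Step("ins", i, 1, d[i][j - 1]);
                if (ins.cost < best.cost) best = ins;
                // Transposition: the last two characters of each prefix are swapped.
                if (i > 1 && j > 1
                        && s1.charAt(i - 1) == s2.charAt(j - 2)
                        && s1.charAt(i - 2) == s2.charAt(j - 1)) {
                    Step trans = new Step("trans", i - 2, 1, d[i - 2][j - 2]);
                    if (trans.cost < best.cost) best = trans;
                }
                d[i][j] = best;
            }
        }
        return d;
    }

    // Walks the parent pointers back from the last cell and keeps only the real errors.
    static List<Step> errors(String s1, String s2) {
        Step[][] d = buildMatrix(s1, s2);
        List<Step> result = new ArrayList<>();
        for (Step s = d[s1.length()][s2.length()]; s.parent != null; s = s.parent) {
            if (!s.type.equals("match")) result.add(s);
        }
        Collections.reverse(result);
        return result;
    }
}
```

For the question's example, `OsaDiff.errors("egh", "egg")` yields a single step of type "sub" at index 2, and `OsaDiff.errors("teh", "the")` yields a single "trans" step at index 1. When several optimal paths exist (for instance, which of two identical letters was inserted), the reported position depends on the tie-breaking order chosen here.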