High Precision Word Alignment Algorithm in Python

Question

I am working on a project for building a high precision word alignment between sentences and their translations in other languages, for measuring translation quality. I am aware of Giza++ and other word alignment tools that are used as part of the pipeline for Statistical Machine Translation, but this is not what I'm looking for. I'm looking for an algorithm that can map words from the source sentence into the corresponding words in the target sentence, transparently and accurately given these restrictions:

  • the two languages do not have the same word order, and the order keeps changing
  • some words in the source sentence do not have corresponding words in the target sentence, and vice versa
  • sometimes a word in the source corresponds to multiple words in the target, and vice versa, and there can be many-to-many mappings
  • there can be sentences where the same word is used multiple times in the sentence, so the alignment needs to be done with the words and their indexes, not only words

Here is what I did:

  • Start with a list of sentence pairs, say English-German, with each sentence tokenized to words
  • Index all words in each sentence, and create an inverted index for each word (e.g. the word "world" occurred in sentences # 5, 16, 19, 26 ... etc), for both source and target words
  • Now this inverted index can predict the correlation between any source word and any target word, as the size of the intersection of their sentence sets divided by the size of their union. For example, if the target word "Welt" occurs in sentences 5, 16, 26, 32, the correlation between (world, Welt) is the number of indexes in the intersection (3) divided by the number of indexes in the union (5), hence a correlation of 0.6. Using the union keeps the correlation low for high-frequency words, such as "the", and their corresponding words in other languages
  • Iterate over all sentence pairs again, and use the indexes of the source and target words in a given sentence pair to create a correlation matrix (a minimal sketch of these steps follows this list)
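
For concreteness, here is a minimal sketch of the inverted-index and correlation steps just described. The toy corpus and every function and variable name below are illustrative only, not part of the original code:

from collections import defaultdict

# Toy parallel corpus: each pair is (source tokens, target tokens).
# In practice this would be the full list of tokenized sentence pairs.
sentence_pairs = [
    (["I", "know", "this"], ["Ich", "kenne", "das"]),
    (["I", "see", "the", "world"], ["Ich", "sehe", "die", "Welt"]),
    (["the", "world", "is", "big"], ["die", "Welt", "ist", "gross"]),
]

# Inverted indexes: word -> set of sentence ids in which it occurs.
src_index, trg_index = defaultdict(set), defaultdict(set)
for sent_id, (src_words, trg_words) in enumerate(sentence_pairs):
    for w in src_words:
        src_index[w].add(sent_id)
    for w in trg_words:
        trg_index[w].add(sent_id)

def correlation(src_word, trg_word):
    # Intersection over union (Jaccard) of the two words' sentence sets.
    s, t = src_index[src_word], trg_index[trg_word]
    union = s | t
    return len(s & t) / len(union) if union else 0.0

def correlation_matrix(src_words, trg_words):
    # One row per source word, one column per target word.
    return [[correlation(sw, tw) for tw in trg_words] for sw in src_words]

src, trg = sentence_pairs[1]
for src_word, row in zip(src, correlation_matrix(src, trg)):
    print(src_word, ["%.2f" % v for v in row])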

Here is an example of a correlation matrix between an English and a German sentence. We can see the challenges discussed above.

[image: correlation matrix between an English and a German sentence]

The image shows an example alignment between an English and a German sentence, with the correlation between each word pair; the green cells are the correct alignment points that the word-alignment algorithm should identify.

Here is some of what I tried:

  • It is possible in some cases that the intended alignment is simply the word pair with the highest correlation in its respective column and row, but in many cases it's not.
  • I have tried things like Dijkstra's algorithm to draw a path connecting the alignment points, but that does not seem to work: because of the differing word order you can jump back and forth to earlier words in the sentence, and there is no sensible way to skip words for which there is no alignment.
  • I think the optimum solution will involve something like expanding rectangles which start from the most likely correspondences, span many-to-many correspondences, and skip words with no alignment, but I'm not exactly sure what would be a good way to implement this.

Here is the code I am using:

import random

src_words = ["I", "know", "this"]
trg_words = ["Ich", "kenne", "das"]

def match_indexes(word1, word2):
    return random.random()  # placeholder: replace this with the actual correlation value

all_pairs_vals = []  # list of (src index, trg index, correlation value) for every word pair
for i in range(len(src_words)):  # iterate over src indexes
    src_word = src_words[i]  # identify the corresponding src word
    for j in range(len(trg_words)):  # iterate over trg indexes
        trg_word = trg_words[j]  # identify the corresponding trg word
        val = match_indexes(src_word, trg_word)  # get the matching value from the inverted indexes of each word (or from the data provided in the spreadsheet)
        all_pairs_vals.append((i, j, val))  # add the sentence indexes for src and trg, and the corresponding val

all_pairs_vals.sort(key=lambda x: -x[-1])  # sort in descending order, so the pairs with the highest correlation come first
selected_alignments = []
used_i, used_j = set(), set()  # src (row) and trg (column) indexes that have already been aligned
for i0, j0, val0 in all_pairs_vals:
    if i0 in used_i: continue  # if the current src index i0 has already been used, skip this pair
    if j0 in used_j: continue  # same if the current trg index was used before
    selected_alignments.append((i0, j0))  # otherwise, add the current pair to the final alignment point selection
    used_i.add(i0)  # and mark both indexes as used so they will not be aligned again
    used_j.add(j0)

for a in all_pairs_vals:  # list all pairs and indicate which ones were selected
    i0, j0, val0 = a
    if (i0, j0) in selected_alignments: print(a, "<<<<")
    else: print(a)

It's problematic because it doesn't accommodate many-to-many, or even one-to-many alignments, and can err easily at the beginning by selecting a wrong pair with the highest correlation, excluding its row and column from future selection. A good algorithm would factor in that a certain pair has the highest correlation in its respective row/column, but would also consider the proximity to other pairs with high correlations.
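
One incremental improvement over the greedy selection above, still limited to one-to-one links, is to pick the set of pairs that maximizes the total correlation globally rather than greedily; that way a single early mistake cannot block better pairs later. Here is a minimal sketch using scipy.optimize.linear_sum_assignment (SciPy's linear assignment solver); the threshold value is an arbitrary illustration:

import numpy as np
from scipy.optimize import linear_sum_assignment

def align_one_to_one(corr_matrix, min_corr=0.1):
    # Globally optimal one-to-one alignment of a (src x trg) correlation matrix:
    # choose the pairing that maximizes the sum of correlations.
    corr = np.asarray(corr_matrix)
    rows, cols = linear_sum_assignment(corr, maximize=True)
    # Drop pairs whose correlation is too low to be a plausible alignment;
    # min_corr is an arbitrary cutoff for illustration, tune it on real data.
    return [(int(i), int(j)) for i, j in zip(rows, cols) if corr[i, j] >= min_corr]

# A small made-up correlation matrix: 3 source words x 4 target words.
example = [
    [0.7, 0.1, 0.0, 0.0],
    [0.1, 0.0, 0.6, 0.2],
    [0.0, 0.5, 0.1, 0.0],
]
print(align_one_to_one(example))  # [(0, 0), (1, 2), (2, 1)]

This still cannot produce one-to-many or many-to-many links, but it removes the order-of-selection problem of the greedy loop.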

Here is some data to try if you like; it's in Google Sheets:
https://docs.google.com/spreadsheets/d/1-eO47RH6SLwtYxnYygow1mvbqwMWVqSoAhW64aZrubo/edit?usp=sharing

Answer 1

Score: 6

Word alignment remains an open research topic to some extent. The probabilistic models behind Giza++ are fairly non-trivial, see: http://www.ee.columbia.edu/~sfchang/course/svia/papers/brown-machine-translate-93.pdf

There are a lot of existing approaches you could take, such as:

This is a very difficult machine learning problem, and while it's not impossible that simple approaches such as yours could work, it might be a good idea to study the existing work first. That being said, we have seen quite a few breakthroughs from surprisingly simple techniques in this field, so who knows.

Answer 2

Score: 5

I highly recommend testing Awesome-Align. It relies on multilingual BERT (mBERT) and the results look very promising. I even tested it with Arabic, and it did a great job on a difficult alignment example; since Arabic is a morphologically rich language, I believe it is more challenging than a Latin-based language such as German.

[image: Arabic–English word alignment produced by Awesome-Align]

As you can see, one word in Arabic corresponds to multiple words in English, and yet Awesome-Align managed to handle the many-to-many mapping to a great extent. You may give it a try and I believe it will meet your needs.

There is also a Google Colab demo at https://colab.research.google.com/drive/1205ubqebM0OsZa1nRgbGJBtitgHqIVv6?usp=sharing#scrollTo=smW6s5JJflCN

Good luck!

Answer 3

Score: 4

Recently, there were also two papers that use bi-/multilingual word/contextual embeddings to do word alignment. Both of them construct a bipartite graph in which word pairs are weighted by their embedding distances, and use graph algorithms to get the alignment.

One paper does a maximum matching between the two parts of the graph. Because the matching is not symmetrical, they do it from both sides and use symmetrization heuristics similar to FastAlign's.

The other one mentions the alignment only briefly; it applies a minimum-weight edge cover to the graph and uses that as the alignment.

Both of them claim to be better than FastAlign.
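
As an illustration of the general idea (not the exact method of either paper), here is a minimal sketch that takes a similarity matrix derived from embeddings, aligns by argmax in both directions, and keeps only the links on which both directions agree, a common symmetrization heuristic:

import numpy as np

def symmetrized_alignment(sim):
    # sim is a (len(src) x len(trg)) similarity matrix, e.g. cosine similarities
    # between embeddings of source and target words.
    sim = np.asarray(sim)
    # Forward direction: for each source word, its best target word.
    fwd = {(i, int(sim[i].argmax())) for i in range(sim.shape[0])}
    # Backward direction: for each target word, its best source word.
    bwd = {(int(sim[:, j].argmax()), j) for j in range(sim.shape[1])}
    # The intersection keeps only links both directions agree on (high precision, lower recall).
    return sorted(fwd & bwd)

example = [
    [0.9, 0.2, 0.1],
    [0.1, 0.8, 0.3],
    [0.2, 0.3, 0.7],
]
print(symmetrized_alignment(example))  # [(0, 0), (1, 1), (2, 2)]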

Answer 4

Score: 1

As the question is specifically addressing Python implementations, and Giza++ and FastAlign still seem to represent SOTA, one might look into:

  • https://pypi.org/project/systran-align/: replicates FastAlign and seems relatively mature. Note also that the original FastAlign code includes a Python wrapper (https://github.com/clab/fast_align/blob/master/src/force_align.py).
  • https://www.nltk.org/api/nltk.align.html: replicates most of the GIZA models (a good compromise between performance and quality is IBM4). However, it is unclear how well tested and maintained it is, as people normally prefer to work with GIZA++ directly.

Most research code on the topic will nowadays come in Python and be based on embeddings, e.g., https://github.com/cisnlp/simalign, https://github.com/neulab/awesome-align, etc. However, the jury is still out on whether they outperform the older models and, if so, for which applications. In the end, you need to find a compromise between context awareness (reordering!), precision, recall and runtime. Neural models have great potential for being more context aware, while statistical models have more predictable behavior.
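
For the embedding-based route, here is a minimal sketch of how simalign (linked above) can be called from Python. The constructor arguments follow the project's documented example, but treat them as an assumption and verify against the current README:

# pip install simalign
from simalign import SentenceAligner

# Embedding-based aligner built on multilingual BERT; the exact argument
# values below are taken from the project's example and may change.
aligner = SentenceAligner(model="bert", token_type="bpe", matching_methods="mai")

# Both sentences must already be tokenized into words.
src_sentence = ["I", "know", "this"]
trg_sentence = ["Ich", "kenne", "das"]

# Returns a dict keyed by matching method; each value is a list of
# zero-indexed (src_index, trg_index) alignment pairs.
alignments = aligner.get_word_aligns(src_sentence, trg_sentence)
for method, pairs in alignments.items():
    print(method, pairs)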
