Cosine similarity between two arrays for word embeddings

Question

I am trying to find the most similar word embeddings between two arrays. The first array, A, has shape [100, 50], where 100 is the number of words and 50 is the embedding dimension. The second array, B, holds many more words and has shape [400000, 50]. For each word in A, I want to find the 10 words in B with the highest cosine similarity.

I solved this with two for loops, but that becomes slow as the number of samples in A grows, so I would like to know whether there is a faster way. I am using the cosine_similarity function from torch; a faster alternative would also be welcome. I have tried the solution posted here, but it is 5 years old and I would like to know whether there is a better one. Thanks in advance.

Answer 1

Score: 0

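A friend shared a solution that is much faster. It broadcasts the two arrays against each other so that a single call computes every pairwise similarity:

    import torch
    from torch import Tensor

    def cosine_similarity(x: Tensor, y: Tensor) -> Tensor:
        return torch.cosine_similarity(x[..., None], y.T, dim=-2)

For x of shape [100, 50] and y of shape [400000, 50], the output is a [100, 400000] matrix of cosine similarities, so there is no need to loop over either dataset.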
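The function above returns the full similarity matrix, while the question asks for the 10 best matches per word. A minimal sketch of that last step follows; it is not part of the friend's solution. It assumes the rows are L2-normalized first, so that a matrix product gives the cosine similarities, and then uses torch.topk. This may also need less memory than the broadcast form, which can build an intermediate of shape [n, d, m]. The name top_k_cosine and the chunk_size parameter are hypothetical, chosen for illustration only.

```python
import torch
import torch.nn.functional as F
from torch import Tensor


def top_k_cosine(
    a: Tensor, b: Tensor, k: int = 10, chunk_size: int = 1024
) -> tuple[Tensor, Tensor]:
    """Return (scores, indices), each [n, k]: the k rows of b closest to each row of a."""
    # L2-normalize rows so that a plain dot product equals cosine similarity.
    a_norm = F.normalize(a, dim=1)
    b_norm = F.normalize(b, dim=1)

    scores, indices = [], []
    # Process a in chunks so the [chunk, m] similarity matrix stays bounded in size.
    for start in range(0, a_norm.shape[0], chunk_size):
        sims = a_norm[start:start + chunk_size] @ b_norm.T  # [chunk, m]
        top = torch.topk(sims, k=k, dim=1)  # sorted, highest similarity first
        scores.append(top.values)
        indices.append(top.indices)
    return torch.cat(scores), torch.cat(indices)


if __name__ == "__main__":
    torch.manual_seed(0)
    A = torch.randn(100, 50)    # query embeddings, as in the question
    B = torch.randn(4_000, 50)  # small stand-in for the 400,000-row array
    top_scores, top_idx = top_k_cosine(A, B, k=10)
    print(top_scores.shape, top_idx.shape)
```

Row i of the returned indices lists the positions in B of the 10 most similar words to word i of A, best match first. If A and B both fit on a GPU, moving them there before the call should speed up the matrix product further; whether chunking is needed at all depends on the available memory.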

Posted by huangapple on 2023-06-26 20:55:11
Source: https://go.coder-hub.com/76556898.html