Cosine similarity between two arrays for word embeddings
Question
I am trying to find the most similar word embeddings between two arrays. Say the first array A has dimension [100, 50], where 100 is the number of words and 50 is the embedding dimension. I have many other words stored in an array B of dimension [400000, 50]. I want to find the top 10 (most similar, i.e., highest cosine similarity) words in B for each word from A. In other words, for each word in A, I want to find the 10 words in B with the highest cosine similarity.
I solved it using two for loops, but I want to know if there is a faster way, since my method takes a long time as the number of samples in A grows; any trick or advice would be helpful. I am using the cosine_similarity function from torch; if there is another, faster choice, that would also be great. I have tried the solution posted here, but I would like to know if there is a better one, as that solution is five years old. Thanks in advance.
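For reference, here is a minimal sketch of the kind of two-loop baseline described above (the tensors, sizes, and the top10 name are illustrative assumptions, not code from the original post):

    import torch
    from torch.nn import functional as F

    # Illustrative data; the post uses shapes [100, 50] and [400000, 50].
    A = torch.randn(100, 50)
    B = torch.randn(4000, 50)

    top10 = []
    for a in A:
        # Compare one word of A against every word of B, one pair at a time.
        sims = torch.stack([F.cosine_similarity(a, b, dim=0) for b in B])
        top10.append(sims.topk(10).indices)  # indices of the 10 nearest rows of B

The inner Python loop over every row of B is what makes this approach slow as A grows.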
Answer 1
Score: 0
A friend shared his solution; there is a much faster way to do it. This is the code:
    import torch
    from torch import Tensor

    def cosine_similarity(x: Tensor, y: Tensor) -> Tensor:
        # Broadcast x ([n, d, 1]) against y.T ([d, m]) and reduce over the
        # embedding dimension d, yielding an [n, m] similarity matrix.
        return torch.cosine_similarity(x[..., None], y.T, dim=-2)
This way the output is a similarity matrix, and there is no need to loop over either dataset.
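As a hedged usage sketch (the tensors and sizes below are placeholders, not from the original post), the resulting matrix can be fed to torch.topk to get the 10 best matches per word:

    import torch

    A = torch.randn(100, 50)   # placeholder query embeddings
    B = torch.randn(4000, 50)  # placeholder candidate embeddings

    # [100, 4000] cosine-similarity matrix, computed without Python loops.
    sims = torch.cosine_similarity(A[..., None], B.T, dim=-2)

    # Top-10 highest-similarity rows of B for each row of A.
    values, indices = sims.topk(10, dim=1)
    print(indices.shape)  # torch.Size([100, 10])

Note that the broadcasted call materializes an [n, d, m] intermediate, so with the full 400,000-row B it may be necessary to process A in chunks to stay within memory.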