Cosine similarity between two arrays for word embeddings
Question
I am trying to find the most similar word embeddings between two arrays. Say the first array A has dimension [100, 50], where 100 is the number of words and 50 is the embedding dimension. I have many other words stored in an array B of dimension [400000, 50]. I want to find the top 10 (most similar, i.e., highest cosine similarity) words in B for each word from A. In other words, for each word in A, I want to find the 10 words in B with the highest cosine similarity.
I solved it using two for loops, but I want to know if there is a faster way, since my method takes a long time as the number of samples in A grows; any trick or advice would be helpful. I am using the cosine_similarity function from torch; if there is another, faster choice, that would also be great. I have tried the solution posted here, but I would like to know if there is a better one, as that solution is five years old. Thanks in advance.
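For reference, here is a minimal sketch of the kind of two-loop baseline described above (the tensors, sizes, and the top10 name are illustrative assumptions, not code from the original post):

    import torch
    from torch.nn import functional as F

    # Illustrative data; the post uses shapes [100, 50] and [400000, 50].
    A = torch.randn(100, 50)
    B = torch.randn(4000, 50)

    top10 = []
    for a in A:
        # Compare one word of A against every word of B, one pair at a time.
        sims = torch.stack([F.cosine_similarity(a, b, dim=0) for b in B])
        top10.append(sims.topk(10).indices)  # indices of the 10 nearest rows of B

The inner Python loop over every row of B is what makes this approach slow as A grows.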
Answer 1
Score: 0
A friend shared his solution; there is a much faster way to do it. This is the code:
    import torch
    from torch import Tensor

    def cosine_similarity(x: Tensor, y: Tensor) -> Tensor:
        # Broadcast x ([n, d, 1]) against y.T ([d, m]) and reduce over the
        # embedding dimension d, yielding an [n, m] similarity matrix.
        return torch.cosine_similarity(x[..., None], y.T, dim=-2)
This way the output is a similarity matrix, and there is no need to loop over either dataset.
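As a hedged usage sketch (the tensors and sizes below are placeholders, not from the original post), the resulting matrix can be fed to torch.topk to get the 10 best matches per word:

    import torch

    A = torch.randn(100, 50)   # placeholder query embeddings
    B = torch.randn(4000, 50)  # placeholder candidate embeddings

    # [100, 4000] cosine-similarity matrix, computed without Python loops.
    sims = torch.cosine_similarity(A[..., None], B.T, dim=-2)

    # Top-10 highest-similarity rows of B for each row of A.
    values, indices = sims.topk(10, dim=1)
    print(indices.shape)  # torch.Size([100, 10])

Note that the broadcasted call materializes an [n, d, m] intermediate, so with the full 400,000-row B it may be necessary to process A in chunks to stay within memory.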