March 8, 2023, 16:39:35

"Why are the cosine similarities calculated by the library and by myself different?"

Question

The difference between the similarity reported by the Surprise library and your manual calculation comes from how missing ratings are handled.

Surprise's cosine similarity takes into account only the users who rated both items. User E rated item 1 but not item 2, so E's rating is left out of the calculation entirely; nothing is imputed.

In your manual NumPy calculation, the missing rating was replaced with 0. That zero adds nothing to the dot product, but E's rating of 3 still enlarges the norm of the first vector, which pulls the similarity down to about 0.855.

To reproduce the Surprise value, drop user E and compute the cosine similarity on the remaining four ratings of each item.

I'm building a book recommendation system and want to use the KNN algorithm for collaborative filtering. I believe I understand how KNN works, and I want to use the item-based approach, which requires calculating the similarity between item vectors. However, the similarity calculated by the library differs from the one I calculated myself, and I'm not sure why. Can you help me out?

import pandas as pd
from surprise import Dataset, Reader, KNNWithMeans

# Build the ratings DataFrame
ratings_dict = {
    "item": [1, 2, 1, 2, 1, 2, 1, 2, 1],
    "user": ['A', 'A', 'B', 'B', 'C', 'C', 'D', 'D', 'E'],
    "rating": [1, 2, 2, 4, 2.5, 4, 4.5, 5, 3],
}
df = pd.DataFrame(ratings_dict)

# Convert to the dataset format used by the Surprise library
reader = Reader(rating_scale=(1, 5))
data = Dataset.load_from_df(df[['user', 'item', 'rating']], reader)

# Compute the similarity matrix (item-based)
sim_options = {'name': 'cosine', 'user_based': False}
algo = KNNWithMeans(sim_options=sim_options)
trainingSet = data.build_full_trainset()
algo.fit(trainingSet)
similarity_matrix = algo.compute_similarities()
print(similarity_matrix)

This code prints:

[[1.         0.96954671]
 [0.96954671 1.        ]]

The ratings, laid out as a user-by-item matrix:

item    1    2
user          
A     1.0  2.0
B     2.0  4.0
C     2.5  4.0
D     4.5  5.0
E     3.0  NaN

But when I compute it myself:

import numpy as np

# Define the two item vectors (the missing rating is filled with 0)
vector1 = np.array([1, 2, 2.5, 4.5, 3])
vector2 = np.array([2, 4, 4, 5, 0])

# Compute the cosine similarity
cosine_sim_1 = np.dot(vector1, vector2) / (np.linalg.norm(vector1) * np.linalg.norm(vector2))
print(cosine_sim_1)

this code prints:

0.8550598237348973

I think the Surprise library filled the NaN value with something other than 0. I expected it to use 0, but it seems another value was used instead.

I tried ChatGPT, but it couldn't help me solve the issue.

Answer 1

Score: 2

import numpy as np

vector1 = np.array([1, 2, 2.5, 4.5])
vector2 = np.array([2, 4, 4, 5])
# Calculate the cosine similarity
cosine_sim_1 = np.dot(vector1, vector2) / (np.linalg.norm(vector1) * np.linalg.norm(vector2))
print(cosine_sim_1)

The first part of your code just calculates the cosine similarity of the 4D vectors, omitting the last values, one of which is NaN.
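To avoid trimming the vectors by hand, the same idea can be sketched with a mask over the users who rated both items. This is only an illustration, assuming missing ratings are stored as NaN; `cosine_on_common` is a hypothetical helper written for this example, not a function from Surprise.

```python
import numpy as np

def cosine_on_common(u, v):
    """Cosine similarity over the positions where both vectors have a rating."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    mask = ~np.isnan(u) & ~np.isnan(v)  # users who rated both items
    if not mask.any():
        return 0.0  # assumption for this sketch: no overlap means similarity 0
    u, v = u[mask], v[mask]
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

item1 = [1, 2, 2.5, 4.5, 3]
item2 = [2, 4, 4, 5, np.nan]  # user E did not rate item 2

# Only users A-D are compared, as in the 4D calculation above
sim_common = cosine_on_common(item1, item2)

# Filling the gap with 0 keeps user E in the norm of item1
sim_zero_filled = cosine_on_common(item1, np.nan_to_num(item2, nan=0.0))

print(round(sim_common, 4))       # 0.9695
print(round(sim_zero_filled, 4))  # 0.8551
```

The first value agrees with the matrix printed in the question, and the second with the zero-filled NumPy result, which suggests the gap comes from user E's rating alone.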

