"Why are the cosine similarities calculated by the library and by myself different?"

Question

The difference between Surprise's similarity value and your manual calculation comes from how the library handles missing ratings in the item-based similarity computation.

When Surprise computes cosine similarity, it only takes into account the users who rated both items. It does not fill missing ratings with 0 or with a mean. User E rated item 1 but not item 2, so E is left out of both vectors and the similarity is computed over users A to D only.

In your manual NumPy calculation you replaced the missing rating with 0. That leaves the dot product unchanged, but E's rating of 3 still contributes to the norm of vector1, so the denominator grows and the score drops to about 0.855. To match Surprise, drop every user who has not rated both items before computing the cosine similarity.
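One way to check which treatment of the missing rating reproduces Surprise's 0.96954671 is to compute the cosine similarity under each candidate. The sketch below uses only NumPy; the `cosine` helper and the variable names are illustrative and not part of Surprise.

```python
import numpy as np

def cosine(a, b):
    """Plain cosine similarity between two 1-D arrays."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

item1 = np.array([1, 2, 2.5, 4.5, 3])   # ratings of item 1 by users A..E
item2_known = np.array([2, 4, 4, 5])    # ratings of item 2 by users A..D

# Candidate 1: fill user E's missing rating with 0
zero_filled = cosine(item1, np.append(item2_known, 0.0))

# Candidate 2: fill it with user E's mean rating (E rated only item 1, so 3.0)
mean_filled = cosine(item1, np.append(item2_known, 3.0))

# Candidate 3: drop user E and compare only users who rated both items
common_only = cosine(item1[:4], item2_known)

print(zero_filled, mean_filled, common_only)
```

Only the third candidate agrees with the value Surprise printed. The mean-filled variant comes out near 0.967, which is close but not equal.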

I'm building a book recommendation system and want to use the KNN algorithm for collaborative filtering. I believe I understand how KNN works, and I want to take an item-based approach, which requires computing the similarity between item vectors. However, the similarity computed by the library differs from the one I computed myself, and I'm not sure why. Can you help me out?

import pandas as pd
from surprise import Dataset, Reader, KNNWithMeans

# Build the ratings DataFrame
ratings_dict = {
    "item": [1, 2, 1, 2, 1, 2, 1, 2, 1],
    "user": ['A', 'A', 'B', 'B', 'C', 'C', 'D', 'D', 'E'],
    "rating": [1, 2, 2, 4, 2.5, 4, 4.5, 5, 3],
}
df = pd.DataFrame(ratings_dict)

# Convert to the dataset format used by the Surprise library
reader = Reader(rating_scale=(1, 5))
data = Dataset.load_from_df(df[['user', 'item', 'rating']], reader)

# Compute the similarity matrix (item-based)
sim_options = {'name': 'cosine', 'user_based': False}
algo = KNNWithMeans(sim_options=sim_options)
trainingSet = data.build_full_trainset()
algo.fit(trainingSet)

similarity_matrix = algo.compute_similarities()
print(similarity_matrix)

This code prints:

[[1. 0.96954671]
[0.96954671 1. ]]

The ratings as a user-by-item table:

item    1    2
user          
A     1.0  2.0
B     2.0  4.0
C     2.5  4.0
D     4.5  5.0
E     3.0  NaN
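A table like the one above can be produced from the long-format ratings with pandas. This is a minimal sketch that assumes the same `ratings_dict` as in the question; `DataFrame.pivot` leaves NaN wherever a user has not rated an item.

```python
import pandas as pd

ratings_dict = {
    "item": [1, 2, 1, 2, 1, 2, 1, 2, 1],
    "user": ['A', 'A', 'B', 'B', 'C', 'C', 'D', 'D', 'E'],
    "rating": [1, 2, 2, 4, 2.5, 4, 4.5, 5, 3],
}
df = pd.DataFrame(ratings_dict)

# One row per user, one column per item; missing ratings become NaN
pivot = df.pivot(index="user", columns="item", values="rating")
print(pivot)
```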

But when I compute the cosine similarity myself with NumPy:

import numpy as np

# Define the two item vectors (user E's missing rating for item 2 is set to 0)
vector1 = np.array([1, 2, 2.5, 4.5, 3])
vector2 = np.array([2, 4, 4, 5, 0])

# Compute the cosine similarity
cosine_sim_1 = np.dot(vector1, vector2) / (np.linalg.norm(vector1) * np.linalg.norm(vector2))

print(cosine_sim_1)

This code prints:

0.8550598237348973

I think the Surprise library filled the NaN value with something other than 0. I expected it to use 0, but it seems another value was used instead.

I tried ChatGPT, but it couldn't help me solve the issue.

Answer 1

Score: 2

import numpy as np

# Keep only users A to D, who rated both items
vector1 = np.array([1, 2, 2.5, 4.5])
vector2 = np.array([2, 4, 4, 5])

# Compute the cosine similarity
cosine_sim_1 = np.dot(vector1, vector2) / (np.linalg.norm(vector1) * np.linalg.norm(vector2))
print(cosine_sim_1)

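The first part of your code (the Surprise call) effectively computes the cosine similarity of the 4-dimensional vectors, omitting the last pair of values, one of which is NaN. Surprise's cosine similarity only takes into account users who rated both items, so user E is dropped.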
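The same restriction to common users can be written more generally by masking out NaN entries before taking the dot product and the norms. The sketch below is not Surprise's own implementation; `cosine_common` is a hypothetical helper, and returning 0 when two items share no users is an assumption made here for illustration.

```python
import numpy as np

def cosine_common(a, b):
    """Cosine similarity over positions where both vectors have a rating.

    Hypothetical helper for illustration; NaN marks a missing rating.
    """
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    both = ~np.isnan(a) & ~np.isnan(b)   # users who rated both items
    if not both.any():
        return 0.0                       # assumption: no common users -> 0
    a, b = a[both], b[both]
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

item1 = [1, 2, 2.5, 4.5, 3]
item2 = [2, 4, 4, 5, np.nan]   # user E has no rating for item 2
print(cosine_common(item1, item2))
```

With the ratings from the question this agrees with the 0.96954671 that Surprise reports, because user E is excluded from both vectors.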

Posted by huangapple on March 8, 2023 at 16:39:35
Source: https://go.coder-hub.com/75670883.html