Creating a "distance" dataframe from a pivot table
Question
Alright, so I have a relationship table (think person has-an item) which I have successfully pivoted into a boolean matrix. I want to go the next step and find a matrix for the 'distance' between any two given users.
I know that for a one-off distance I can essentially do the following:
df = ...
df["val"] = 1
pivot = df.pivot(index="person", columns="hasa", values="val").fillna(0)
# compute one difference (people are on the index after the pivot, so select rows with .loc)
(pivot.loc["Alice"] - pivot.loc["Carol"]).abs().sum()
I have no idea how to go from here to a full dataframe though.
Initial table
| person | hasa |
|---|---|
| Alice | Apple |
| Bob | Banana |
| Carol | Carrot |
| Bob | Apple |
Pivot table
| | Apple | Banana | Carrot |
|---|---|---|---|
| Alice | 1 | 0 | 0 |
| Bob | 1 | 1 | 0 |
| Carol | 0 | 0 | 1 |
Goal table
| | Alice | Bob | Carol |
|---|---|---|---|
| Alice | 0 | 1 | 2 |
| Bob | 1 | 0 | 3 |
| Carol | 2 | 3 | 0 |
Answer 1
Score: 2
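You can compute the "pivot" directly with crosstab, then compute the distances with scipy.spatial.distance.cdist. With the default "euclidean" metric you only need to square the result: for 0/1 entries, the squared Euclidean distance equals the number of items on which two people differ.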
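import pandas as pd
from scipy.spatial.distance import cdist
# person x item count table (0/1 when each pair occurs at most once)
pivot = pd.crosstab(df['person'], df['hasa'])
# squared Euclidean distance between every pair of rows
out = pd.DataFrame(cdist(pivot, pivot)**2,
                   index=pivot.index,
                   columns=pivot.index,
                   )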
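As a cross-check on the squaring trick, here is a small self-contained sketch on the sample data from the question. It assumes the pivot holds only 0/1 entries (no duplicated person/item pairs). Under that assumption SciPy's "sqeuclidean" and "cityblock" metrics should give the same matrix, so either can stand in for squaring by hand. The helper name `pairwise` is made up for this example.

```python
import pandas as pd
from scipy.spatial.distance import cdist

# Sample data from the question
df = pd.DataFrame({
    "person": ["Alice", "Bob", "Carol", "Bob"],
    "hasa": ["Apple", "Banana", "Carrot", "Apple"],
})
pivot = pd.crosstab(df["person"], df["hasa"])


def pairwise(metric):
    """Hypothetical helper: labelled person-by-person distance matrix."""
    values = cdist(pivot.to_numpy(), pivot.to_numpy(), metric=metric)
    return pd.DataFrame(values, index=pivot.index, columns=pivot.index)


dist_sq = pairwise("sqeuclidean")  # squared Euclidean, no manual **2 needed
dist_l1 = pairwise("cityblock")    # sum of absolute differences
print(dist_l1)
```

If the table can contain counts above 1, the two metrics no longer coincide, and "cityblock" is the one that matches the `.abs().sum()` expression in the question.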
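Alternatively, you can use corr with a custom function for the second step. Note that the pandas documentation states that corr puts 1 on the diagonal regardless of what the callable returns, so reset the diagonal to 0 afterwards if you need the exact goal table: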
out = pivot.T.corr(lambda a,b: sum(a!=b))
Or manually with numpy broadcasting (memory-expensive for large inputs; the result is a plain array rather than a DataFrame):
a = pivot.to_numpy()
out = (a[..., None] != a.T).sum(axis=1)
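The broadcast comparison builds an intermediate of shape (people, items, people) and returns an unlabelled array. Below is a sketch of a different technique that avoids the 3-D intermediate: for strictly 0/1 data (an assumption here), the number of differing items is "items i has that j lacks" plus the reverse, and each of those is a matrix product. The variable names are illustrative only.

```python
import numpy as np
import pandas as pd

# Sample data from the question
df = pd.DataFrame({
    "person": ["Alice", "Bob", "Carol", "Bob"],
    "hasa": ["Apple", "Banana", "Carrot", "Apple"],
})
pivot = pd.crosstab(df["person"], df["hasa"])
a = pivot.to_numpy()

# Broadcast version from the answer, kept for comparison
diff_broadcast = (a[..., None] != a.T).sum(axis=1)

# only_in_row[i, j] = number of items person i has that person j lacks
only_in_row = a @ (1 - a).T
# Adding the transpose counts the items j has that i lacks
diff_matmul = only_in_row + only_in_row.T

# Re-attach the person labels
out = pd.DataFrame(diff_matmul, index=pivot.index, columns=pivot.index)
print(out)
```

The matrix-product form only ever holds (people, people) and (people, items) arrays, which is why it tends to be gentler on memory. It is not equivalent to the broadcast version once entries exceed 1.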
Output:
person  Alice  Bob  Carol
person
Alice     0.0  1.0    2.0
Bob       1.0  0.0    3.0
Carol     2.0  3.0    0.0
Answer 2
Score: 1
Let's start simple. You can calculate the distances between all rows in your pivot table by creating another DataFrame that will store these distances. The main logic here is to iterate through each row of the pivot table and calculate the distance to every other row.
import pandas as pd
# Create initial DataFrame
df = pd.DataFrame({
    'person': ['Alice', 'Bob', 'Carol', 'Bob'],
    'hasa': ['Apple', 'Banana', 'Carrot', 'Apple']
})
# Pivot DataFrame; 'size' counts the rows per (person, hasa) pair and needs no separate value column
pivot_df = pd.pivot_table(df, index='person', columns='hasa', aggfunc='size', fill_value=0)
# Create a zero-filled DataFrame for the distances (keeps an integer dtype)
distance_df = pd.DataFrame(0, index=pivot_df.index, columns=pivot_df.index)
# Fill DataFrame with distances
for person1 in pivot_df.index:
    for person2 in pivot_df.index:
        distance_df.loc[person1, person2] = (pivot_df.loc[person1] - pivot_df.loc[person2]).abs().sum()
Now distance_df should hold the right values. Note that the resulting DataFrame is symmetric about the main diagonal (the distance from Alice to Bob is the same as the distance from Bob to Alice), and the main diagonal is all zeros (the distance from a person to themselves is always zero).
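Because of that symmetry and the zero diagonal, each pair only needs to be computed once. Here is a minimal sketch of that idea using itertools.combinations. It builds the pivot with crosstab purely for brevity and reuses the name distance_df from above; treat it as an illustration, not a tuned implementation.

```python
from itertools import combinations

import pandas as pd

# Sample data from the question
df = pd.DataFrame({
    "person": ["Alice", "Bob", "Carol", "Bob"],
    "hasa": ["Apple", "Banana", "Carrot", "Apple"],
})
pivot_df = pd.crosstab(df["person"], df["hasa"])

# Start from all zeros, so the diagonal is already correct
distance_df = pd.DataFrame(0, index=pivot_df.index, columns=pivot_df.index)

# Visit each unordered pair once and mirror the value across the diagonal
for p1, p2 in combinations(pivot_df.index, 2):
    d = (pivot_df.loc[p1] - pivot_df.loc[p2]).abs().sum()
    distance_df.loc[p1, p2] = d
    distance_df.loc[p2, p1] = d

print(distance_df)
```

This roughly halves the number of row subtractions, but it is still a Python-level loop over pairs, so a vectorized approach is likely preferable for large tables.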