Creating a "distance" dataframe from a pivot table

Question

I have a relationship table (think "person has an item") that I have successfully pivoted into a boolean matrix. As a next step, I want to build a matrix of the "distance" between any two given users.

I know that for a one-off distance I can essentially do the following:

df = ...
df["val"] = 1

pivot = df.pivot(index="person", columns="hasa", values="val").fillna(0)

# compute one difference (people are in the index, so select rows with .loc)
(pivot.loc["Alice"] - pivot.loc["Carol"]).abs().sum()

However, I have no idea how to go from here to a full dataframe.


Initial table

person  hasa
Alice   Apple
Bob     Banana
Carol   Carrot
Bob     Apple

Pivot table

       Apple  Banana  Carrot
Alice      1       0       0
Bob        1       1       0
Carol      0       0       1

Goal table

       Alice  Bob  Carol
Alice      0    1      2
Bob        1    0      3
Carol      2    3      0

Answer 1

Score: 2

You can compute the "pivot" directly with crosstab, then compute the distances with scipy.spatial.distance.cdist using its default "euclidean" metric. You only need to square the result: for 0/1 values, (a - b)**2 equals |a - b|, so the squared Euclidean distance is the number of items on which two people differ:

import pandas as pd
from scipy.spatial.distance import cdist

pivot = pd.crosstab(df['person'], df['hasa'])

out = pd.DataFrame(cdist(pivot, pivot)**2,
                   index=pivot.index,
                   columns=pivot.index,
                  )
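As a variation on the call above, cdist also accepts metric="cityblock" (Manhattan distance), which is the same sum of absolute differences used in the question, so no squaring is needed. The following is only a sketch: the sample data is rebuilt from the question's initial table, and the name dist is just for illustration.

```python
import pandas as pd
from scipy.spatial.distance import cdist

# Sample data copied from the question's initial table
df = pd.DataFrame({
    "person": ["Alice", "Bob", "Carol", "Bob"],
    "hasa": ["Apple", "Banana", "Carrot", "Apple"],
})

# Rows are people, columns are items, values are counts (0 or 1 here)
pivot = pd.crosstab(df["person"], df["hasa"])

# "cityblock" sums |a - b| over the item columns for every pair of rows
dist = cdist(pivot.to_numpy(), pivot.to_numpy(), metric="cityblock").astype(int)

out = pd.DataFrame(dist, index=pivot.index, columns=pivot.index)
print(out)
```

Unlike the squared Euclidean trick, this keeps matching abs().sum() even if some count in the pivot were larger than 1.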

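Alternatively, you can use corr with a custom function for the second step. Note that corr always sets the diagonal of its result to 1, whatever the function returns, so the self-distances have to be reset to 0 afterwards:

out = pivot.T.corr(lambda a, b: (a != b).sum())
out = out.where(out.index.to_numpy()[:, None] != out.columns.to_numpy(), 0)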

Or manually with numpy broadcasting (memory-expensive for large inputs, and the result is a plain array rather than a DataFrame):

a = pivot.to_numpy()
out = (a[..., None] != a.T).sum(axis=1)
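For larger inputs, one way to avoid the (n, m, n) intermediate array created by the broadcasting above is a pair of matrix products. This is a sketch that assumes the pivot should be treated as 0/1 (it is clipped explicitly below); the names only_i and mismatches are illustrative and the sample data is taken from the question.

```python
import pandas as pd

# Sample data copied from the question's initial table
df = pd.DataFrame({
    "person": ["Alice", "Bob", "Carol", "Bob"],
    "hasa": ["Apple", "Banana", "Carrot", "Apple"],
})
pivot = pd.crosstab(df["person"], df["hasa"])

# Clip to 0/1 so the identity below holds even if a pair appears twice
a = (pivot.to_numpy() > 0).astype(int)

# only_i[i, j] counts the items that person i has and person j lacks
only_i = a @ (1 - a).T

# Adding the transpose counts the items that j has and i lacks
mismatches = only_i + only_i.T

out = pd.DataFrame(mismatches, index=pivot.index, columns=pivot.index)
print(out)
```

The largest temporary here has shape (n, n), at the cost of only being valid for binary data.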

Output (from the cdist version):

person  Alice  Bob  Carol
person                   
Alice     0.0  1.0    2.0
Bob       1.0  0.0    3.0
Carol     2.0  3.0    0.0

Answer 2

Score: 1

Let's start simple. You can calculate the distances between all rows in your pivot table by creating another DataFrame that will store these distances. The main logic here is to iterate through each row of the pivot table and calculate the distance to every other row.

import pandas as pd

# Create initial DataFrame
df = pd.DataFrame({
    'person': ['Alice', 'Bob', 'Carol', 'Bob'],
    'hasa': ['Apple', 'Banana', 'Carrot', 'Apple']
})

# Pivot DataFrame; aggfunc='size' counts the rows per (person, hasa) pair,
# which works even though df has no separate value column to aggregate
pivot_df = pd.pivot_table(df, index='person', columns='hasa', aggfunc='size', fill_value=0)

# Create a zero-filled integer DataFrame for the distances
distance_df = pd.DataFrame(0, index=pivot_df.index, columns=pivot_df.index)

# Fill DataFrame with distances
for person1 in pivot_df.index:
    for person2 in pivot_df.index:
        distance_df.loc[person1, person2] = (pivot_df.loc[person1] - pivot_df.loc[person2]).abs().sum()

Now, distance_df should have the right values. Note that the resulting DataFrame is symmetric about the main diagonal (because the distance from Alice to Bob is the same as the distance from Bob to Alice). The main diagonal is filled with zeros (since the distance from a person to themselves is always zero).
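Because of that symmetry, the loop only needs to visit each unordered pair once. A minimal sketch of this idea, assuming the same sample data and using itertools.combinations (crosstab is used here for the pivot, and the variable d is illustrative):

```python
from itertools import combinations

import pandas as pd

# Sample data copied from the question's initial table
df = pd.DataFrame({
    "person": ["Alice", "Bob", "Carol", "Bob"],
    "hasa": ["Apple", "Banana", "Carrot", "Apple"],
})
pivot_df = pd.crosstab(df["person"], df["hasa"])

# Start from all zeros, which already covers the main diagonal
distance_df = pd.DataFrame(0, index=pivot_df.index, columns=pivot_df.index)

# Visit each unordered pair once and mirror the value across the diagonal
for person1, person2 in combinations(pivot_df.index, 2):
    d = int((pivot_df.loc[person1] - pivot_df.loc[person2]).abs().sum())
    distance_df.loc[person1, person2] = d
    distance_df.loc[person2, person1] = d

print(distance_df)
```

This roughly halves the number of row comparisons, although it is still a Python-level loop and will be slow for many people.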

huangapple
  • Posted on June 29, 2023, 06:52:35
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76577150.html