2023-06-29 06:52:35

Creating a "distance" dataframe from a pivot table

Question

Alright, so I have a relationship table (think person has-an item) which I have successfully pivoted into a boolean matrix. I want to go the next step and find a matrix for the 'distance' between any two given users.

I know that for a one-off distance I can essentially do the following:

df = ...
df["val"] = 1
pivot = df.pivot(index="person", columns="hasa", values="val").fillna(0)
# compute one difference (people are rows of the pivot, so select them with .loc)
(pivot.loc["Alice"] - pivot.loc["Carol"]).abs().sum()

I have no idea how to go from here to a full dataframe though.

Initial table

person	hasa
Alice	Apple
Bob	Banana
Carol	Carrot
Bob	Apple

Pivot Table

	Apple	Banana	Carrot
Alice	1	0	0
Bob	1	1	0
Carol	0	0	1

Goal Table

	Alice	Bob	Carol
Alice	0	1	2
Bob	1	0	3
Carol	2	3	0

Answer 1

Score: 2

You can compute the "pivot" directly with crosstab, then compute the distances with scipy.spatial.distance.cdist using the default "euclidean" metric. As long as the entries are 0/1, squaring the Euclidean distance gives the number of differing items:

import pandas as pd
from scipy.spatial.distance import cdist

pivot = pd.crosstab(df['person'], df['hasa'])
out = pd.DataFrame(cdist(pivot, pivot)**2,
                   index=pivot.index,
                   columns=pivot.index,
                  )

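Alternatively, you can use corr with a custom function for the second step. Note that corr with a callable always puts 1 on the diagonal, so it has to be reset to 0 afterwards:

import numpy as np

out = pivot.T.corr(lambda a, b: (a != b).sum())
out = out - np.eye(len(out))  # corr forces the diagonal to 1; a distance needs 0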

Or manually with NumPy broadcasting (memory-expensive for a large input, since it builds an n × m × n intermediate array for n people and m items):

a = pivot.to_numpy()
out = (a[..., None] != a.T).sum(axis=1)

Output:

person  Alice  Bob  Carol
person                   
Alice     0.0  1.0    2.0
Bob       1.0  0.0    3.0
Carol     2.0  3.0    0.0
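As a variation on the cdist approach above, here is a minimal end-to-end sketch that asks cdist for the "cityblock" (Manhattan) metric, which is the sum of absolute differences the question computes by hand, so no squaring is needed. The sample data is copied from the question; the variable name dist and the cast to int are choices made for this illustration, not part of the original answer.

```python
import pandas as pd
from scipy.spatial.distance import cdist

# Sample data from the question
df = pd.DataFrame({
    "person": ["Alice", "Bob", "Carol", "Bob"],
    "hasa": ["Apple", "Banana", "Carrot", "Apple"],
})

# One row per person, one column per item, counts as values
pivot = pd.crosstab(df["person"], df["hasa"])

# "cityblock" is the sum of absolute differences between two rows
values = cdist(pivot.to_numpy(), pivot.to_numpy(), metric="cityblock")

# Wrap the raw array back into a labelled DataFrame (int cast is optional)
dist = pd.DataFrame(values.astype(int), index=pivot.index, columns=pivot.index)
print(dist)
```

Unlike the squared Euclidean trick, this should still match the hand-computed absolute difference if a person happens to own the same item more than once and the counts exceed 1.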

Answer 2

Score: 1

Let's start simple. You can calculate the distances between all rows in your pivot table by creating another DataFrame that will store these distances. The main logic here is to iterate through each row of the pivot table and calculate the distance to every other row.

import pandas as pd

# Create initial DataFrame
df = pd.DataFrame({
    'person': ['Alice', 'Bob', 'Carol', 'Bob'],
    'hasa': ['Apple', 'Banana', 'Carrot', 'Apple']
})
# Pivot DataFrame (crosstab counts person/item pairs; df has no value column for pivot_table to aggregate)
pivot_df = pd.crosstab(df['person'], df['hasa'])
# Create empty DataFrame for distances
distance_df = pd.DataFrame(index=pivot_df.index, columns=pivot_df.index)
# Fill DataFrame with distances
for person1 in pivot_df.index:
    for person2 in pivot_df.index:
        distance_df.loc[person1, person2] = (pivot_df.loc[person1] - pivot_df.loc[person2]).abs().sum()
# The empty frame starts out with object dtype; convert the result to integers
distance_df = distance_df.astype(int)

Now distance_df should have the right values. Note that the resulting DataFrame is symmetric about the main diagonal (because the distance from Alice to Bob is the same as the distance from Bob to Alice). The main diagonal is filled with zeros (since the distance from a person to themselves is always zero).
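Because the matrix is symmetric with a zero diagonal, the loop only needs to visit each unordered pair once. The following is a rough sketch of that idea using itertools.combinations; the names people and dist are made up for this example, and it assumes the same sample data as above.

```python
from itertools import combinations

import pandas as pd

# Sample data from the question
df = pd.DataFrame({
    "person": ["Alice", "Bob", "Carol", "Bob"],
    "hasa": ["Apple", "Banana", "Carrot", "Apple"],
})
pivot = pd.crosstab(df["person"], df["hasa"])

people = pivot.index
# Start from all zeros, which already covers the diagonal
dist = pd.DataFrame(0, index=people, columns=people)

# Visit each unordered pair once and mirror the value across the diagonal
for p1, p2 in combinations(people, 2):
    d = int((pivot.loc[p1] - pivot.loc[p2]).abs().sum())
    dist.loc[p1, p2] = d
    dist.loc[p2, p1] = d

print(dist)
```

This roughly halves the number of row comparisons, though for large inputs the vectorized approaches from Answer 1 are likely to be much faster than any Python-level loop.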
