July 10, 2023, 21:50:04

How to perform a pairwise correlation in a Pandas dataframe containing lists?

Question

I created a Pandas dataframe, called df2, which looks as follows when I print it:

                                                      728
100610  [0.8872128569054152, 0.6275935748500376, 0.105...
102311  [-0.9484644612008593, -1.7934280570087853, -2....
104416  [0.1664251633793124, 0.1116268791242702, 0.050...
105923  [-0.2307886056759264, -0.5762187864896702, -0....

The index on the far left (100610, 102311, 104416, 105923) lists the four subjects from which the data stem. Every subject has a time series of 150 sampling points. For example, subject 100610 has the time series [0.8872128569054152, 0.6275935748500376, 0.105... and so on.

I am running a sliding-window pairwise correlation approach (728 sliding windows). The column label 728 denotes the last sliding window in a for loop. The output above shows df2 for that last window.

Aim: I would like to run a pairwise correlation between the four subjects (between the subjects’ time-series) as follows:

pairwise_cor = df2.corr(method="pearson")

However, this produces the following empty output for pairwise_cor:

Empty DataFrame
Columns: []
Index: []

Question:
How should I modify pairwise_cor = df2.corr(method="pearson") so that it does not produce an empty output?

As far as I understand, the problem is that every row contains a list or array of values. pairwise_cor = df2.corr(method="pearson") would probably work if I could transpose the dataframe so that every column corresponds to one subject and every row to one value of the list. Is that correct? How could I reshape the dataframe accordingly?

Answer 1

Score: 2

Suppose the following dataframe:

import pandas as pd
import numpy as np
data = [[0.07717473, 0.90724758, 0.80752715, 0.04318562, 0.0569035, 0.12796062, 0.220677, 0.3716013, 0.74646015, -0.41114205],
        [-0.12252081, 0.03894384, -0.74668061, 0.00310963, 0.10716717, -0.42125924, 0.90771138, 0.10498123, 0.60872, -0.62587628],
        [-0.24917124, -0.76921359, 0.55519856, 0.56067116, -0.27319101, -0.01258496, 0.66428267, 0.53822299, -0.86883193, -0.15486245],
        [-0.14676444, 0.21910793, -0.11010598, 0.86445147, -0.92299316, -0.82828022, -0.7274392, 0.66965337, 0.67446502, -0.50343198]]
df = pd.Series(data, index=[100610, 102311, 104416, 105923]).to_frame(728)
print(df)
# Output
                                                      728
100610  [0.07717473, 0.90724758, 0.80752715, 0.0431856...
102311  [-0.12252081, 0.03894384, -0.74668061, 0.00310...
104416  [-0.24917124, -0.76921359, 0.55519856, 0.56067...
105923  [-0.14676444, 0.21910793, -0.11010598, 0.86445...

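Because each cell holds a list rather than a number, stack the lists into a 2-D array and pass it to np.corrcoef, which by default treats each row as one variable: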

>>> np.corrcoef(np.vstack(df.loc[:, 728]))
array([[ 1.        ,  0.19332151, -0.23634425,  0.40882581],
       [ 0.19332151,  1.        , -0.08039743,  0.15825333],
       [-0.23634425, -0.08039743,  1.        , -0.02293406],
       [ 0.40882581,  0.15825333, -0.02293406,  1.        ]])
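
If you would rather keep the subject IDs as labels and stay with DataFrame.corr, the reshaping described in the question should also work. The following is only a minimal sketch: df2 is rebuilt here from random numbers as a stand-in for the real data, and the name wide is made up for illustration. The empty result in the question is most likely because the single column has object dtype, so corr() finds no numeric columns to correlate.

```python
import numpy as np
import pandas as pd

# Stand-in for df2 (assumption): one column labelled 728 whose cells are
# lists of 150 samples, one list per subject. Values are random, not real data.
rng = np.random.default_rng(0)
subjects = [100610, 102311, 104416, 105923]
series = [rng.normal(size=150).tolist() for _ in subjects]
df2 = pd.Series(series, index=subjects).to_frame(728)

# Expand the list cells into numeric columns, then transpose so that
# each column is one subject and each row is one sampling point.
wide = pd.DataFrame(df2[728].tolist(), index=df2.index).T

# With purely numeric columns, corr() returns a labelled 4 x 4 matrix.
pairwise_cor = wide.corr(method="pearson")
print(pairwise_cor)
```

The values should match what np.corrcoef returns for the same data. The difference is that the rows and columns of pairwise_cor carry the subject IDs, which may be convenient when collecting results across sliding windows.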
