How to perform a pairwise correlation in a Pandas dataframe containing lists?

Question

I created a Pandas dataframe, called df2, which looks as follows when I print it:

                                                      728
100610  [0.8872128569054152, 0.6275935748500376, 0.105...
102311  [-0.9484644612008593, -1.7934280570087853, -2....
104416  [0.1664251633793124, 0.1116268791242702, 0.050...
105923  [-0.2307886056759264, -0.5762187864896702, -0....

The index on the far left (100610, 102311, 104416, 105923) identifies the four subjects the data come from. Every subject has a time series of 150 sampling points. For example, subject 100610 has the time series [0.8872128569054152, 0.6275935748500376, 0.105... and so on.

I am running a sliding-window pairwise correlation approach (728 sliding windows). The number 728 denotes the last sliding window in a for loop. The output above is a representative example: df2 for the very last sliding window.

Aim: I would like to run a pairwise correlation between the four subjects (between the subjects’ time-series) as follows:

pairwise_cor = df2.corr(method="pearson")

However, this results in the following empty output for pairwise_cor:

Empty DataFrame
Columns: []
Index: []

Question:
How should I modify pairwise_cor = df2.corr(method="pearson") so that it does not produce an empty output?

As far as I understand, the problem is that every row contains a list or array of values rather than a single number. pairwise_cor = df2.corr(method="pearson") would probably work if I could reshape the dataframe so that every column corresponds to one subject and every row to one value of the list. Is that correct? How could I modify the dataframe to change it accordingly?

Answer 1

Score: 2

Suppose the following dataframe:

import pandas as pd
import numpy as np

data = [[0.07717473, 0.90724758, 0.80752715, 0.04318562, 0.0569035, 0.12796062, 0.220677, 0.3716013, 0.74646015, -0.41114205],
        [-0.12252081, 0.03894384, -0.74668061, 0.00310963, 0.10716717, -0.42125924, 0.90771138, 0.10498123, 0.60872, -0.62587628],
        [-0.24917124, -0.76921359, 0.55519856, 0.56067116, -0.27319101, -0.01258496, 0.66428267, 0.53822299, -0.86883193, -0.15486245],
        [-0.14676444, 0.21910793, -0.11010598, 0.86445147, -0.92299316, -0.82828022, -0.7274392, 0.66965337, 0.67446502, -0.50343198]]
df = pd.Series(data, index=[100610, 102311, 104416, 105923]).to_frame(728)
print(df)

# Output
                                                      728
100610  [0.07717473, 0.90724758, 0.80752715, 0.0431856...
102311  [-0.12252081, 0.03894384, -0.74668061, 0.00310...
104416  [-0.24917124, -0.76921359, 0.55519856, 0.56067...
105923  [-0.14676444, 0.21910793, -0.11010598, 0.86445...
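As a side note on why df2.corr() comes back empty: a column whose cells are Python lists is stored with dtype object, so there is no numeric column left for the correlation to work on. The short check below is only a sketch on a smaller, made-up frame (the values and the name demo are illustrative, not taken from the question). Depending on the pandas version, corr may silently skip such a column, which gives the empty frame shown in the question, or it may raise an error instead.

```python
import pandas as pd

# Illustrative frame: one column (728) whose cells are Python lists
demo = pd.Series(
    [[0.1, 0.4, 0.3], [0.2, 0.1, 0.5]],
    index=[100610, 102311],
).to_frame(728)

# The column holds list objects, so pandas stores it as dtype object
print(demo.dtypes)

# Keeping only numeric columns leaves nothing to correlate
numeric_part = demo.select_dtypes(include="number")
print(numeric_part.shape)
```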

You can stack the lists into a 2-D array (one row per subject) and pass it to np.corrcoef, which by default treats each row as one variable:

>>> np.corrcoef(np.vstack(df.loc[:, 728]))

array([[ 1.        ,  0.19332151, -0.23634425,  0.40882581],
       [ 0.19332151,  1.        , -0.08039743,  0.15825333],
       [-0.23634425, -0.08039743,  1.        , -0.02293406],
       [ 0.40882581,  0.15825333, -0.02293406,  1.        ]])
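The reshaping described in the question also works, and it keeps the subject labels on the result. A minimal sketch, assuming every list in column 728 has the same length; wide and pairwise_cor are names chosen here for illustration, and the data are a shortened, made-up sample rather than the values above:

```python
import pandas as pd

# Shortened, made-up sample in the same layout as df above
data = [[0.1, 0.9, 0.8, 0.0, 0.3],
        [-0.1, 0.0, -0.7, 0.2, 0.1],
        [-0.2, -0.8, 0.6, 0.5, -0.3],
        [-0.1, 0.2, -0.1, 0.9, -0.9]]
df = pd.Series(data, index=[100610, 102311, 104416, 105923]).to_frame(728)

# Expand each list into its own row of numbers, then transpose:
# one column per subject, one row per sampling point
wide = pd.DataFrame(df[728].tolist(), index=df.index).T

# The columns are numeric now, so DataFrame.corr has data to work on
pairwise_cor = wide.corr(method="pearson")
print(pairwise_cor)
```

For complete data this should agree numerically with the np.corrcoef result, with the difference that the rows and columns of pairwise_cor are labelled by subject ID.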

huangapple
  • Published on July 10, 2023 at 21:50:04
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76654416.html