How to perform a pairwise correlation in a Pandas dataframe containing lists?

Question


I created a Pandas dataframe, called df2, which looks as follows when I print it:

                                                          728
    100610  [0.8872128569054152, 0.6275935748500376, 0.105...
    102311  [-0.9484644612008593, -1.7934280570087853, -2....
    104416  [0.1664251633793124, 0.1116268791242702, 0.050...
    105923  [-0.2307886056759264, -0.5762187864896702, -0....

The leftmost column (the index: 100610, 102311, 104416, 105923) lists the four subjects the data comes from. Each subject has a time series of 150 sampling points. For example, subject 100610 has the time series [0.8872128569054152, 0.6275935748500376, 0.105... and so on.

I am running a sliding-window pairwise correlation approach (728 sliding windows). The column label 728 denotes the last sliding window in a for loop, so the output above is a representative example showing df2 for the very last window.

Aim: I would like to run a pairwise correlation between the four subjects (between the subjects’ time-series) as follows:

    pairwise_cor = df2.corr(method="pearson")

However, this produces the following empty output for pairwise_cor:

    Empty DataFrame
    Columns: []
    Index: []

Question:
How do I have to modify the code for pairwise_cor = df2.corr(method="pearson") so that the code does not produce an empty output?

As far as I understand, the problem is that every row contains a list or array of values. pairwise_cor = df2.corr(method="pearson") would probably work if I could transpose the dataframe so that every column corresponds to one subject and every row to one value of the list. Is that correct? How could I modify the dataframe to change it accordingly?
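The diagnosis above can be checked directly. The following is a minimal sketch, assuming a small hypothetical stand-in for df2 (two subjects, three points each) rather than the real data. It shows that a column of lists is stored with dtype object and that no numeric columns remain for corr to work on. Note also that corr correlates columns with each other, and df2 has only one column. In older pandas versions non-numeric columns are dropped silently, which would explain the empty result; newer versions may raise an error instead.

```python
import pandas as pd

# Hypothetical stand-in for df2: a single column (728) whose cells are Python lists
df2 = pd.Series(
    [[0.1, 0.4, 0.3], [0.2, 0.1, 0.5]],
    index=[100610, 102311],
).to_frame(728)

# A column of lists is stored as generic Python objects, not as floats
print(df2.dtypes)

# Keep only numeric columns: none survive, so there is nothing to correlate
numeric = df2.select_dtypes(include="number")
print(numeric.shape)
```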

Answer 1

Score: 2


Suppose the following dataframe:

    import pandas as pd
    import numpy as np

    # One inner list per subject; each cell of column 728 holds a whole time series
    data = [[0.07717473, 0.90724758, 0.80752715, 0.04318562, 0.0569035, 0.12796062, 0.220677, 0.3716013, 0.74646015, -0.41114205],
            [-0.12252081, 0.03894384, -0.74668061, 0.00310963, 0.10716717, -0.42125924, 0.90771138, 0.10498123, 0.60872, -0.62587628],
            [-0.24917124, -0.76921359, 0.55519856, 0.56067116, -0.27319101, -0.01258496, 0.66428267, 0.53822299, -0.86883193, -0.15486245],
            [-0.14676444, 0.21910793, -0.11010598, 0.86445147, -0.92299316, -0.82828022, -0.7274392, 0.66965337, 0.67446502, -0.50343198]]
    df = pd.Series(data, index=[100610, 102311, 104416, 105923]).to_frame(728)
    print(df)

    # Output
    #                                                       728
    # 100610  [0.07717473, 0.90724758, 0.80752715, 0.0431856...
    # 102311  [-0.12252081, 0.03894384, -0.74668061, 0.00310...
    # 104416  [-0.24917124, -0.76921359, 0.55519856, 0.56067...
    # 105923  [-0.14676444, 0.21910793, -0.11010598, 0.86445...

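You can use np.corrcoef, which works on stacked vectors. np.vstack turns the column of lists into a 2-D array with one row per subject, and np.corrcoef treats each row as one variable by default: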

    >>> np.corrcoef(np.vstack(df.loc[:, 728]))
    array([[ 1.        ,  0.19332151, -0.23634425,  0.40882581],
           [ 0.19332151,  1.        , -0.08039743,  0.15825333],
           [-0.23634425, -0.08039743,  1.        , -0.02293406],
           [ 0.40882581,  0.15825333, -0.02293406,  1.        ]])
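The reshaping idea from the question (one column per subject, one row per sampling point) should also work and keeps the subject labels on the result. The following is a sketch under assumptions, not part of the original answer: the short series and the names wide and matches are hypothetical, and the real data has 150 points per subject.

```python
import numpy as np
import pandas as pd

# Hypothetical short series; stand-ins for the real 150-point time series
data = [
    [0.1, 0.4, 0.3, 0.9],
    [0.2, 0.1, 0.5, 0.7],
    [0.9, 0.2, 0.4, 0.1],
]
df = pd.Series(data, index=[100610, 102311, 104416]).to_frame(728)

# Expand each list into its own row, then transpose:
# one column per subject, one row per sampling point
wide = pd.DataFrame(df[728].tolist(), index=df.index).T

# With numeric columns, DataFrame.corr returns a labeled correlation matrix
pairwise_cor = wide.corr(method="pearson")
print(pairwise_cor)

# Cross-check against np.corrcoef, which expects one row per variable
matches = np.allclose(pairwise_cor.to_numpy(), np.corrcoef(np.array(data)))
print(matches)
```

Inside a sliding-window loop, the same two lines (building wide and calling corr) could be applied to each window's column in turn; np.corrcoef avoids the intermediate DataFrame if only the raw matrix is needed.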

huangapple
  • Posted on July 10, 2023, 21:50:04
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76654416.html