ANOVA in SciPy with NaN output
Question
<details>
<summary>English:</summary>
I'm trying to run the `f_oneway` function from SciPy.
I have three DataFrames, one per group, and I want to perform a one-way ANOVA along `axis=1`.
from scipy.stats import f_oneway
import pandas as pd
import numpy as np
group1 = {'1': {0: 574145.477641226, 1: 1570531.589742876, 2: 787929.7027375237, 3: 2570860.248729332, 4: 161008.90274193016, 5: np.nan, 6: 1027027.5447738492, 7: 10620164.126712576, 8: 3030551.86415567, 9: 6080226.794887304}, '2': {0: 5590292.274747584, 1: 2015192.4244239724, 2: 1442638.778579319, 3: 9484756.854645137, 4: 231213.53284854395, 5: 1576095.5497571388, 6: 853517.4230997175, 7: 13701076.997994969, 8: 880909.9414626792,9: 10973682.322579961}, '3': {0: 1786259.070812378, 1: 1188813.4685229606, 2: 280628.96027922264, 3: 2752454.6157454816, 4: 142423.39853381264, 5: 408643.1442709076, 6: 978859.742220046, 7: 8581569.49299859, 8: 2810091.19540494, 9: 3250847.2113601067}, '4': {0: 1423158.826220004, 1: np.nan, 2: 659142.6504867233, 3: 2727740.4095105752, 4: np.nan, 5: np.nan, 6: 166867.88656477776, 7: 15578367.076207979, 8: 1262229.6767083204, 9: 7537134.164088669}}
group2 = {'1': {0: 1108031.2785915325, 1: 39475.12143335618, 2: 124744.55052420696, 3: 3415955.3418994714, 4: np.nan, 5: np.nan, 6: 1185929.1264358065, 7: 14219856.696859175, 8: 107938.85576451271, 9: 9075885.57144011}, '2': {0: 3711668.7595074927, 1: np.nan, 2: 92069.12449997541, 3: 1430920.365911842, 4: 23305.980330372353, 5: 146884.88381736717, 6: 143162.52169470832, 7: 11043912.321755221,8: 1507299.549731886, 9: 6675740.20722453}, '3': {0: np.nan, 1: np.nan, 2: np.nan, 3: 192966.31343644214, 4: np.nan, 5: np.nan, 6: np.nan, 7: 13478434.128944362, 8: np.nan, 9: np.nan}, '4': {0: 6446934.0065947445, 1: 3195385.066201132, 2: 3332326.9653299027, 3: 7082529.01041953, 4: 139891.94206563127, 5: 208662.14176584402, 6: 2559284.7669506934, 7: 7395774.107780765, 8: 415796.834504837, 9: 9502289.070542539}}
group3 = {'1': {0: 5832002.081448822, 1: 2607987.6485992945, 2: 2465656.4470221293, 3: 6077038.510021252, 4: 391523.2907555177, 5: np.nan, 6: 2590061.00923242, 7: 7982067.848957288, 8: 61836.18519446156, 9: 10673885.385156194}, '2': {0: 4515593.798793708, 1: 2070893.600738691, 2: 1788619.7598766778, 3: 7302148.61285157, 4: 132247.07494014164, 5: 2130531.009443398, 6: 849079.4122880008, 7: 11086507.936560597, 8: np.nan, 9: 8977041.57285477}, '3': {0: 6916739.909404968, 1: 2886026.824106484, 2: 871822.3682870092, 3: 6515743.347648245, 4: 347767.01169986156, 5: 2975827.5336636542, 6: 3270053.676901515, 7: 9230036.81889698, 8: 4753111.521553177, 9: 11835765.28309747}, '4': {0: 8918243.297089897, 1: 2631751.3775385492, 2: 2294251.0955892503, 3: 7540353.19469351, 4: 48925.64795198818, 5: 447721.0646689915, 6: 1682494.645617865, 7: 6945276.49780706, 8: 978022.2657575278, 9: 11631856.25162219}}
groups = [group1, group2, group3]
data = [pd.DataFrame(x) for x in groups]
result = f_oneway(*data, axis=1)
result
The result output is:
> pvalue=array([nan, nan, nan, 0.17318404, nan, nan, nan, 0.24112312, nan, nan])
The NaN p-values are probably caused by the NaN values in my datasets, so the analysis needs to omit them. So I tried:
groups = [group1, group2, group3]
data = [pd.DataFrame(x) for x in groups]
data_test = []
for i in data:
    df = i.to_numpy()
    df = [x[~np.isnan(x)] for x in df]
    data_test.append(df)
from scipy.stats import f_oneway
result = f_oneway(*data_test, axis=1)
result
And the output was:
> ValueError: setting an array element with a sequence. The requested
> array has an inhomogeneous shape after 1 dimensions. The detected
> shape was (10,) + inhomogeneous part.
Does anyone know how I can perform an ANOVA with the same performance as SciPy's while omitting the NaNs from the original samples?
</details>
# Answer 1
**Score**: 1
Within each sample, you are deleting a different number of NaN values from each row, so each sample ends up as a ragged (inhomogeneous) array. NumPy cannot turn that into a regular 2-D array, which is what the `ValueError` is reporting. With missing values in tabular data you usually have to do one of the following:
**Drop the whole column** - this is often not helpful, but it depends on the subject studied.
**Drop the whole row** - often less drastic, but you may need to check how much data you have left.
**Impute data** - fill with a default or minimum value, a rolling average, or some other method of adding guessed values.
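As a rough illustration of how those three options look in pandas, here is a minimal sketch on a made-up toy table (the frame `df` and its column names are hypothetical and not taken from the question's data). Filling with the column mean is only one of many possible imputation choices.

```python
import numpy as np
import pandas as pd

# Hypothetical toy table with a single missing value in column "a"
df = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": [4.0, 5.0, 6.0]})

# Option 1: drop every column that contains a NaN (only "b" survives)
dropped_cols = df.dropna(axis="columns")

# Option 2: drop every row that contains a NaN (rows 0 and 2 survive)
dropped_rows = df.dropna(axis="index")

# Option 3: impute, here by filling each NaN with its column mean
imputed = df.fillna(df.mean())
```

Which option is appropriate depends on how much data each one discards and on whether guessed values are acceptable for the analysis.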
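Pandas has many functions for handling missing values, and there are [many online guides](https://www.geeksforgeeks.org/working-with-missing-data-in-pandas/). Here we drop every row that has a missing value in any of the groups, so that all three groups keep the same rows and `f_oneway(..., axis=1)` receives arrays of matching shape. With the data above, that leaves only rows 3 and 7, the two rows that already had p-values in the original output.
from scipy.stats import f_oneway
groups = [group1, group2, group3]
data = [pd.DataFrame(x) for x in groups]
# Rows that are complete in every group (dropping rows per group would leave arrays of different lengths)
complete = pd.concat(data, axis=1).notna().all(axis=1)
data = [df[complete] for df in data]
result = f_oneway(*data, axis=1)
result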
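If dropping whole rows discards too much data, an alternative is to remove the NaNs within each row and run the test one row at a time, since `f_oneway` accepts 1-D samples of different lengths. The sketch below assumes all groups have the same number of rows. The helper name `rowwise_anova_omit_nan` and the toy frames are hypothetical, not part of SciPy or the question. If your SciPy version's `f_oneway` accepts `nan_policy='omit'` (check its documentation), that would be the simpler route; this sketch does not rely on it.

```python
import numpy as np
import pandas as pd
from scipy.stats import f_oneway


def rowwise_anova_omit_nan(*frames):
    """One-way ANOVA per row, dropping NaNs from each group separately.

    Hypothetical helper: assumes every frame has the same number of rows.
    """
    arrays = [np.asarray(f, dtype=float) for f in frames]
    n_rows = arrays[0].shape[0]
    f_stats = np.full(n_rows, np.nan)
    p_values = np.full(n_rows, np.nan)
    for i in range(n_rows):
        # One 1-D sample per group for this row, with NaNs removed
        samples = [a[i][~np.isnan(a[i])] for a in arrays]
        n_total = sum(len(s) for s in samples)
        # Leave NaN where a group is empty or no within-group
        # degrees of freedom remain
        if any(len(s) == 0 for s in samples) or n_total <= len(samples):
            continue
        f_stats[i], p_values[i] = f_oneway(*samples)
    return f_stats, p_values


# Toy data (hypothetical): 2 rows, 3 replicate columns per group
g1 = pd.DataFrame([[1.0, 2.0, 3.0], [4.0, np.nan, 6.0]])
g2 = pd.DataFrame([[2.0, 3.0, 4.0], [5.0, 7.0, np.nan]])
g3 = pd.DataFrame([[7.0, 8.0, 9.0], [np.nan, 9.0, 10.0]])

f_stats, p_values = rowwise_anova_omit_nan(g1, g2, g3)
print(p_values)
```

With the question's data this would be called as `rowwise_anova_omit_nan(*[pd.DataFrame(x) for x in groups])`. It is a plain Python loop, so it will be slower than the vectorised `axis=1` call on large tables, and rows with very few remaining values give weak tests.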