Anova in scipy with nan output.


Question


<details>
<summary>English:</summary>

I'm trying to run the f_oneway function in scipy.

I have three DataFrames, one per group, and I want to perform a one-way ANOVA across the groups along axis=1, i.e. one test per row.

    from scipy.stats import f_oneway
    import pandas as pd
    import numpy as np
    group1 = {'1': {0: 574145.477641226, 1: 1570531.589742876, 2: 787929.7027375237, 3: 2570860.248729332, 4: 161008.90274193016, 5: np.nan, 6: 1027027.5447738492, 7: 10620164.126712576, 8: 3030551.86415567, 9: 6080226.794887304}, '2': {0: 5590292.274747584, 1: 2015192.4244239724, 2: 1442638.778579319, 3: 9484756.854645137, 4: 231213.53284854395, 5: 1576095.5497571388, 6: 853517.4230997175, 7: 13701076.997994969, 8: 880909.9414626792, 9: 10973682.322579961}, '3': {0: 1786259.070812378, 1: 1188813.4685229606, 2: 280628.96027922264, 3: 2752454.6157454816, 4: 142423.39853381264, 5: 408643.1442709076, 6: 978859.742220046, 7: 8581569.49299859, 8: 2810091.19540494, 9: 3250847.2113601067}, '4': {0: 1423158.826220004, 1: np.nan, 2: 659142.6504867233, 3: 2727740.4095105752, 4: np.nan, 5: np.nan, 6: 166867.88656477776, 7: 15578367.076207979, 8: 1262229.6767083204, 9: 7537134.164088669}}

    group2 = {'1': {0: 1108031.2785915325, 1: 39475.12143335618, 2: 124744.55052420696, 3: 3415955.3418994714, 4: np.nan, 5: np.nan, 6: 1185929.1264358065, 7: 14219856.696859175, 8: 107938.85576451271, 9: 9075885.57144011}, '2': {0: 3711668.7595074927, 1: np.nan, 2: 92069.12449997541, 3: 1430920.365911842, 4: 23305.980330372353, 5: 146884.88381736717, 6: 143162.52169470832, 7: 11043912.321755221, 8: 1507299.549731886, 9: 6675740.20722453}, '3': {0: np.nan, 1: np.nan, 2: np.nan, 3: 192966.31343644214, 4: np.nan, 5: np.nan, 6: np.nan, 7: 13478434.128944362, 8: np.nan, 9: np.nan}, '4': {0: 6446934.0065947445, 1: 3195385.066201132, 2: 3332326.9653299027, 3: 7082529.01041953, 4: 139891.94206563127, 5: 208662.14176584402, 6: 2559284.7669506934, 7: 7395774.107780765, 8: 415796.834504837, 9: 9502289.070542539}}
    group3 = {'1': {0: 5832002.081448822, 1: 2607987.6485992945, 2: 2465656.4470221293, 3: 6077038.510021252, 4: 391523.2907555177, 5: np.nan, 6: 2590061.00923242, 7: 7982067.848957288, 8: 61836.18519446156, 9: 10673885.385156194}, '2': {0: 4515593.798793708, 1: 2070893.600738691, 2: 1788619.7598766778, 3: 7302148.61285157, 4: 132247.07494014164, 5: 2130531.009443398, 6: 849079.4122880008, 7: 11086507.936560597, 8: np.nan, 9: 8977041.57285477}, '3': {0: 6916739.909404968, 1: 2886026.824106484, 2: 871822.3682870092, 3: 6515743.347648245, 4: 347767.01169986156, 5: 2975827.5336636542, 6: 3270053.676901515, 7: 9230036.81889698, 8: 4753111.521553177, 9: 11835765.28309747}, '4': {0: 8918243.297089897, 1: 2631751.3775385492, 2: 2294251.0955892503, 3: 7540353.19469351, 4: 48925.64795198818, 5: 447721.0646689915, 6: 1682494.645617865, 7: 6945276.49780706, 8: 978022.2657575278, 9: 11631856.25162219}}

    groups = [group1, group2, group3]
    data = [pd.DataFrame(x) for x in groups]
    result = f_oneway(*data, axis=1)
    result

The output is:

> pvalue=array([nan, nan, nan, 0.17318404, nan, nan, nan, 0.24112312, nan, nan])

The nan p-values are probably caused by the NaN values in my datasets, so the analysis needs to omit them. So I tried:

    groups = [group1, group2, group3]
    data = [pd.DataFrame(x) for x in groups]
    data_test = []
    for i in data:
        df = i.to_numpy()
        df = [x[~np.isnan(x)] for x in df]
        data_test.append(df)
    from scipy.stats import f_oneway
    result = f_oneway(*data_test, axis=1)
    result

And the output was:

> ValueError: setting an array element with a sequence. The requested
> array has an inhomogeneous shape after 1 dimensions. The detected
> shape was (10,) + inhomogeneous part.

Does anyone know how I can run an ANOVA with the same performance as scipy's while omitting the NaN values from the original samples?


</details>


# Answer 1
**Score**: 1

Within each sample, you are removing a different number of np.nan values from each row, so each sample ends up as a ragged (inhomogeneous) array that NumPy cannot turn into a regular 2-D array. With missing values in table-like structures, you usually have to do one of the following:

**Drop the whole column**: this is often not helpful, but it depends on the subject studied.

**Drop the whole row**: often less drastic, but you may need to check how much data you have left.

**Impute data**: fill with a default or minimum value, a rolling average, or some other method of adding guessed values.
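To make the ragged-array problem and the three options concrete, here is a minimal sketch on a made-up 3x3 table. The table `toy` and every variable name are illustrative assumptions, not part of the question's data, and the column mean is only one of many possible imputation choices.

```python
import numpy as np
import pandas as pd

# Made-up table: 3 rows, 3 columns, two missing values
toy = pd.DataFrame({"a": [1.0, np.nan, 3.0],
                    "b": [4.0, 5.0, 6.0],
                    "c": [7.0, 8.0, np.nan]})

# Stripping NaN row by row gives rows of unequal length (a ragged list)
ragged = [row[~np.isnan(row)] for row in toy.to_numpy()]
row_lengths = [len(row) for row in ragged]

# Option 1: drop every column that contains a missing value
no_nan_columns = toy.dropna(axis="columns")

# Option 2: drop every row that contains a missing value
no_nan_rows = toy.dropna(axis="index")

# Option 3: impute, here with each column's mean (one choice among many)
imputed = toy.fillna(toy.mean())

print(row_lengths)
print(no_nan_columns.shape, no_nan_rows.shape, imputed.shape)
```

All three options keep the table rectangular, which is what a vectorised call along an axis needs. Stripping NaN row by row does not.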

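Pandas has many functions for handling missing values, and there are [many online guides](https://www.geeksforgeeks.org/working-with-missing-data-in-pandas/). Here we drop every row that has a missing value in any of the groups, so that all groups keep the same rows:

    from scipy.stats import f_oneway
    groups = [group1, group2, group3]
    frames = [pd.DataFrame(x) for x in groups]
    # Rows with a NaN in any group; calling dropna() on each frame separately
    # would leave the groups with different row counts, which f_oneway rejects
    incomplete = pd.concat(frames, axis=1).isna().any(axis=1)
    data = [df.loc[~incomplete] for df in frames]
    result = f_oneway(*data, axis=1)
    result

With the data in the question, only rows 3 and 7 are complete in every group, which matches the two non-nan p-values in the original output.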
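If dropping whole rows discards too much data, another possibility is to run one ANOVA per row and remove the NaN values inside each group just for that row. The sketch below assumes that approach. `rowwise_anova_omit_nan` is a hypothetical helper name, and the toy frames are made up. It uses only the basic `f_oneway(*samples)` call on 1-D samples, so it does not depend on the `axis` argument.

```python
import numpy as np
import pandas as pd
from scipy.stats import f_oneway


def rowwise_anova_omit_nan(frames):
    """Run one one-way ANOVA per row, ignoring NaN inside each group.

    Hypothetical helper: `frames` is a list of DataFrames (one per group)
    that all have the same number of rows.
    """
    arrays = [df.to_numpy(dtype=float) for df in frames]
    n_rows = arrays[0].shape[0]
    f_stats = np.full(n_rows, np.nan)
    p_values = np.full(n_rows, np.nan)
    for i in range(n_rows):
        # One 1-D sample per group, with that group's NaN removed
        samples = [a[i][~np.isnan(a[i])] for a in arrays]
        n_total = sum(len(s) for s in samples)
        # Skip rows where a group is empty or no within-group
        # degrees of freedom remain; their results stay NaN
        if min(len(s) for s in samples) == 0 or n_total <= len(samples):
            continue
        f_stats[i], p_values[i] = f_oneway(*samples)
    return f_stats, p_values


# Toy data (made up): two groups, two rows, one missing value
g1 = pd.DataFrame({"1": [1.0, 2.0], "2": [2.0, np.nan], "3": [3.0, 4.0]})
g2 = pd.DataFrame({"1": [2.0, 5.0], "2": [4.0, 6.0], "3": [6.0, 7.0]})
f_stats, p_values = rowwise_anova_omit_nan([g1, g2])
print(p_values)
```

With the question's data, the call would be `rowwise_anova_omit_nan([pd.DataFrame(x) for x in groups])`, without any `dropna` beforehand. The Python loop is slower than a vectorised `axis=1` call, which should not matter for ten rows. Newer SciPy releases also appear to document a `nan_policy` parameter on `f_oneway`. If your installed version has it, `nan_policy='omit'` may make the loop unnecessary, but check this against your version's documentation.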

huangapple
  • Published on April 1, 2023, 00:22:05
  • When reposting, please keep the link to this article: https://go.coder-hub.com/75900735.html