Conditional math operations with columns in a pandas dataframe

Question

I have a bunch of columns in my dataframe with different values, as seen in this sample:

Especies  Especies_0  Especies_1  Especies_2  Especies_3
2.20        3.44        1.90        1.24        0.00
2.20        3.04        2.55        0.00        0.00
1.88        2.19        0.00        0.00        0.00
2.20        3.44        2.28        2.55        0.00
3.44        2.20        0.00        0.00        0.00
2.20        2.58        0.00        0.00        0.00
1.88        2.19        0.00        0.00        0.00
3.44        1.91        3.04        1.83        3.98
3.44        2.20        0.00        0.00        0.00
2.20        2.55        1.90        0.00        0.00
1.88        2.20        0.00        0.00        0.00

The operation I want to perform is:

avg(abs(max - col) for col in cols)

where max is the maximum value across the columns in each row (for the first row, max would be 3.44 and cols would be the remaining values in that row), abs is the absolute value function, and avg means taking the average.

For example, for the first row, the operation would be: ((3.44-2.20)+(3.44-1.90)+(3.44-1.24))/3 = 1.66

For the 5th row, with values (3.44, 2.20, 0.00, 0.00, 0.00), the result would be: (3.44 - 2.20) / 1 = 1.24

This is simple enough, but there's a catch: I don't want to consider the column holding the max value, or any column containing 0.0. Note that the max value is not always in the same column, and the number of columns containing 0.0 varies from row to row.

I have managed to do it with single scalar values, and I even wrote a function that does it:

def ele_diff(esp0, esp1, esp2, esp3, esp4):
    species = sorted([esp0, esp1, esp2, esp3, esp4])
    diff = [species[-1] - spec for spec in species if spec != 0.0 and spec != species[-1]]
    return sum(diff) / len(diff)

But I'm not able to apply my function to the dataframe. I've tried df.apply() and df.applymap(), but they don't seem to work with the function I've written (applymap takes only 1 input and gives 1 output, while apply does not feed the function each row separately, so the function raises ValueError because the truth value of a Series is ambiguous).
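For reference, here is a minimal sketch of how a scalar function like this could be applied one row at a time with df.apply and axis=1. It is an adaptation, not the original function: the signature is changed to *values so it accepts any number of columns, and the NaN returned when no value qualifies is an assumption added to avoid a ZeroDivisionError. The three-row frame is a subset of the sample data.

```python
import pandas as pd


def ele_diff(*values):
    """Mean distance from the row maximum, skipping zeros and the maximum itself."""
    top = max(values)
    diffs = [top - v for v in values if v != 0.0 and v != top]
    # Assumption: return NaN rather than divide by zero when nothing qualifies.
    return sum(diffs) / len(diffs) if diffs else float("nan")


# Three rows taken from the sample data in the question.
df = pd.DataFrame({
    "Especies":   [2.20, 3.44, 3.44],
    "Especies_0": [3.44, 2.20, 1.91],
    "Especies_1": [1.90, 0.00, 3.04],
    "Especies_2": [1.24, 0.00, 1.83],
    "Especies_3": [0.00, 0.00, 3.98],
})

# axis=1 passes each row to the lambda as a Series; * unpacks it into scalars.
result = df.apply(lambda row: ele_diff(*row), axis=1)
print(result.round(3).tolist())  # [1.66, 1.24, 1.425]
```

This runs Python code once per row, so it will likely be slower than a vectorized approach on large frames.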

I've also tried to do it directly with the dataframe, but as the logic is complex, I haven't been able to come up with a solution.

The main problem I've faced seems to be checking that the values I'm going to subtract are neither 0.0 nor the maximum.

Answer 1

Score: 0

First, select only the necessary columns (those whose names start with Especies) using DataFrame.filter. Then replace the row maximum and the 0 values with missing values, subtract the maximum, take absolute values, and finally average across each row:

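# keep only the columns whose names start with 'Especies'
df1 = df.filter(regex='^Especies')

# or, if necessary, specify the column names in a list
#cols = ['Especies', 'Especies_0', 'Especies_1', 'Especies_2', 'Especies_3']
#df1 = df[cols]

# row-wise maximum
max1 = df1.max(axis=1)
# True where a value is neither the row maximum nor 0
mask = df1.ne(max1, axis=0) & df1.ne(0)
# masked-out values become NaN, which mean() skips by default
df['new'] = df1.where(mask).sub(max1, axis=0).abs().mean(axis=1)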
print (df)
    Especies  Especies_0  Especies_1  Especies_2  Especies_3       new
0       2.20        3.44        1.90        1.24        0.00  1.660000
1       2.20        3.04        2.55        0.00        0.00  0.665000
2       1.88        2.19        0.00        0.00        0.00  0.310000
3       2.20        3.44        2.28        2.55        0.00  1.096667
4       3.44        2.20        0.00        0.00        0.00  1.240000
5       2.20        2.58        0.00        0.00        0.00  0.380000
6       1.88        2.19        0.00        0.00        0.00  0.310000
7       3.44        1.91        3.04        1.83        3.98  1.425000
8       3.44        2.20        0.00        0.00        0.00  1.240000
9       2.20        2.55        1.90        0.00        0.00  0.500000
10      1.88        2.20        0.00        0.00        0.00  0.320000
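The same masking idea can also be sketched directly in NumPy. This is an alternative formulation, not part of the answer above: it sums the qualifying differences and divides by their count instead of relying on NaN-skipping. The names values, row_max, keep, total and count are illustrative, and returning NaN for rows with no qualifying value is an assumption.

```python
import numpy as np
import pandas as pd

cols = ["Especies", "Especies_0", "Especies_1", "Especies_2", "Especies_3"]
rows = [
    [2.20, 3.44, 1.90, 1.24, 0.00],
    [2.20, 3.04, 2.55, 0.00, 0.00],
    [1.88, 2.19, 0.00, 0.00, 0.00],
    [2.20, 3.44, 2.28, 2.55, 0.00],
    [3.44, 2.20, 0.00, 0.00, 0.00],
    [2.20, 2.58, 0.00, 0.00, 0.00],
    [1.88, 2.19, 0.00, 0.00, 0.00],
    [3.44, 1.91, 3.04, 1.83, 3.98],
    [3.44, 2.20, 0.00, 0.00, 0.00],
    [2.20, 2.55, 1.90, 0.00, 0.00],
    [1.88, 2.20, 0.00, 0.00, 0.00],
]
df = pd.DataFrame(rows, columns=cols)

values = df[cols].to_numpy(dtype=float)
row_max = values.max(axis=1, keepdims=True)   # shape (n_rows, 1), broadcasts per row
keep = (values != row_max) & (values != 0)    # neither the row maximum nor zero

total = np.where(keep, row_max - values, 0.0).sum(axis=1)
count = keep.sum(axis=1)

# Divide only where at least one value qualifies; other rows stay NaN (assumption).
new = np.divide(total, count, out=np.full(len(values), np.nan), where=count > 0)
df["new"] = new
print(df["new"].round(6).tolist())
```

Because row_max is taken from the same array, the exact equality test against it is safe here; comparing against values computed elsewhere may need a tolerance such as np.isclose.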

Details:

print (df1.where(mask))
    Especies  Especies_0  Especies_1  Especies_2  Especies_3
0       2.20         NaN        1.90        1.24         NaN
1       2.20         NaN        2.55         NaN         NaN
2       1.88         NaN         NaN         NaN         NaN
3       2.20         NaN        2.28        2.55         NaN
4        NaN        2.20         NaN         NaN         NaN
5       2.20         NaN         NaN         NaN         NaN
6       1.88         NaN         NaN         NaN         NaN
7       3.44        1.91        3.04        1.83         NaN
8        NaN        2.20         NaN         NaN         NaN
9       2.20         NaN        1.90         NaN         NaN
10      1.88         NaN         NaN         NaN         NaN

print (df1.where(mask).sub(max1, axis=0))
    Especies  Especies_0  Especies_1  Especies_2  Especies_3
0      -1.24         NaN       -1.54       -2.20         NaN
1      -0.84         NaN       -0.49         NaN         NaN
2      -0.31         NaN         NaN         NaN         NaN
3      -1.24         NaN       -1.16       -0.89         NaN
4        NaN       -1.24         NaN         NaN         NaN
5      -0.38         NaN         NaN         NaN         NaN
6      -0.31         NaN         NaN         NaN         NaN
7      -0.54       -2.07       -0.94       -2.15         NaN
8        NaN       -1.24         NaN         NaN         NaN
9      -0.35         NaN       -0.65         NaN         NaN
10     -0.32         NaN         NaN         NaN         NaN
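One consequence of the masking shown in these intermediate frames is worth noting: if no value in a row survives the mask (for example, the row holds only its maximum plus zeros, or only zeros), every entry is NaN and the mean is NaN as well. The sketch below uses a small made-up frame to illustrate this; replacing the NaN with 0.0 via fillna is only one possible choice and is an assumption, not something the question specifies.

```python
import pandas as pd

# Hypothetical edge-case rows, not taken from the sample data.
df1 = pd.DataFrame({
    "Especies":   [2.20, 0.00, 3.44],
    "Especies_0": [0.00, 0.00, 3.44],
    "Especies_1": [0.00, 0.00, 2.20],
})

max1 = df1.max(axis=1)
mask = df1.ne(max1, axis=0) & df1.ne(0)
new = df1.where(mask).sub(max1, axis=0).abs().mean(axis=1)

# Rows 0 and 1 have no qualifying value; row 2 has a tied maximum,
# so both copies of 3.44 are excluded and only 2.20 is averaged.
print(new.isna().tolist())  # [True, True, False]

# Assumption: treat "nothing to compare" as a difference of 0.0.
filled = new.fillna(0.0)
```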

huangapple
  • Posted on June 12, 2023 at 16:20:26
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76454752.html