在Pandas DataFrame的`assign`方法内使用`if`语句。

huangapple go评论71阅读模式
英文:

Using an `if` statement inside a Pandas DataFrame's `assign` method

问题

我正在尝试在需要使用if/else语句检查条件的几列上执行操作时遇到困难。

更具体地说,我正在尝试在Pandas Dataframe的assign方法的范围内执行此检查。以下是我尝试执行的示例:

# 导入Pandas
import pandas as pd

# 创建合成数据
my_df = pd.DataFrame({'col1':[1,2,3,4,5,6,7,8,9,10],
                      'col2':[11,22,33,44,55,66,77,88,99,1010]})

# 创建一个单独的输出DataFrame,不覆盖原始输入DataFrame
out_df = my_df.assign(
    # 使用lambda函数成功创建一个名为`col3`的新列
    col3=lambda row: row['col1'] + row['col2'],

    # 使用新的lambda函数对新生成的列执行操作。
    bleep_bloop=lambda row: 'bleep' if (row['col3']%8 == 0) else 'bloop')

上面的代码会产生ValueError

ValueError: The truth value of a Series is ambiguous

在尝试调查错误时,我发现了这个Stack Overflow线程。看起来lambda函数在DataFrame中不始终与条件逻辑一起很好地工作,主要是由于DataFrame尝试将事物处理为Series。

一些不太干净的解决方法

使用apply
一种不太干净的解决方法是如上所示使用assign方法创建col3,然后使用apply方法创建bleep_bloop列:

out_sr = (my_df.assign(
    col3=lambda row: row['col1'] + row['col2'])
    .apply(lambda row: 'bleep' if (row['col3']%8 == 0) 
                               else 'bloop', axis=1))

这里的问题是上面的代码仅返回bleep_bloop列的结果,而不是具有col3bleep_bloop的新DataFrame。

即时执行与多个命令

另一种方法是将一个命令拆分为两个:

out_df_2 = (my_df.assign(col3=lambda row: row['col1'] + row['col2']))
out_df_2['bleep_bloop'] = out_df_2.apply(lambda row: 'bleep' if (row['col3']%8 == 0) 
                               else 'bloop', axis=1)

这也可以工作,但我真的想尽可能坚持即时执行的方法,其中我可以在一个链接的命令中执行所有操作。

回到主要问题

考虑到我上面显示的解决方法混乱且不能完全满足我的需求,是否有其他方法可以创建基于条件if/else语句的新列?

我在这里给出的示例非常简单,但请考虑实际应用可能涉及应用自定义函数(例如:out_df=my_df.assign(new_col=lambda row: my_func(row)),其中my_func是一个使用同一行的多个其他列作为输入的复杂函数)。

英文:

Intro and reproducible code snippet

I'm having a hard time performing an operation on a few columns that requires the checking of a condition using an if/else statement.

More specifically, I'm trying to perform this check within the confines of the assign method of a Pandas Dataframe. Here is an example of what I'm trying to do

# Importing Pandas
import pandas as pd

# Creating synthetic data
my_df = pd.DataFrame({'col1':[1,2,3,4,5,6,7,8,9,10],
                      'col2':[11,22,33,44,55,66,77,88,99,1010]})

# Creating a separate output DataFrame that doesn't overwrite 
# the original input DataFrame
out_df = my_df.assign(
    # Successfully creating a new column called `col3` using a lambda function
    col3=lambda row: row['col1'] + row['col2'],

    # Using a new lambda function to perform an operation on the newly 
    # generated column. 
    bleep_bloop=lambda row: 'bleep' if (row['col3']%8 == 0) else 'bloop')

The code above yeilds a ValueError:

ValueError: The truth value of a Series is ambiguous

When trying to investigate the error, I found this SO thread. It seems that lambda functions don't always work very nicely with conditional logic in a DataFrame, mostly due to the DataFrame's attempt to deal with things as Series.

A few dirty workarounds

Use apply

A dirty workaround would be to make col3 using the assign method as indicated above, but then create the bleep_bloop column using an apply method instead:

out_sr = (my_df.assign(
    col3=lambda row: row['col1'] + row['col2'])
    .apply(lambda row: 'bleep' if (row['col3']%8 == 0) 
                               else 'bloop', axis=1))

The problem here is that the code above returns only a Series with the results of the bleep_bloop column instead of a new DataFrame with both col3 and bleep_bloop.

On the fly vs. multiple commands

Yet another approach would be to break one command into two:

out_df_2 = (my_df.assign(col3=lambda row: row['col1'] + row['col2']))
out_df_2['bleep_bloop'] = out_df_2.apply(lambda row: 'bleep' if (row['col3']%8 == 0) 
                               else 'bloop', axis=1)

This also works, but I'd really like to stick to the on-the-fly approach where I do everything in one chained command, if possible.

Back to the main question

Given that the workarounds I showed above are messy and don't really get the job done like I need, is there any other way I can create a new column that's based on using a conditional if/else statement?

The example I gave here is pretty simple, but consider that the real world application would likely involve applying custom-made functions (e.g.: out_df=my_df.assign(new_col=lambda row: my_func(row)), where my_func is some complex function that uses several other columns from the same row as inputs).

答案1

得分: 6

你的错误在于你认为lambda函数作用于行,实际上它以矢量化方式作用于整个列。你需要使用矢量化函数:

import numpy as np

out_df = my_df.assign(
    col3=lambda d: d['col1'] + d['col2'],
    bleep_bloop=lambda d: np.where(d['col3'] % 8, 'bloop', 'bleep')
)

print(out_df)

输出结果:

   col1  col2  col3 bleep_bloop
0     1    11    12       bloop
1     2    22    24       bleep
2     3    33    36       bloop
3     4    44    48       bleep
4     5    55    60       bloop
5     6    66    72       bleep
6     7    77    84       bloop
7     8    88    96       bleep
8     9    99   108       bloop
9    10  1010  1020       bloop
英文:

Your mistake is that you considered the lambda to act on rows, while it acts on full columns in a vectorized way. You need to use vectorized functions:

import numpy as np

out_df = my_df.assign(
    col3=lambda d: d['col1'] + d['col2'],
    bleep_bloop=lambda d: np.where(d['col3']%8, 'bloop', 'bleep')
)

print(out_df)

Output:

   col1  col2  col3 bleep_bloop
0     1    11    12       bloop
1     2    22    24       bleep
2     3    33    36       bloop
3     4    44    48       bleep
4     5    55    60       bloop
5     6    66    72       bleep
6     7    77    84       bloop
7     8    88    96       bleep
8     9    99   108       bloop
9    10  1010  1020       bloop

答案2

得分: 1

或者对于超过2个条件,您可以使用np.select:

import numpy as np
out_df = (my_df.assign(
    col3=lambda df_: df_['col1'] + df_['col2'],
    bleep_bloop=lambda df_: np.select(condlist=[df_['col3'] % 8 == 0,
                                               df_['col3'] % 8 == 1,
                                               df_['col3'] > 100],
                                     choicelist=['bleep',
                                                 'bloop',
                                                 'bliip'],
                                     default='bluup')))

np.select的好处是它像where(向量化函数,因此更快),您可以添加任意多个条件。

英文:

Or for more than 2 conditions you can use np.select:

import numpy as np  
out_df=(my_df.assign(
    col3 = lambda df_ : df_['col1'] + df_['col2'],
    bleep_bloop=lambda df_: np.select(condlist=[df_['col3']%8==0,
                                                df_['col3']%8==1,
                                                df_['col3']>100 ],
                                      choicelist=['bleep',
                                                  'bloop',
                                                  'bliip'],
                                      default='bluup')))

The good thing about np.select is that it works like where(vectorized functions therefore faster) and you can put as many condition you want.

答案3

得分: 0

由于您的最终列需要复杂的逻辑,正如您所提到的,因此创建一个单独的函数并将其应用于行是有道理的。

def my_func(x):
    if (x['col1'] + x['col2']) % 8 == 0:
        return 'bleep'
    else:
        return 'bloop'

my_df['bleep_bloop'] = my_df.apply(lambda x: my_func(x), axis=1)

当您将 x 传递给函数时,实际上是将每一行传递给它,可以在函数内部使用 x['col1'] 等列的值。这样,您可以创建任何您需要的复杂函数。请注意,这里需要使用 axis=1 来传递行。

我没有包含创建 col3,仅提供一个示例。

英文:

Since you will be needing a complex logic in your final column, as you mentioned it makes sense to create a separate function for it and apply it to the rows.

def my_func(x):
    if (x['col1'] + x['col2']) % 8 == 0:
        return 'bleep'
    else:
        return 'bloop'

my_df['bleep_bloop'] = my_df.apply(lambda x: my_func(x), axis=1)

When you pass the x to the function, you are in fact passing each row and can use any of the column values inside your function like x['col1'] and so on. This way you can create as complex a function as you need. Note that axis=1 is required here to pass the rows.

I did not include creation of col3 just to provide a sample.

huangapple
  • 本文由 发表于 2023年2月10日 05:30:28
  • 转载请务必保留本文链接:https://go.coder-hub.com/75404602.html
匿名

发表评论

匿名网友

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen:

确定