英文:
Using an `if` statement inside a Pandas DataFrame's `assign` method
问题
我正在尝试在需要使用if/else
语句检查条件的几列上执行操作时遇到困难。
更具体地说,我正在尝试在Pandas Dataframe的assign
方法的范围内执行此检查。以下是我尝试执行的示例:
# 导入Pandas
import pandas as pd
# 创建合成数据
my_df = pd.DataFrame({'col1':[1,2,3,4,5,6,7,8,9,10],
'col2':[11,22,33,44,55,66,77,88,99,1010]})
# 创建一个单独的输出DataFrame,不覆盖原始输入DataFrame
out_df = my_df.assign(
# 使用lambda函数成功创建一个名为`col3`的新列
col3=lambda row: row['col1'] + row['col2'],
# 使用新的lambda函数对新生成的列执行操作。
bleep_bloop=lambda row: 'bleep' if (row['col3']%8 == 0) else 'bloop')
上面的代码会产生ValueError
:
ValueError: The truth value of a Series is ambiguous
在尝试调查错误时,我发现了这个Stack Overflow线程。看起来lambda
函数在DataFrame中不始终与条件逻辑一起很好地工作,主要是由于DataFrame尝试将事物处理为Series。
一些不太干净的解决方法
使用apply
一种不太干净的解决方法是如上所示使用assign
方法创建col3
,然后使用apply
方法创建bleep_bloop
列:
out_sr = (my_df.assign(
col3=lambda row: row['col1'] + row['col2'])
.apply(lambda row: 'bleep' if (row['col3']%8 == 0)
else 'bloop', axis=1))
这里的问题是上面的代码仅返回bleep_bloop
列的结果,而不是具有col3
和bleep_bloop
的新DataFrame。
即时执行与多个命令
另一种方法是将一个命令拆分为两个:
out_df_2 = (my_df.assign(col3=lambda row: row['col1'] + row['col2']))
out_df_2['bleep_bloop'] = out_df_2.apply(lambda row: 'bleep' if (row['col3']%8 == 0)
else 'bloop', axis=1)
这也可以工作,但我真的想尽可能坚持即时执行的方法,其中我可以在一个链接的命令中执行所有操作。
回到主要问题
考虑到我上面显示的解决方法混乱且不能完全满足我的需求,是否有其他方法可以创建基于条件if/else
语句的新列?
我在这里给出的示例非常简单,但请考虑实际应用可能涉及应用自定义函数(例如:out_df=my_df.assign(new_col=lambda row: my_func(row))
,其中my_func
是一个使用同一行的多个其他列作为输入的复杂函数)。
英文:
Intro and reproducible code snippet
I'm having a hard time performing an operation on a few columns that requires the checking of a condition using an if/else
statement.
More specifically, I'm trying to perform this check within the confines of the assign
method of a Pandas Dataframe. Here is an example of what I'm trying to do
# Importing Pandas
import pandas as pd
# Creating synthetic data
my_df = pd.DataFrame({'col1':[1,2,3,4,5,6,7,8,9,10],
'col2':[11,22,33,44,55,66,77,88,99,1010]})
# Creating a separate output DataFrame that doesn't overwrite
# the original input DataFrame
out_df = my_df.assign(
# Successfully creating a new column called `col3` using a lambda function
col3=lambda row: row['col1'] + row['col2'],
# Using a new lambda function to perform an operation on the newly
# generated column.
bleep_bloop=lambda row: 'bleep' if (row['col3']%8 == 0) else 'bloop')
The code above yeilds a ValueError
:
ValueError: The truth value of a Series is ambiguous
When trying to investigate the error, I found this SO thread. It seems that lambda
functions don't always work very nicely with conditional logic in a DataFrame, mostly due to the DataFrame's attempt to deal with things as Series.
A few dirty workarounds
Use apply
A dirty workaround would be to make col3
using the assign
method as indicated above, but then create the bleep_bloop
column using an apply
method instead:
out_sr = (my_df.assign(
col3=lambda row: row['col1'] + row['col2'])
.apply(lambda row: 'bleep' if (row['col3']%8 == 0)
else 'bloop', axis=1))
The problem here is that the code above returns only a Series with the results of the bleep_bloop
column instead of a new DataFrame with both col3
and bleep_bloop
.
On the fly vs. multiple commands
Yet another approach would be to break one command into two:
out_df_2 = (my_df.assign(col3=lambda row: row['col1'] + row['col2']))
out_df_2['bleep_bloop'] = out_df_2.apply(lambda row: 'bleep' if (row['col3']%8 == 0)
else 'bloop', axis=1)
This also works, but I'd really like to stick to the on-the-fly approach where I do everything in one chained command, if possible.
Back to the main question
Given that the workarounds I showed above are messy and don't really get the job done like I need, is there any other way I can create a new column that's based on using a conditional if/else
statement?
The example I gave here is pretty simple, but consider that the real world application would likely involve applying custom-made functions (e.g.: out_df=my_df.assign(new_col=lambda row: my_func(row))
, where my_func
is some complex function that uses several other columns from the same row as inputs).
答案1
得分: 6
你的错误在于你认为lambda函数作用于行,实际上它以矢量化方式作用于整个列。你需要使用矢量化函数:
import numpy as np
out_df = my_df.assign(
col3=lambda d: d['col1'] + d['col2'],
bleep_bloop=lambda d: np.where(d['col3'] % 8, 'bloop', 'bleep')
)
print(out_df)
输出结果:
col1 col2 col3 bleep_bloop
0 1 11 12 bloop
1 2 22 24 bleep
2 3 33 36 bloop
3 4 44 48 bleep
4 5 55 60 bloop
5 6 66 72 bleep
6 7 77 84 bloop
7 8 88 96 bleep
8 9 99 108 bloop
9 10 1010 1020 bloop
英文:
Your mistake is that you considered the lambda to act on rows, while it acts on full columns in a vectorized way. You need to use vectorized functions:
import numpy as np
out_df = my_df.assign(
col3=lambda d: d['col1'] + d['col2'],
bleep_bloop=lambda d: np.where(d['col3']%8, 'bloop', 'bleep')
)
print(out_df)
Output:
col1 col2 col3 bleep_bloop
0 1 11 12 bloop
1 2 22 24 bleep
2 3 33 36 bloop
3 4 44 48 bleep
4 5 55 60 bloop
5 6 66 72 bleep
6 7 77 84 bloop
7 8 88 96 bleep
8 9 99 108 bloop
9 10 1010 1020 bloop
答案2
得分: 1
或者对于超过2个条件,您可以使用np.select:
import numpy as np
out_df = (my_df.assign(
col3=lambda df_: df_['col1'] + df_['col2'],
bleep_bloop=lambda df_: np.select(condlist=[df_['col3'] % 8 == 0,
df_['col3'] % 8 == 1,
df_['col3'] > 100],
choicelist=['bleep',
'bloop',
'bliip'],
default='bluup')))
np.select的好处是它像where(向量化函数,因此更快),您可以添加任意多个条件。
英文:
Or for more than 2 conditions you can use np.select:
import numpy as np
out_df=(my_df.assign(
col3 = lambda df_ : df_['col1'] + df_['col2'],
bleep_bloop=lambda df_: np.select(condlist=[df_['col3']%8==0,
df_['col3']%8==1,
df_['col3']>100 ],
choicelist=['bleep',
'bloop',
'bliip'],
default='bluup')))
The good thing about np.select is that it works like where(vectorized functions therefore faster) and you can put as many condition you want.
答案3
得分: 0
由于您的最终列需要复杂的逻辑,正如您所提到的,因此创建一个单独的函数并将其应用于行是有道理的。
def my_func(x):
if (x['col1'] + x['col2']) % 8 == 0:
return 'bleep'
else:
return 'bloop'
my_df['bleep_bloop'] = my_df.apply(lambda x: my_func(x), axis=1)
当您将 x 传递给函数时,实际上是将每一行传递给它,可以在函数内部使用 x['col1'] 等列的值。这样,您可以创建任何您需要的复杂函数。请注意,这里需要使用 axis=1 来传递行。
我没有包含创建 col3,仅提供一个示例。
英文:
Since you will be needing a complex logic in your final column, as you mentioned it makes sense to create a separate function for it and apply it to the rows.
def my_func(x):
if (x['col1'] + x['col2']) % 8 == 0:
return 'bleep'
else:
return 'bloop'
my_df['bleep_bloop'] = my_df.apply(lambda x: my_func(x), axis=1)
When you pass the x to the function, you are in fact passing each row and can use any of the column values inside your function like x['col1'] and so on. This way you can create as complex a function as you need. Note that axis=1 is required here to pass the rows.
I did not include creation of col3 just to provide a sample.
通过集体智慧和协作来改善编程学习和解决问题的方式。致力于成为全球开发者共同参与的知识库,让每个人都能够通过互相帮助和分享经验来进步。
评论