February 10, 2023, 05:30:28

Using an `if` statement inside a Pandas DataFrame's `assign` method

Question

Intro and reproducible code snippet

I'm having a hard time performing an operation on a few columns that requires the checking of a condition using an if/else statement.

More specifically, I'm trying to perform this check within the confines of the assign method of a Pandas DataFrame. Here is an example of what I'm trying to do:

# Importing Pandas
import pandas as pd

# Creating synthetic data
my_df = pd.DataFrame({'col1':[1,2,3,4,5,6,7,8,9,10],
                      'col2':[11,22,33,44,55,66,77,88,99,1010]})

# Creating a separate output DataFrame that doesn't overwrite 
# the original input DataFrame
out_df = my_df.assign(
    # Successfully creating a new column called `col3` using a lambda function
    col3=lambda row: row['col1'] + row['col2'],

    # Using a new lambda function to perform an operation on the newly 
    # generated column. 
    bleep_bloop=lambda row: 'bleep' if (row['col3']%8 == 0) else 'bloop')

The code above yields a ValueError:

ValueError: The truth value of a Series is ambiguous

When trying to investigate the error, I found this SO thread. It seems that lambda functions don't always work very nicely with conditional logic in a DataFrame, mostly due to the DataFrame's attempt to deal with things as Series.

A few dirty workarounds

Use `apply`

A dirty workaround would be to make col3 using the assign method as indicated above, but then create the bleep_bloop column using an apply method instead:

out_sr = (my_df.assign(
    col3=lambda row: row['col1'] + row['col2'])
    .apply(lambda row: 'bleep' if (row['col3']%8 == 0) 
                               else 'bloop', axis=1))

The problem here is that the code above returns only a Series with the results of the bleep_bloop column instead of a new DataFrame with both col3 and bleep_bloop.

On the fly vs. multiple commands

Yet another approach would be to break one command into two:

out_df_2 = (my_df.assign(col3=lambda row: row['col1'] + row['col2']))
out_df_2['bleep_bloop'] = out_df_2.apply(lambda row: 'bleep' if (row['col3']%8 == 0) 
                               else 'bloop', axis=1)

This also works, but I'd really like to stick to the on-the-fly approach where I do everything in one chained command, if possible.

Back to the main question

Given that the workarounds I showed above are messy and don't really get the job done like I need, is there any other way I can create a new column that's based on using a conditional if/else statement?

The example I gave here is pretty simple, but consider that the real world application would likely involve applying custom-made functions (e.g.: out_df=my_df.assign(new_col=lambda row: my_func(row)), where my_func is some complex function that uses several other columns from the same row as inputs).

Answer 1

Score: 6

Your mistake is assuming that the lambda acts on rows. The callable passed to assign receives the whole DataFrame, so d['col3'] is a full column (a Series) and a plain if/else cannot evaluate it. You need to use vectorized functions:

import numpy as np

out_df = my_df.assign(
    col3=lambda d: d['col1'] + d['col2'],
    bleep_bloop=lambda d: np.where(d['col3']%8, 'bloop', 'bleep')
)

print(out_df)

Output:

   col1  col2  col3 bleep_bloop
0     1    11    12       bloop
1     2    22    24       bleep
2     3    33    36       bloop
3     4    44    48       bleep
4     5    55    60       bloop
5     6    66    72       bleep
6     7    77    84       bloop
7     8    88    96       bleep
8     9    99   108       bloop
9    10  1010  1020       bloop
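
To see why the original lambda fails, it may help to check what assign actually hands to the callable. The following is a minimal sketch on a shortened copy of the data; the helper name record_type and the list seen are made up for illustration and are not part of pandas.

```python
import numpy as np
import pandas as pd

# Shortened version of the question's data (an assumption made for brevity)
my_df = pd.DataFrame({'col1': [1, 2, 3, 4], 'col2': [11, 22, 33, 44]})

seen = []  # hypothetical helper list, used only to record argument types

def record_type(d):
    # assign calls this once with the whole DataFrame, not once per row
    seen.append(type(d).__name__)
    return d['col1'] + d['col2']

out_df = my_df.assign(
    col3=record_type,
    # The condition is a boolean Series, so np.where picks a label per element
    bleep_bloop=lambda d: np.where(d['col3'] % 8 == 0, 'bleep', 'bloop'),
)

print(seen)
print(out_df['bleep_bloop'].tolist())
```

Here seen should end up holding a single 'DataFrame' entry, which is why `'bleep' if <Series> else 'bloop'` raises the ambiguous truth value error. Writing the condition as `% 8 == 0` rather than relying on the truthiness of the remainder is only a readability choice; it yields the same labels as the answer's version.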

Answer 2

Score: 1

For more than two conditions, you can use np.select:

import numpy as np
out_df = (my_df.assign(
    col3=lambda df_: df_['col1'] + df_['col2'],
    bleep_bloop=lambda df_: np.select(condlist=[df_['col3']%8 == 0,
                                                df_['col3']%8 == 1,
                                                df_['col3'] > 100],
                                      choicelist=['bleep',
                                                  'bloop',
                                                  'bliip'],
                                      default='bluup')))

The good thing about np.select is that, like np.where, it is vectorized (and therefore faster than a row-wise apply), and you can add as many conditions as you want.
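
One detail worth knowing when conditions overlap: np.select takes the choice for the first condition in condlist that is true for an element, and falls back to default when none is. The sketch below uses a small made-up DataFrame (the values are chosen for illustration, not taken from the question) so that each branch is hit once.

```python
import numpy as np
import pandas as pd

# Made-up data chosen so that col3 becomes 12, 24, 105 and 108
my_df = pd.DataFrame({'col1': [1, 2, 5, 9], 'col2': [11, 22, 100, 99]})

out_df = my_df.assign(
    col3=lambda df_: df_['col1'] + df_['col2'],
    bleep_bloop=lambda df_: np.select(
        condlist=[df_['col3'] % 8 == 0,   # checked first
                  df_['col3'] % 8 == 1,   # checked second
                  df_['col3'] > 100],     # checked last
        choicelist=['bleep', 'bloop', 'bliip'],
        default='bluup',                  # used when no condition matches
    ),
)

print(out_df)
```

The row with col3 equal to 105 satisfies both the second and the third condition, and it should come out as 'bloop' because the second condition is listed earlier. Reordering condlist (together with choicelist) is therefore how you control precedence.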

Answer 3

Score: 0

Since, as you mentioned, your final column will need complex logic, it makes sense to create a separate function for it and apply it to the rows.

def my_func(x):
    if (x['col1'] + x['col2']) % 8 == 0:
        return 'bleep'
    else:
        return 'bloop'

my_df['bleep_bloop'] = my_df.apply(lambda x: my_func(x), axis=1)

When you pass the x to the function, you are in fact passing each row and can use any of the column values inside your function like x['col1'] and so on. This way you can create as complex a function as you need. Note that axis=1 is required here to pass the rows.

I left out the creation of col3 to keep the sample short.
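
If the single chained command from the question matters, one possible way to combine it with a row-wise function is to call apply inside the callable given to assign. This is a sketch rather than the answer's own code: it uses a shortened copy of the data, and my_func here is an illustrative stand-in for whatever complex function is actually needed.

```python
import pandas as pd

# Shortened version of the question's data (an assumption made for brevity)
my_df = pd.DataFrame({'col1': [1, 2, 3, 4], 'col2': [11, 22, 33, 44]})

def my_func(row):
    # row is a Series holding one row, so a plain if/else works here
    if row['col3'] % 8 == 0:
        return 'bleep'
    return 'bloop'

out_df = my_df.assign(
    col3=lambda d: d['col1'] + d['col2'],
    # d already contains col3; apply with axis=1 passes one row at a time
    bleep_bloop=lambda d: d.apply(my_func, axis=1),
)

print(out_df)
```

This keeps my_df untouched and returns a new DataFrame with both col3 and bleep_bloop. Because apply with axis=1 runs Python code per row, it is likely to be noticeably slower on large frames than np.where or np.select, so it is probably best reserved for logic that cannot be vectorized.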
