在Pandas DataFrame的`assign`方法内使用`if`语句。

huangapple go评论100阅读模式
英文:

Using an `if` statement inside a Pandas DataFrame's `assign` method

问题

我正在尝试在需要使用if/else语句检查条件的几列上执行操作时遇到困难。

更具体地说,我正在尝试在Pandas Dataframe的assign方法的范围内执行此检查。以下是我尝试执行的示例:

  1. # 导入Pandas
  2. import pandas as pd
  3. # 创建合成数据
  4. my_df = pd.DataFrame({'col1':[1,2,3,4,5,6,7,8,9,10],
  5. 'col2':[11,22,33,44,55,66,77,88,99,1010]})
  6. # 创建一个单独的输出DataFrame,不覆盖原始输入DataFrame
  7. out_df = my_df.assign(
  8. # 使用lambda函数成功创建一个名为`col3`的新列
  9. col3=lambda row: row['col1'] + row['col2'],
  10. # 使用新的lambda函数对新生成的列执行操作。
  11. bleep_bloop=lambda row: 'bleep' if (row['col3']%8 == 0) else 'bloop')

上面的代码会产生ValueError

ValueError: The truth value of a Series is ambiguous

在尝试调查错误时,我发现了这个Stack Overflow线程。看起来lambda函数在DataFrame中不始终与条件逻辑一起很好地工作,主要是由于DataFrame尝试将事物处理为Series。

一些不太干净的解决方法

使用apply
一种不太干净的解决方法是如上所示使用assign方法创建col3,然后使用apply方法创建bleep_bloop列:

  1. out_sr = (my_df.assign(
  2. col3=lambda row: row['col1'] + row['col2'])
  3. .apply(lambda row: 'bleep' if (row['col3']%8 == 0)
  4. else 'bloop', axis=1))

这里的问题是上面的代码仅返回bleep_bloop列的结果,而不是具有col3bleep_bloop的新DataFrame。

即时执行与多个命令

另一种方法是将一个命令拆分为两个:

  1. out_df_2 = (my_df.assign(col3=lambda row: row['col1'] + row['col2']))
  2. out_df_2['bleep_bloop'] = out_df_2.apply(lambda row: 'bleep' if (row['col3']%8 == 0)
  3. else 'bloop', axis=1)

这也可以工作,但我真的想尽可能坚持即时执行的方法,其中我可以在一个链接的命令中执行所有操作。

回到主要问题

考虑到我上面显示的解决方法混乱且不能完全满足我的需求,是否有其他方法可以创建基于条件if/else语句的新列?

我在这里给出的示例非常简单,但请考虑实际应用可能涉及应用自定义函数(例如:out_df=my_df.assign(new_col=lambda row: my_func(row)),其中my_func是一个使用同一行的多个其他列作为输入的复杂函数)。

英文:

Intro and reproducible code snippet

I'm having a hard time performing an operation on a few columns that requires the checking of a condition using an if/else statement.

More specifically, I'm trying to perform this check within the confines of the assign method of a Pandas Dataframe. Here is an example of what I'm trying to do

  1. # Importing Pandas
  2. import pandas as pd
  3. # Creating synthetic data
  4. my_df = pd.DataFrame({'col1':[1,2,3,4,5,6,7,8,9,10],
  5. 'col2':[11,22,33,44,55,66,77,88,99,1010]})
  6. # Creating a separate output DataFrame that doesn't overwrite
  7. # the original input DataFrame
  8. out_df = my_df.assign(
  9. # Successfully creating a new column called `col3` using a lambda function
  10. col3=lambda row: row['col1'] + row['col2'],
  11. # Using a new lambda function to perform an operation on the newly
  12. # generated column.
  13. bleep_bloop=lambda row: 'bleep' if (row['col3']%8 == 0) else 'bloop')

The code above yeilds a ValueError:

ValueError: The truth value of a Series is ambiguous

When trying to investigate the error, I found this SO thread. It seems that lambda functions don't always work very nicely with conditional logic in a DataFrame, mostly due to the DataFrame's attempt to deal with things as Series.

A few dirty workarounds

Use apply

A dirty workaround would be to make col3 using the assign method as indicated above, but then create the bleep_bloop column using an apply method instead:

  1. out_sr = (my_df.assign(
  2. col3=lambda row: row['col1'] + row['col2'])
  3. .apply(lambda row: 'bleep' if (row['col3']%8 == 0)
  4. else 'bloop', axis=1))

The problem here is that the code above returns only a Series with the results of the bleep_bloop column instead of a new DataFrame with both col3 and bleep_bloop.

On the fly vs. multiple commands

Yet another approach would be to break one command into two:

  1. out_df_2 = (my_df.assign(col3=lambda row: row['col1'] + row['col2']))
  2. out_df_2['bleep_bloop'] = out_df_2.apply(lambda row: 'bleep' if (row['col3']%8 == 0)
  3. else 'bloop', axis=1)

This also works, but I'd really like to stick to the on-the-fly approach where I do everything in one chained command, if possible.

Back to the main question

Given that the workarounds I showed above are messy and don't really get the job done like I need, is there any other way I can create a new column that's based on using a conditional if/else statement?

The example I gave here is pretty simple, but consider that the real world application would likely involve applying custom-made functions (e.g.: out_df=my_df.assign(new_col=lambda row: my_func(row)), where my_func is some complex function that uses several other columns from the same row as inputs).

答案1

得分: 6

你的错误在于你认为lambda函数作用于行,实际上它以矢量化方式作用于整个列。你需要使用矢量化函数:

  1. import numpy as np
  2. out_df = my_df.assign(
  3. col3=lambda d: d['col1'] + d['col2'],
  4. bleep_bloop=lambda d: np.where(d['col3'] % 8, 'bloop', 'bleep')
  5. )
  6. print(out_df)

输出结果:

  1. col1 col2 col3 bleep_bloop
  2. 0 1 11 12 bloop
  3. 1 2 22 24 bleep
  4. 2 3 33 36 bloop
  5. 3 4 44 48 bleep
  6. 4 5 55 60 bloop
  7. 5 6 66 72 bleep
  8. 6 7 77 84 bloop
  9. 7 8 88 96 bleep
  10. 8 9 99 108 bloop
  11. 9 10 1010 1020 bloop
英文:

Your mistake is that you considered the lambda to act on rows, while it acts on full columns in a vectorized way. You need to use vectorized functions:

  1. import numpy as np
  2. out_df = my_df.assign(
  3. col3=lambda d: d['col1'] + d['col2'],
  4. bleep_bloop=lambda d: np.where(d['col3']%8, 'bloop', 'bleep')
  5. )
  6. print(out_df)

Output:

  1. col1 col2 col3 bleep_bloop
  2. 0 1 11 12 bloop
  3. 1 2 22 24 bleep
  4. 2 3 33 36 bloop
  5. 3 4 44 48 bleep
  6. 4 5 55 60 bloop
  7. 5 6 66 72 bleep
  8. 6 7 77 84 bloop
  9. 7 8 88 96 bleep
  10. 8 9 99 108 bloop
  11. 9 10 1010 1020 bloop

答案2

得分: 1

或者对于超过2个条件,您可以使用np.select:

  1. import numpy as np
  2. out_df = (my_df.assign(
  3. col3=lambda df_: df_['col1'] + df_['col2'],
  4. bleep_bloop=lambda df_: np.select(condlist=[df_['col3'] % 8 == 0,
  5. df_['col3'] % 8 == 1,
  6. df_['col3'] > 100],
  7. choicelist=['bleep',
  8. 'bloop',
  9. 'bliip'],
  10. default='bluup')))

np.select的好处是它像where(向量化函数,因此更快),您可以添加任意多个条件。

英文:

Or for more than 2 conditions you can use np.select:

  1. import numpy as np
  2. out_df=(my_df.assign(
  3. col3 = lambda df_ : df_['col1'] + df_['col2'],
  4. bleep_bloop=lambda df_: np.select(condlist=[df_['col3']%8==0,
  5. df_['col3']%8==1,
  6. df_['col3']>100 ],
  7. choicelist=['bleep',
  8. 'bloop',
  9. 'bliip'],
  10. default='bluup')))

The good thing about np.select is that it works like where(vectorized functions therefore faster) and you can put as many condition you want.

答案3

得分: 0

由于您的最终列需要复杂的逻辑,正如您所提到的,因此创建一个单独的函数并将其应用于行是有道理的。

  1. def my_func(x):
  2. if (x['col1'] + x['col2']) % 8 == 0:
  3. return 'bleep'
  4. else:
  5. return 'bloop'
  6. my_df['bleep_bloop'] = my_df.apply(lambda x: my_func(x), axis=1)

当您将 x 传递给函数时,实际上是将每一行传递给它,可以在函数内部使用 x['col1'] 等列的值。这样,您可以创建任何您需要的复杂函数。请注意,这里需要使用 axis=1 来传递行。

我没有包含创建 col3,仅提供一个示例。

英文:

Since you will be needing a complex logic in your final column, as you mentioned it makes sense to create a separate function for it and apply it to the rows.

  1. def my_func(x):
  2. if (x['col1'] + x['col2']) % 8 == 0:
  3. return 'bleep'
  4. else:
  5. return 'bloop'
  6. my_df['bleep_bloop'] = my_df.apply(lambda x: my_func(x), axis=1)

When you pass the x to the function, you are in fact passing each row and can use any of the column values inside your function like x['col1'] and so on. This way you can create as complex a function as you need. Note that axis=1 is required here to pass the rows.

I did not include creation of col3 just to provide a sample.

huangapple
  • 本文由 发表于 2023年2月10日 05:30:28
  • 转载请务必保留本文链接:https://go.coder-hub.com/75404602.html
匿名

发表评论

匿名网友

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen:

确定