Error when trying to apply a function to a dataframe


Question

I've written a function that takes a dictionary and a string and looks the string up among the dictionary's keys. It's fairly standard stuff, but I can't work out how to apply it to a dataframe. Here is the function I want to apply:

def key_check(d={}, key=''):
    new = {}
    if not key:
        return 
    if key in d:
        new[key] = d[key]
    else:
        new[key] = ''
    return new

And I'm attempting to apply it like this:

df["centre"] = df["centre"].apply(key_check, axis=1, args=df["name"])

I'm getting the error below, but I don't understand what could be ambiguous: the column `name` holds either a string or a null value, and the column `centre` holds a dictionary. How can I fix this?

File ~/opt/anaconda3/lib/python3.9/site-packages/pandas/core/series.py:4433, in Series.apply(self, func, convert_dtype, args, **kwargs)
   4323 def apply(
   4324     self,
   4325     func: AggFuncType,
   (...)
   4328     **kwargs,
   4329 ) -> DataFrame | Series:
   4330     """
   4331     Invoke function on values of Series.
   4332 
   (...)
   4431     dtype: float64
   4432     """
-> 4433     return SeriesApply(self, func, convert_dtype, args, kwargs).apply()

File ~/opt/anaconda3/lib/python3.9/site-packages/pandas/core/apply.py:1065, in SeriesApply.__init__(self, obj, func, convert_dtype, args, kwargs)
   1055 def __init__(
   1056     self,
   1057     obj: Series,
   (...)
   1061     kwargs,
   1062 ):
   1063     self.convert_dtype = convert_dtype
-> 1065     super().__init__(
   1066         obj,
   1067         func,
   1068         raw=False,
   1069         result_type=None,
   1070         args=args,
   1071         kwargs=kwargs,
   1072     )

File ~/opt/anaconda3/lib/python3.9/site-packages/pandas/core/apply.py:119, in Apply.__init__(self, obj, func, raw, result_type, args, kwargs)
    117 self.obj = obj
    118 self.raw = raw
--> 119 self.args = args or ()
    120 self.kwargs = kwargs or {}
    122 if result_type not in [None, "reduce", "broadcast", "expand"]:

File ~/opt/anaconda3/lib/python3.9/site-packages/pandas/core/generic.py:1527, in NDFrame.__nonzero__(self)
   1525 @final
   1526 def __nonzero__(self):
-> 1527     raise ValueError(
   1528         f"The truth value of a {type(self).__name__} is ambiguous. "
   1529         "Use a.empty, a.bool(), a.item(), a.any() or a.all()."
   1530     )

ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().

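# Answer 1
**Score**: 1

`df["centre"]` is a `Series`, not a `DataFrame`. It doesn't have an axis, and its `apply` method doesn't have an `axis=...` parameter. Look at its call signature [here](https://pandas.pydata.org/docs/reference/api/pandas.Series.apply.html#pandas-series-apply):
```python
Series.apply(func, convert_dtype=True, args=(), **kwargs)
```

There is no `axis` parameter, so it is relegated to `kwargs`. `args` is a tuple of positional arguments to be used on each call to the applied function. But pandas has this strange line, which causes your problem, as seen in the stack trace:

```python
self.args = args or ()
```

It's trying to set a default for `args` if `args` is any falsy value. That seems like a bug to me. `args` defaults to a tuple and is defined to take only a tuple, so why allow values like `""`, `False` and `None` to work? Anyway, this is what trips you up. You pass a `Series` for this parameter, and the truth-value test in the `or` raises the error. But you shouldn't be passing a `Series` when what you really want is the positional arguments.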
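The failure can be reproduced without `apply` at all. The sketch below is only an illustration: the `names` Series is made-up data, and the `or` expression stands in for the `args or ()` line from the stack trace.

```python
import pandas as pd

# Hypothetical stand-in for df["name"]
names = pd.Series(["one", "two", None])

# `names or ()` has to evaluate bool(names), and a Series refuses to do that.
try:
    chosen = names or ()
except ValueError as exc:
    message = str(exc)

print(message)

# A tuple that wraps the Series is an ordinary Python object:
# it is truthy simply because it is non-empty.
wrapped = (names,)
print(bool(wrapped))
```

Wrapping the value in a tuple avoids the truth-value test on the `Series` itself, although it does not by itself give a row-by-row lookup.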

If you really wanted to pass the `Series` to each call to `key_check`, you would do:

```python
df["centre"] = df["centre"].apply(key_check, args=(df["name"],))
```
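To see what `args` does in isolation, here is a minimal sketch. The names `lookup` and `centres` are invented for illustration and are not part of the question. The point is that the same extra arguments are handed to every call, so whatever goes into `args` is not matched up row by row.

```python
import pandas as pd

def lookup(d, key):
    # `d` is one element of the Series; `key` comes from `args`.
    return d.get(key, "")

# Hypothetical data: a Series whose elements are dictionaries
centres = pd.Series([{"one": 1}, {"two": 2}])

# The same extra argument ("one") is passed on every call.
result = centres.apply(lookup, args=("one",))
print(result.tolist())  # [1, '']
```

This is why passing the whole `name` column through `args` gives every call the entire column rather than the value belonging to the current row.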

But, cribbing off of Kevin Spaghetti's answer, I think you want a row-wise apply. And that means keeping "centre" and "name" together as row data.

```python
import pandas as pd

df = pd.DataFrame({
    'centre': [{'one': 1}, {'two': 2}, {'three': 3}, {'fubar': 4}],
    'name': ['one', 'two', '', 'four']
})

print(df)

print("=================================")

def key_check(row):
    # row is a Series holding one row's "centre" dict and "name" string
    key = row["name"]
    if key:
        # keep only the matching entry, or "" if the key is missing
        return {key: row["centre"].get(key, "")}
    return None

df["centre"] = df[['centre', 'name']].apply(key_check, axis=1)

print(df)
```

Output:

```
         centre  name
0    {'one': 1}   one
1    {'two': 2}   two
2  {'three': 3}      
3  {'fubar': 4}  four
=================================
         centre  name
0    {'one': 1}   one
1    {'two': 2}   two
2          None      
3  {'four': ''}  four
```
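As an alternative to a row-wise `apply`, the two columns can be paired up with `zip`, which keeps the original two-argument shape of `key_check`. This is a sketch under assumptions rather than the answer's method: the fifth row and the `pd.isna` check are additions, made because the question says `name` may be null and a `NaN` value is truthy in Python, so `if not key` alone would not catch it.

```python
import pandas as pd

df = pd.DataFrame({
    "centre": [{"one": 1}, {"two": 2}, {"three": 3}, {"fubar": 4}, {"x": 5}],
    "name": ["one", "two", "", "four", None],
})

def key_check(d, key):
    # Assumption: null or empty names give None, as in the row-wise version.
    if pd.isna(key) or not key:
        return None
    # Keep only the matching entry, or "" if the key is missing.
    return {key: d.get(key, "")}

# Walk both columns in step, one (dict, name) pair per row.
df["centre"] = [key_check(d, k) for d, k in zip(df["centre"], df["name"])]
print(df["centre"].tolist())
# [{'one': 1}, {'two': 2}, None, {'four': ''}, None]
```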

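# Answer 2
**Score**: 0

Print the `key` argument inside the function and you will see that it is not a single record but the whole column (a `Series`).
This is why `if not key` fails: pandas does not implicitly convert a `Series` to a truth value (which is why it tells you to use the `all()` or `any()` methods).

What I think you want to do is: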

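```python
import pandas as pd

df = pd.DataFrame({
    'centre': [{'one': 1}, {'two': 2}, {'three': 3}],
    'name': ['one', 'two', 'three']
})

def check(row):
    # show what apply hands over for each row
    print(row['centre'])
    print(row['name'])

    # look the row's name up in the row's dictionary
    return row['centre'][row['name']]

df[['centre', 'name']].apply(check, axis=1)
```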

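This lets `apply` access every element of the current row (`axis=1`) and return the value stored in that row's dictionary under its name.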
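Indexing the dictionary directly raises `KeyError` when a name is not one of its keys. A hedged variant is sketched below, assuming an empty string is an acceptable fallback as in the question's own `key_check`. The sample data is made up, with the third row deliberately mismatched.

```python
import pandas as pd

df = pd.DataFrame({
    "centre": [{"one": 1}, {"two": 2}, {"fubar": 4}],
    "name": ["one", "two", "four"],
})

def check(row):
    # .get returns the fallback instead of raising KeyError for a missing key.
    return row["centre"].get(row["name"], "")

values = df.apply(check, axis=1)
print(values.tolist())  # [1, 2, '']
```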


huangapple
  • Published on 2023-06-09 00:18:26
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76433888.html