Error when trying to apply a function to a dataframe
Question
I've written a function that takes a dictionary and a key string and looks the key up in the dictionary. It's fairly standard stuff, but I can't work out how to apply it to a dataframe. Here is the function I want to apply:

```python
def key_check(d={}, key=''):
    new = {}
    if not key:
        return
    if key in d:
        new[key] = d[key]
    else:
        new[key] = ''
    return new
```

And I'm attempting to apply it like this:

```python
df["centre"] = df["centre"].apply(key_check, axis=1, args=df["name"])
```

I'm getting the error below, but I don't understand what could be ambiguous: the column `name` holds either a string or a null value, and the column `centre` holds dictionaries. How can I fix this?
File ~/opt/anaconda3/lib/python3.9/site-packages/pandas/core/series.py:4433, in Series.apply(self, func, convert_dtype, args, **kwargs)
4323 def apply(
4324 self,
4325 func: AggFuncType,
(...)
4328 **kwargs,
4329 ) -> DataFrame | Series:
4330 """
4331 Invoke function on values of Series.
4332
(...)
4431 dtype: float64
4432 """
-> 4433 return SeriesApply(self, func, convert_dtype, args, kwargs).apply()
File ~/opt/anaconda3/lib/python3.9/site-packages/pandas/core/apply.py:1065, in SeriesApply.__init__(self, obj, func, convert_dtype, args, kwargs)
1055 def __init__(
1056 self,
1057 obj: Series,
(...)
1061 kwargs,
1062 ):
1063 self.convert_dtype = convert_dtype
-> 1065 super().__init__(
1066 obj,
1067 func,
1068 raw=False,
1069 result_type=None,
1070 args=args,
1071 kwargs=kwargs,
1072 )
File ~/opt/anaconda3/lib/python3.9/site-packages/pandas/core/apply.py:119, in Apply.__init__(self, obj, func, raw, result_type, args, kwargs)
117 self.obj = obj
118 self.raw = raw
--> 119 self.args = args or ()
120 self.kwargs = kwargs or {}
122 if result_type not in [None, "reduce", "broadcast", "expand"]:
File ~/opt/anaconda3/lib/python3.9/site-packages/pandas/core/generic.py:1527, in NDFrame.__nonzero__(self)
1525 @final
1526 def __nonzero__(self):
-> 1527 raise ValueError(
1528 f"The truth value of a {type(self).__name__} is ambiguous. "
1529 "Use a.empty, a.bool(), a.item(), a.any() or a.all()."
1530 )
ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
# Answer 1
**Score**: 1
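`df["centre"]` is a `Series`, not a `DataFrame`. It has no second axis, and its `apply` method has no `axis=...` parameter. Look at its call signature [here](https://pandas.pydata.org/docs/reference/api/pandas.Series.apply.html#pandas-series-apply):

```python
Series.apply(func, convert_dtype=True, args=(), **kwargs)
```

There is no `axis` parameter, so `axis=1` is relegated to `kwargs`. `args` is a tuple of positional arguments to be used on each call to the applied function. But pandas has this strange line that causes your problem, as seen in the stack trace:

```python
self.args = args or ()
```

It is trying to set a default for `args` if `args` is any falsy value. That seems like a bug to me: `args` defaults to a tuple and is defined to take only a tuple, so why make values like `""`, `False` and `None` work too? Anyway, this is what trips you up. You pass a `Series` for this parameter, and the truth-value test in the `or` raises the error. But you shouldn't be passing a bare `Series` when what `args` expects is a tuple of positional arguments.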
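To see the failure in isolation, here is a minimal sketch that repeats the same kind of `or` expression outside of `apply`. The variable names (`names`, `message`) are made up for illustration and the data is invented; only the `args or ()` shape comes from the stack trace.

```python
import pandas as pd

# Hypothetical stand-in for df["name"]
names = pd.Series(["one", "two", ""])

message = None
try:
    # `or` has to evaluate bool(names) first, and a Series refuses to do that
    args = names or ()
except ValueError as exc:
    message = str(exc)
print(message)

# Wrapping the Series in a tuple avoids the problem: the truth test is now
# on the tuple (non-empty, so truthy) and the Series itself is never tested
args = (names,) or ()
print(len(args))
```

The second expression succeeds only because the tuple, not the Series, is what gets tested for truthiness.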
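If you really wanted to pass the whole `Series` on each call to `key_check`, you would wrap it in a tuple:

```python
df["centre"] = df["centre"].apply(key_check, args=(df["name"],))
```

With the `key_check` from the question, though, `key` would then be the entire `name` column on every call, and `if not key` would raise the same `ValueError`.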
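As a rough illustration of how `Series.apply` forwards `args`, the sketch below passes a single fixed key. The helper `lookup` and the sample data are hypothetical; the point is that every call receives the same extra argument, which is why `args` cannot supply a different key per row.

```python
import pandas as pd

def lookup(d, key):
    # d is one element of the Series; key comes from the args tuple
    return d.get(key, "")

centres = pd.Series([{"one": 1}, {"two": 2}])

# Each call is lookup(element, "one"): the extra argument never changes
result = centres.apply(lookup, args=("one",))
print(result.tolist())  # [1, '']
```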
But, cribbing off of Kevin Spaghetti's answer, I think you want a row-wise apply. That means keeping both `centre` and `name` in the frame, so that each row carries its own dictionary and key.

```python
import pandas as pd

df = pd.DataFrame({
    'centre': [{'one': 1}, {'two': 2}, {'three': 3}, {'fubar': 4}],
    'name': ['one', 'two', '', 'four']
})
print(df)
print("=================================")

def key_check(row):
    # row is one row of the frame, indexed by column name
    key = row["name"]
    if key:
        return {key: row["centre"].get(key, "")}
    return None

df["centre"] = df[['centre', 'name']].apply(key_check, axis=1)
print(df)
```

Output:

```text
         centre  name
0    {'one': 1}   one
1    {'two': 2}   two
2  {'three': 3}
3  {'fubar': 4}  four
=================================
         centre  name
0    {'one': 1}   one
1    {'two': 2}   two
2          None
3  {'four': ''}  four
```
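If you would rather keep the two-argument form of `key_check` from the question, one possible alternative is to pair the columns with `zip` instead of using `apply`. This is only a sketch: the `pd.isna` check is an assumption added to cover the null names mentioned in the question, and `result` is a name chosen here.

```python
import pandas as pd

df = pd.DataFrame({
    "centre": [{"one": 1}, {"two": 2}, {"three": 3}, {"fubar": 4}],
    "name": ["one", "two", "", "four"],
})

def key_check(d, key):
    # Assumption: treat null names (None/NaN) the same as empty strings
    if pd.isna(key) or not key:
        return None
    return {key: d.get(key, "")}

# zip walks both columns in step, so each call gets one dict and one key
result = [key_check(d, key) for d, key in zip(df["centre"], df["name"])]
df["centre"] = result
print(df)
```

This produces the same values as the row-wise `apply` above for this sample data, without building a row `Series` for every call.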
# Answer 2
**Score**: 0
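The `name` argument you pass is not a single record but the whole column (a `Series`). pandas does not implicitly convert a `Series` to a truth value, which is why the message tells you to use the `all()` and `any()` methods. In your traceback the failing truth test is `args or ()` inside pandas; `if not key` would fail the same way if the whole column reached your function.

What I think you want to do is:

```python
import pandas as pd

df = pd.DataFrame({
    'centre': [{'one': 1}, {'two': 2}, {'three': 3}],
    'name': ['one', 'two', 'three']
})

def check(row):
    print(row['centre'])
    print(row['name'])
    return row['centre'][row['name']]

df[['centre', 'name']].apply(check, axis=1)
```

With `axis=1`, `apply` passes each row to `check`, so the function can access every element of the current row and return the value stored under that row's `name`.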
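To make it concrete what the function receives under `axis=1`, here is a small sketch that records the type and index of each `row`. The helper name `inspect_row` and the `seen` list are invented for this illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "centre": [{"one": 1}, {"two": 2}],
    "name": ["one", "two"],
})

seen = []

def inspect_row(row):
    # Record what apply hands over: a Series indexed by the column names
    seen.append((type(row).__name__, list(row.index)))
    return row["centre"][row["name"]]

values = df.apply(inspect_row, axis=1)
print(seen)
print(values.tolist())
```

Note that indexing with `row["centre"][row["name"]]` raises a `KeyError` when the name is missing from the dictionary, so `.get` may be the safer choice if that can happen in your data.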