How can I add new attributes to a pandas.DataFrame derived class?
Question
I want to create a class derived from `pandas.DataFrame` with a slightly different `__init__()`. I'll store some additional data in new attributes and then call `DataFrame.__init__()`.

```python
from pandas import DataFrame

class DataFrameDerived(DataFrame):
    def __init__(self, *args, **kwargs):
        self.derived = True  # this assignment triggers the RecursionError
        super().__init__(*args, **kwargs)

DataFrameDerived({'a':[1,2,3]})
```

This code raises the following error when creating the new attribute (`self.derived = True`):

> RecursionError: maximum recursion depth exceeded while calling a Python object
# Answer 1
**Score**: 0
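It is *possible*, but the implementation isn't very open to extension. Indeed, the [official docs](https://pandas.pydata.org/docs/development/extending.html#subclassing-pandas-data-structures) suggest using alternatives. The implementation of `pd.DataFrame` is complex: it involves multiple inheritance with various mixins, and it uses attribute hooks such as `__getattr__` and `__setattr__` to provide, among other things, syntactic sugar that lets `df.some_column` and `df.some_column = whatever` work in place of the `df['some_column']` syntax. If you look at the stack trace, you can see that *something* is going on with `__setattr__`: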
```
RecursionError                            Traceback (most recent call last)
Cell In[1], line 8
      5         self.derived = True
      6         super().__init__(*args, **kwargs)
----> 8 DataFrameDerived({'a':[1,2,3]})

Cell In[1], line 5, in DataFrameDerived.__init__(self, *args, **kwargs)
      4     def __init__(self, *args, **kwargs):
----> 5         self.derived = True
      6         super().__init__(*args, **kwargs)

File ~/miniconda3/envs/py311/lib/python3.11/site-packages/pandas/core/generic.py:6014, in NDFrame.__setattr__(self, name, value)
   6012 else:
   6013     try:
-> 6014         existing = getattr(self, name)
   6015         if isinstance(existing, Index):
   6016             object.__setattr__(self, name, value)

File ~/miniconda3/envs/py311/lib/python3.11/site-packages/pandas/core/generic.py:5986, in NDFrame.__getattr__(self, name)
   5976 """
   5977 After regular attribute access, try looking up the name
   5978 This allows simpler access to columns for interactive use.
   5979 """
   5980 # Note: obj.x will always call obj.__getattribute__('x') prior to
   5981 # calling obj.__getattr__('x').
   5982 if (
   5983     name not in self._internal_names_set
   5984     and name not in self._metadata
   5985     and name not in self._accessors
-> 5986     and self._info_axis._can_hold_identifiers_and_holds_name(name)
   5987 ):
   5988     return self[name]
   5989 return object.__getattribute__(self, name)
```
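To make the mechanism in that trace easier to see, here is a toy model in plain Python. It is not pandas' code: the classes `Base`, `Early`, and `Late` and the `_columns` attribute are made up for illustration. It only assumes the same general shape, namely a `__setattr__` that first calls `getattr`, and a `__getattr__` that consults internal state which does not exist until the base `__init__` has run.

```python
class Base:
    """Hypothetical stand-in for a class with attribute hooks (not pandas)."""

    def __init__(self):
        # Bypass our own __setattr__ hook when creating internal state.
        object.__setattr__(self, "_columns", ["a"])

    def __getattr__(self, name):
        # Runs only after normal lookup fails. If _columns is not set yet,
        # looking it up here re-enters __getattr__ and never terminates.
        if name in self._columns:
            return f"column {name}"
        raise AttributeError(name)

    def __setattr__(self, name, value):
        try:
            getattr(self, name)  # may fall through to __getattr__
        except AttributeError:
            pass
        object.__setattr__(self, name, value)


class Early(Base):
    def __init__(self):
        self.derived = True  # set before Base.__init__ has run
        super().__init__()


class Late(Base):
    def __init__(self):
        super().__init__()
        self.derived = True  # internal state exists by now


try:
    Early()
except RecursionError as exc:
    print("Early failed:", type(exc).__name__)

print("Late works:", Late().derived)
```

In this sketch, moving the assignment after `super().__init__()` is enough, because the hook's internal state already exists by then. Whether the same reordering is sufficient for a real `DataFrame` subclass depends on pandas internals, so treat it as an illustration rather than a guarantee.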
Knowing this, one might *blindly* just use `object.__setattr__` instead to bypass it:

```
In [1]: from pandas import DataFrame
   ...:
   ...: class DataFrameDerived(DataFrame):
   ...:     def __init__(self, *args, **kwargs):
   ...:         object.__setattr__(self, 'derived', True)
   ...:         super().__init__(*args, **kwargs)
   ...:
   ...: DataFrameDerived({'a':[1,2,3]})
Out[1]:
   a
0  1
1  2
2  3
```
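But again, without really understanding the implementation, you are just crossing your fingers and hoping it works, which it may. As noted in the linked docs, you will probably also want to [override the "constructor" properties so that dataframe methods return data frames of your own type](https://pandas.pydata.org/docs/development/extending.html#override-constructor-properties).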
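As a rough sketch of what the linked docs describe, the subclass below lists its extra attribute in `_metadata` and overrides the `_constructor` property. Both names come from the pandas subclassing docs; the attribute name `derived` is taken from the question, and setting it after `super().__init__()` is my own choice here, not something the docs prescribe. Exactly which methods propagate `_metadata` attributes can vary between pandas versions, so check the behaviour on the version you use.

```python
import pandas as pd


class DataFrameDerived(pd.DataFrame):
    # Names listed in _metadata are stored as plain instance attributes
    # and are carried over by pandas methods that call __finalize__.
    _metadata = ["derived"]

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.derived = True

    @property
    def _constructor(self):
        # Lets pandas methods build their results as this subclass.
        return DataFrameDerived


df = DataFrameDerived({"a": [1, 2, 3]})
copied = df.copy()
print(type(copied).__name__, copied.derived)
```

Note that pandas may call this `__init__` itself when it builds results, so the constructor has to keep accepting whatever `DataFrame.__init__` accepts.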
Instead of using inheritance, [an alternative is to register custom accessor namespaces](https://pandas.pydata.org/docs/development/extending.html#registering-custom-accessors). This is a simpler way to extend pandas, if it works for you.
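A minimal sketch of the accessor approach could look like the following. `pd.api.extensions.register_dataframe_accessor` is the documented pandas decorator; the namespace `extra`, the class `ExtraAccessor`, and its `column_count` method are hypothetical names chosen for this example.

```python
import pandas as pd


@pd.api.extensions.register_dataframe_accessor("extra")
class ExtraAccessor:
    """Hypothetical accessor holding extra state next to a DataFrame."""

    def __init__(self, pandas_obj):
        self._obj = pandas_obj  # the DataFrame this accessor is attached to
        self.derived = True     # extra data lives on the accessor

    def column_count(self):
        # Example helper that reads from the wrapped DataFrame.
        return len(self._obj.columns)


df = pd.DataFrame({"a": [1, 2, 3]})
print(df.extra.derived, df.extra.column_count())
```

The trade-off is that the extra state belongs to the accessor of one particular frame. As far as I know, it is not copied to frames derived from it, so this fits helper methods better than metadata that has to survive operations.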
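Without knowing more about what exactly you are trying to accomplish, it is difficult to suggest the best way forward. But you should definitely start by reading all of the docs I've linked on [Extending pandas](https://pandas.pydata.org/docs/development/extending.html#extending-pandas).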