How to support modified data interpretations in NumPy ndarrays?
Question
I am trying to write a Python 3 class that stores some data in a NumPy np.ndarray. However, I want my class to also contain a piece of information about how to interpret the data values.
For example, let's assume the dtype of the ndarray is np.float32, but there is also a "color" that modifies the meaning of those floating-point values. So, if I want to add a red number and a blue number, I must first convert both numbers to magenta in order to legally add their underlying _data arrays. The result of the addition will then have _color = "magenta".
This is just a toy example. In reality, the "color" is not a string (it's better to think of it as an integer), the "color" of the result is mathematically determined from the "color" of the two inputs, and the conversion between any two "colors" is mathematically defined.
import numpy as np

class MyClass:
    def __init__(self, data : np.ndarray, color : str):
        self._data = data
        self._color = color

    # Example: Adding red numbers and blue numbers produces magenta numbers
    def convert(self, other_color):
        if self._color == "red" and other_color == "blue":
            return MyClass(10*self._data, "magenta")
        elif self._color == "blue" and other_color == "red":
            return MyClass(self._data/10, "magenta")
        # The toy example only covers the red/blue pair
        raise ValueError(f"no conversion defined for {self._color} with {other_color}")

    def __add__(self, other):
        if other._color == self._color:
            # If the colors match, then just add the data values
            return MyClass(self._data + other._data, self._color)
        else:
            # If the colors don't match, then convert to the output color before adding
            new_self = self.convert(other._color)
            new_other = other.convert(self._color)
            return new_self + new_other
My problem is that the _color information lives alongside the _data. So, I can't seem to define sensible indexing behavior for my class:
- If I define __getitem__ to return self._data[i], then the _color information is lost.
- If I define __getitem__ to return MyClass(self._data[i], self._color), then I'm creating a new object that contains a scalar number. This will cause plenty of problems (for example, I can legally index that_object[i], which is certain to raise an error).
- If I define __getitem__ to return MyClass(self._data[i:i+1], self._color), then I'm indexing an array to get an array, which leads to plenty of other problems. For example, my_object[i] = my_object[i] looks sensible, but would throw an error.
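To make the three options concrete, here is a minimal sketch. ScalarWrap and SliceWrap are hypothetical stand-ins for MyClass with the second and third __getitem__ variants, and the third case assumes no __setitem__ has been written yet.

```python
import numpy as np


class ScalarWrap:
    """Option 2: wrap whatever self._data[i] returns, even a NumPy scalar."""

    def __init__(self, data, color):
        self._data = data
        self._color = color

    def __getitem__(self, i):
        return ScalarWrap(self._data[i], self._color)


class SliceWrap:
    """Option 3: always return a length-1 array; no __setitem__ is defined."""

    def __init__(self, data, color):
        self._data = data
        self._color = color

    def __getitem__(self, i):
        return SliceWrap(self._data[i:i + 1], self._color)


data = np.arange(4, dtype=np.float32)

# Option 1: indexing the raw array gives a bare NumPy scalar with no color.
plain = data[2]
option1_keeps_color = hasattr(plain, "_color")

# Option 2: the wrapped scalar still accepts [] syntactically,
# but NumPy refuses to index a scalar.
item = ScalarWrap(data, "red")[2]
try:
    item[0]
    option2_error = None
except IndexError as exc:
    option2_error = exc

# Option 3: without __setitem__, writing an element back fails.
wrapped = SliceWrap(data, "red")
try:
    wrapped[1] = wrapped[1]
    option3_error = None
except TypeError as exc:
    option3_error = exc

print(option1_keeps_color, type(option2_error).__name__, type(option3_error).__name__)
```

Adding a __setitem__ would change the third outcome, but it would then have to decide what to do when the assigned value carries a different color.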
I then started thinking that what I really want is a different dtype for each different "color". That way, the indexed value would have the "color" information encoded for free in the dtype ... but I don't know how to implement that.
The theoretical total number of "colors" is likely to be roughly 100,000. However, fewer than 100 would be used in any single script execution. So, I guess it may be possible to maintain a list/dictionary/? of the used "colors" and how they map to dynamically generated classes ... but Python tends to quietly convert types in ways I don't expect, so that is probably not the right path to go down.
All I know is that I don't want to store the "color" alongside every data value. The data arrays can be ~billions of entries, with one "color" for all entries.
How can I keep track of this "color" information, while also having a usable class?
Answer 1
Score: 1
It's not tenable to define every dunder (__add__, etc.). It's also probably not tenable to inherit from np.ndarray, which is what it would take to get a compatible derived class.
You can get away with a thin wrapper:
from typing import NamedTuple, Sequence, Any, Callable

import numpy as np

Colour = int
RED: Colour = 0x0000FF
MAGENTA: Colour = 0xFF00FF
BLUE: Colour = 0xFF0000


def common_colour(colours: Sequence[Colour]) -> Colour:
    # magic happens here
    return sum(colours)


class ColouredArray(NamedTuple):
    colour: Colour
    data: np.ndarray

    def __str__(self) -> str:
        return f'({self.colour}) {self.data}'

    def convert(self, new_colour: Colour) -> 'ColouredArray':
        return ColouredArray(
            # magic happens here
            data=self.data * new_colour/self.colour,
            colour=new_colour,
        )


def all_common(arrays: Sequence['ColouredArray']) -> tuple['ColouredArray', ...]:
    # Convert every array to the colour they have in common
    new_colour = common_colour([a.colour for a in arrays])
    return tuple(
        array.convert(new_colour) for array in arrays
    )


def call_common(method: Callable, *args, **kwargs) -> tuple[Colour, Any]:
    # Find the common colour of all coloured arguments, convert them,
    # and call the method on the raw arrays; other arguments pass through
    new_colour = common_colour([
        arg.colour
        for arg in (*args, *kwargs.values())
        if isinstance(arg, ColouredArray)
    ])
    return new_colour, method(
        *(
            arg.convert(new_colour).data if isinstance(arg, ColouredArray) else arg
            for arg in args
        ),
        **{
            k: arg.convert(new_colour).data if isinstance(arg, ColouredArray) else arg
            for k, arg in kwargs.items()
        },
    )


y = ColouredArray(*call_common(
    np.interp,
    x=ColouredArray(RED, np.arange(5)),
    xp=ColouredArray(RED, np.arange(1, 30, 2)),
    fp=ColouredArray(BLUE, np.arange(11, 40, 2)),
    left=-1,
))
print(y)
In this example, the tuple from call_common is unpacked directly into the ColouredArray constructor because interp returns a single np.ndarray. In other situations, such as the 4-tuple returned from numpy.linalg.lstsq, it would be up to the caller to unpack and reconstruct a coloured array as needed.
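As a rough sketch of that lstsq case: the block below repeats a condensed ColouredArray and call_common so that it runs on its own, and keeps the same placeholder colour arithmetic as above. Re-wrapping only the solution vector is an assumption made for illustration; whether the residuals or singular values should also carry a colour depends on what the colours mean.

```python
from typing import Any, Callable, NamedTuple

import numpy as np

Colour = int
RED: Colour = 0x0000FF
BLUE: Colour = 0xFF0000


class ColouredArray(NamedTuple):
    colour: Colour
    data: np.ndarray

    def convert(self, new_colour: Colour) -> 'ColouredArray':
        # Placeholder conversion, same as in the example above
        return ColouredArray(new_colour, self.data * new_colour / self.colour)


def call_common(method: Callable, *args, **kwargs) -> tuple[Colour, Any]:
    # Condensed call_common: pick a common colour (placeholder: the sum),
    # convert each coloured argument, and call method on the raw arrays.
    new_colour = sum(
        arg.colour
        for arg in (*args, *kwargs.values())
        if isinstance(arg, ColouredArray)
    )

    def raw(arg):
        return arg.convert(new_colour).data if isinstance(arg, ColouredArray) else arg

    return new_colour, method(
        *(raw(arg) for arg in args),
        **{k: raw(arg) for k, arg in kwargs.items()},
    )


a = ColouredArray(RED, np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]))
b = ColouredArray(BLUE, np.array([1.0, 2.0, 3.0]))

# lstsq returns (solution, residuals, rank, singular values), so the caller
# unpacks the 4-tuple and decides which pieces to wrap again.
colour, (solution, residuals, rank, singular_values) = call_common(
    np.linalg.lstsq, a, b, rcond=None,
)
x = ColouredArray(colour, solution)
print(x.colour, x.data.shape, rank)
```

The rank is a plain integer, so it is left unwrapped here.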