How to support modified data interpretations in NumPy ndarrays?

Question

I am trying to write a Python 3 class that stores some data in a NumPy np.ndarray. However, I want my class to also contain a piece of information about how to interpret the data values.

For example, let's assume the dtype of the ndarray is np.float32, but there is also a "color" that modifies the meaning of those floating-point values. So, if I want to add a red number and a blue number, I must first convert both numbers to magenta in order to legally add their underlying _data arrays. The result of the addition will then have _color = "magenta".

This is just a toy example. In reality, the "color" is not a string (it's better to think of it as an integer), the "color" of the result is mathematically determined from the "color" of the two inputs, and the conversion between any two "colors" is mathematically defined.

import numpy as np


class MyClass:

    def __init__(self, data: np.ndarray, color: str):
        self._data = data
        self._color = color

    # Example: Adding red numbers and blue numbers produces magenta numbers
    def convert(self, other_color):
        if self._color == "red" and other_color == "blue":
            return MyClass(10*self._data, "magenta")
        elif self._color == "blue" and other_color == "red":
            return MyClass(self._data/10, "magenta")
        # The toy example defines no other conversions
        raise ValueError(f"no conversion from {self._color} given {other_color}")

    def __add__(self, other):
        if other._color == self._color:
            # If the colors match, then just add the data values
            return MyClass(self._data + other._data, self._color)
        else:
            # If the colors don't match, then convert to the output color before adding
            new_self = self.convert(other._color)
            new_other = other.convert(self._color)
            return new_self + new_other

My problem is that the _color information lives alongside the _data. So, I can't seem to define sensible indexing behavior for my class:

  • If I define __getitem__ to return self._data[i], then the _color information is lost.
  • If I define __getitem__ to return MyClass(self._data[i], self._color), then I'm creating a new object that contains a scalar number. This will cause plenty of problems (for example, I can legally write that_object[i], which is certain to raise an error).
  • If I define __getitem__ to return MyClass(self._data[i:i+1], self._color) then I'm indexing an array to get an array, which leads to plenty of other problems. For example, my_object[i] = my_object[i] looks sensible, but would throw an error.

I then started thinking that what I really want is a different dtype for each different "color". That way, the indexed value would have the "color" information encoded for free in the dtype... but I don't know how to implement that.
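One hedged way to explore the per-"color" dtype idea is NumPy's dtype `metadata` argument, which attaches a read-only mapping to a dtype. The sketch below is only an illustration, not something proposed in the question or the answer: the `"color"` key and the value 7 are made up. NumPy does not promise that metadata survives arithmetic, and indexing a single element is expected to give a plain np.float32 scalar without the tag, so this may not deliver the "for free" behavior hoped for.

```python
import numpy as np

# Hypothetical: tag a float32 dtype with an integer "color" (7 is arbitrary).
red_f32 = np.dtype(np.float32, metadata={"color": 7})

data = np.zeros(4, dtype=red_f32)

# The array and its slices carry the tag; each value is still a 4-byte float.
print(data.dtype.metadata["color"])
print(data[1:3].dtype.metadata["color"])
print(data.itemsize)
```

The tag costs nothing per element, which fits the requirement of one "color" for billions of entries, but any conversion logic would still have to live outside the dtype.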

The theoretical total number of "colors" is likely to be roughly 100,000. However, fewer than 100 would be used in any single script execution. So, I guess it may be possible to maintain a list/dictionary/? of the used "colors" and how they map to dynamically generated classes ... but Python tends to quietly convert types in ways I don't expect, so that is probably not the right path to go down.

All I know is that I don't want to store the "color" alongside every data value. The data arrays can be ~billions of entries, with one "color" for all entries.

How can I keep track of this "color" information, while also having a usable class?

Answer 1

Score: 1

It's not tenable to define every dunder method (__add__, etc.) by hand. Inheriting from np.ndarray, which is what a fully compatible derived class would require, is probably not tenable either.
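To give a feel for what inheriting would involve, here is a minimal sketch that follows the subclassing pattern from the NumPy documentation (`__new__` plus `__array_finalize__`). The class name `ColourArray` and the integer colours 1 and 2 are made up for illustration. It shows that slices keep the colour, that scalar indexing loses it, and that arithmetic returns a subclass instance whose colour is simply inherited from an input, with no conversion performed.

```python
import numpy as np


class ColourArray(np.ndarray):
    """Hypothetical ndarray subclass carrying one colour per array."""

    def __new__(cls, input_array, colour=None):
        # View the input as this subclass and attach the colour.
        obj = np.asarray(input_array).view(cls)
        obj.colour = colour
        return obj

    def __array_finalize__(self, obj):
        # Called for views, slices and ufunc results; copy the colour along.
        if obj is None:
            return
        self.colour = getattr(obj, "colour", None)


red = ColourArray(np.arange(5, dtype=np.float32), colour=1)
blue = ColourArray(np.ones(5, dtype=np.float32), colour=2)

sliced = red[1:3]   # still a ColourArray with colour 1
scalar = red[0]     # a plain NumPy scalar: the colour is gone
mixed = red + blue  # a ColourArray, but no colour conversion took place
```

Making `mixed` come out with a correctly converted colour would mean intercepting every operation (for example through `__array_ufunc__`), which is the effort the thin wrapper avoids.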

You can get away with a thin wrapper:

from typing import NamedTuple, Sequence, Any, Callable

import numpy as np

Colour = int
RED: Colour = 0x0000FF
MAGENTA: Colour = 0xFF00FF
BLUE: Colour = 0xFF0000


def common_colour(colours: Sequence[Colour]) -> Colour:
    # magic happens here
    return sum(colours)


class ColouredArray(NamedTuple):
    colour: Colour
    data: np.ndarray

    def __str__(self) -> str:
        return f'({self.colour}) {self.data}'

    def convert(self, new_colour: Colour) -> 'ColouredArray':
        return ColouredArray(
            # magic happens here
            data=self.data * new_colour/self.colour,
            colour=new_colour,
        )


def all_common(arrays: Sequence['ColouredArray']) -> tuple['ColouredArray', ...]:
    new_colour = common_colour([a.colour for a in arrays])
    return tuple(
        array.convert(new_colour) for array in arrays
    )


def call_common(method: Callable, *args, **kwargs) -> tuple[Colour, Any]:
    new_colour = common_colour([
        arg.colour
        for arg in (*args, *kwargs.values())
        if isinstance(arg, ColouredArray)
    ])
    return new_colour, method(
        *(
            arg.convert(new_colour).data if isinstance(arg, ColouredArray) else arg
            for arg in args
        ),
        **{
            k: arg.convert(new_colour).data if isinstance(arg, ColouredArray) else arg
            for k, arg in kwargs.items()
        },
    )


y = ColouredArray(*call_common(
    np.interp,
    x=ColouredArray(RED, np.arange(5)),
    xp=ColouredArray(RED, np.arange(1, 30, 2)),
    fp=ColouredArray(BLUE, np.arange(11, 40, 2)),
    left=-1,
))
print(y)

In this example, the tuple from call_common is unpacked directly into the ColouredArray constructor because np.interp returns a single np.ndarray. In other situations, such as the 4-tuple returned from numpy.linalg.lstsq, it would be up to the caller to unpack the result and reconstruct coloured arrays as needed.
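The lstsq case might look roughly like the sketch below. To stay self-contained it uses `call_same_colour`, a hypothetical, simplified stand-in for call_common that assumes every coloured argument already shares one colour and therefore converts nothing. Which of the four returned values should carry a colour, and which colour, depends on the real colour arithmetic; re-wrapping only the solution is an assumption made for illustration.

```python
from typing import Any, Callable, NamedTuple

import numpy as np

Colour = int
RED: Colour = 0x0000FF


class ColouredArray(NamedTuple):
    colour: Colour
    data: np.ndarray


def unwrap(arg: Any) -> Any:
    # Pass plain values through; strip the colour from coloured arrays.
    return arg.data if isinstance(arg, ColouredArray) else arg


def call_same_colour(method: Callable, *args: Any, **kwargs: Any) -> tuple[Colour, Any]:
    # Hypothetical stand-in for call_common: requires a single shared colour.
    colours = {
        arg.colour
        for arg in (*args, *kwargs.values())
        if isinstance(arg, ColouredArray)
    }
    if len(colours) != 1:
        raise ValueError("expected exactly one colour among the arguments")
    return colours.pop(), method(
        *(unwrap(arg) for arg in args),
        **{k: unwrap(arg) for k, arg in kwargs.items()},
    )


a = ColouredArray(RED, np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]))
b = ColouredArray(RED, np.array([1.0, 2.0, 3.0]))

colour, result = call_same_colour(np.linalg.lstsq, a, b, rcond=None)

# lstsq returns four values; only the solution is re-wrapped here.
solution, residuals, rank, singular_values = result
x = ColouredArray(colour, solution)
print(x)
```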

huangapple
  • Posted on July 13, 2023, 00:10:27
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76672548.html