In Python, how can I create multiple dataclass instances with a different object instance in each field?


Question

I'm trying to write a parser and I'm missing something about how dataclasses work.
I'm trying to be as generic as possible and keep the logic in the parent class, but every child ends up with the same values.
I'm confused about what the dataclass decorator does with class variables and instance variables.
I should probably not use self.__dict__ in my __post_init__.

How would you get unique instances while keeping the same idea?

from dataclasses import dataclass

class VarSlice:
    def __init__(self, start, end):
        self.slice = slice(start, end)
        self.value = None

@dataclass
class RecordParser():
    line: str
    def __post_init__(self):
        for k, var in self.__dict__.items():
            if isinstance(var, VarSlice):
                self.__dict__[k].value = self.line[var.slice]

@dataclass
class HeaderRecord(RecordParser):
    sender : VarSlice = VarSlice(3, 8)


k = HeaderRecord(line="abcdefgh")
kk = HeaderRecord(line="123456789")
print(k.sender.value)
print(kk.sender.value)

Result:

45678
45678

Expected result:

defgh
45678

I tried changing VarSlice to a dataclass too but it changed nothing.

Answer 1

Score: 3

This behavior occurs because, when you write:

sender: VarSlice = VarSlice(3, 8)

The default value here is one specific instance, VarSlice(3, 8), which is created once when the class is defined and is then shared between all HeaderRecord instances.

This can be confirmed by printing the id of the VarSlice object. If the id is the same each time an instance of a RecordParser subclass is constructed, then the object is shared and we have a problem:

if isinstance(var, VarSlice):
    print(id(var))
    ...

This is very likely not what you want.
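
As a minimal, self-contained sketch of that check (the parsing logic is left out, and HeaderRecord is declared without a parent class purely to keep the example short), comparing the two sender objects with `is` shows that they are one and the same object:

```python
from dataclasses import dataclass


class VarSlice:
    def __init__(self, start, end):
        self.slice = slice(start, end)
        self.value = None


@dataclass
class HeaderRecord:
    line: str
    # A single VarSlice object is created here, once, at class-definition time
    sender: VarSlice = VarSlice(3, 8)


k = HeaderRecord(line="abcdefgh")
kk = HeaderRecord(line="123456789")

# Both instances refer to the very same default object
shared = k.sender is kk.sender
print(shared)
print(id(k.sender) == id(kk.sender))
```

Because the object is shared, whichever instance is constructed last overwrites `value` for all of them.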

The desired behavior is most likely to create a new VarSlice(3, 8) instance each time a new HeaderRecord object is instantiated.

To resolve the issue, I would suggest using default_factory instead of default, as this is the recommended (and documented) approach for fields with mutable default values.

i.e.,

sender: VarSlice = field(default_factory=lambda: VarSlice(3, 8))

instead of:

sender: VarSlice = VarSlice(3, 8)

The latter is technically equivalent to:

sender: VarSlice = field(default=VarSlice(3, 8))
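
To illustrate why this mistake goes unnoticed, here is a short sketch (the class names Bad and Good are made up for the example). The dataclasses module rejects a built-in container such as a list as a default value, but an instance of a plain class like VarSlice is not caught, so it is silently shared. With default_factory, each instance gets its own objects:

```python
from dataclasses import dataclass, field


class VarSlice:
    def __init__(self, start, end):
        self.slice = slice(start, end)
        self.value = None


# A list default is rejected when the class is defined
try:
    @dataclass
    class Bad:
        items: list = []
except ValueError as exc:
    rejected = True
    print(exc)
else:
    rejected = False


# default_factory is called once per instance, so nothing is shared
@dataclass
class Good:
    items: list = field(default_factory=list)
    sender: VarSlice = field(default_factory=lambda: VarSlice(3, 8))


a, b = Good(), Good()
a.items.append(1)
print(b.items)
print(a.sender is b.sender)
```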

Full code with example:

from dataclasses import dataclass, field


class VarSlice:
    def __init__(self, start, end):
        self.slice = slice(start, end)
        self.value = None


@dataclass
class RecordParser:
    line: str

    def __post_init__(self):
        for var in self.__dict__.values():
            if isinstance(var, VarSlice):
                var.value = self.line[var.slice]


@dataclass
class HeaderRecord(RecordParser):
    sender: VarSlice = field(default_factory=lambda: VarSlice(3, 8))


k = HeaderRecord(line="abcdefgh")
kk = HeaderRecord(line="123456789")
print(k.sender.value)
print(kk.sender.value)

Now prints:

defgh
45678

Improving Performance

Although this is clearly not a bottleneck, there is some room for improvement when creating many instances of a RecordParser subclass.

Reasons that performance could be (slightly) impacted:

  • A for loop runs on each instantiation to find the dataclass fields of type VarSlice; this loop could be avoided.
  • The __dict__ attribute on the instance is accessed each time, which can also be avoided. Note that using dataclasses.fields() instead is actually worse, as this value is not cached on a per-class basis.
  • An isinstance check is run on each dataclass field, each time a subclass is instantiated.

To address this, I would suggest statically generating a __post_init__() method for the subclass via dataclasses._create_fn() (or copying its logic to avoid depending on an "internal" function, which is not guaranteed to exist in every Python version), and setting it on the subclass before the @dataclass decorator runs for that subclass.

An easy way to do this is to use the __init_subclass__() hook, which runs when a class is subclassed, as shown below.

# to test when annotations are forward-declared (i.e. as strings)
# from __future__ import annotations

from collections import deque
# NOTE: `_create_fn` is a private helper and may be missing in some Python versions
from dataclasses import dataclass, field, _create_fn


class VarSlice:
    def __init__(self, start, end):
        self.slice = slice(start, end)
        self.value = None


@dataclass
class RecordParser:
    line: str

    def __init_subclass__(cls, **kwargs):
        super().__init_subclass__(**kwargs)
        # deque holding the (dynamically-generated) body lines of `__post_init__()`
        post_init_lines = deque()
        # loop over class annotations (this is a greatly "simplified"
        # version of how the `dataclasses` module does it)
        for name, tp in cls.__annotations__.items():
            if tp is VarSlice or (isinstance(tp, str) and tp == VarSlice.__name__):
                post_init_lines.append(f'var = self.{name}')
                post_init_lines.append('var.value = line[var.slice]')
        # if there are no dataclass fields of type `VarSlice`, we are done
        if post_init_lines:
            post_init_lines.appendleft('line = self.line')
            cls.__post_init__ = _create_fn('__post_init__', ('self', ), post_init_lines)


@dataclass
class HeaderRecord(RecordParser):
    sender: VarSlice = field(default_factory=lambda: VarSlice(3, 8))


k = HeaderRecord(line="abcdefgh")
kk = HeaderRecord(line="123456789")
print(k.sender.value)
print(kk.sender.value)
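
Since _create_fn() is internal to the dataclasses module, here is a sketch of the same idea that copies the logic instead of importing it. The helper name make_post_init is hypothetical (it is not part of any library); it builds the source of a loop-free __post_init__() and compiles it with exec(). Like the version above, it only looks at annotations declared directly on each subclass.

```python
from dataclasses import dataclass, field


class VarSlice:
    def __init__(self, start, end):
        self.slice = slice(start, end)
        self.value = None


def make_post_init(names):
    """Hypothetical helper: build a loop-free `__post_init__` for the given field names."""
    body = ["def __post_init__(self):", "    line = self.line"]
    for name in names:
        body.append(f"    var = self.{name}")
        body.append("    var.value = line[var.slice]")
    namespace = {}
    # Compile the generated source once, at class-creation time
    exec("\n".join(body), namespace)
    return namespace["__post_init__"]


@dataclass
class RecordParser:
    line: str

    def __init_subclass__(cls, **kwargs):
        super().__init_subclass__(**kwargs)
        # Collect the fields annotated as VarSlice (also when given as a string)
        annotations = getattr(cls, "__annotations__", {})
        names = [name for name, tp in annotations.items()
                 if tp is VarSlice or tp == "VarSlice"]
        if names:
            # Runs before @dataclass is applied to the subclass
            cls.__post_init__ = make_post_init(names)


@dataclass
class HeaderRecord(RecordParser):
    sender: VarSlice = field(default_factory=lambda: VarSlice(3, 8))


k = HeaderRecord(line="abcdefgh")
kk = HeaderRecord(line="123456789")
print(k.sender.value)   # defgh
print(kk.sender.value)  # 45678
```

The trade-off is the same as in the original approach: the work of finding VarSlice fields is done once per subclass rather than once per instance, at the cost of generating code dynamically.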

huangapple
  • Posted on March 21, 2023 at 02:35:10
  • When reposting, please keep the link to this article: https://go.coder-hub.com/75794069-2.html