Python Multiprocessing inside a class with inconsistent instance variable IDs

Question

I'm relatively inexperienced with Python multiprocessing programming and struggling to understand what's happening here. I'm intentionally using object-oriented programming because the codebase I'm developing uses OOP principles as part of a large MVC design. The code is running on Ubuntu Linux.

The code does the following:

  1. Instantiates an instance of CallingClass.
  2. The __init__ of CallingClass calls the constructor of its parent, ParentClass, and passes it a list of three strings.
  3. Loops 3 times. On each iteration:
  4. The method doTheHardWork() is called.
  5. A process pool is set up. The target is self.parallelMethod, and I pass in a list of strings (A, B, C, ...). I'm not interested in this argument yet, but I will be later.
  6. Each process prints the id of the parent attribute self.parentStuffList.

    import multiprocessing

    class ParentClass:

        def __init__(self, parentStuffList_X):
            self.parentStuffList = parentStuffList_X

    class CallingClass(ParentClass):

        def __init__(self):
            ParentClass.__init__(self, ["foo", "bar", "oh dear!"])

        def parallelMethod(self, stuffPassed):
            # let's explore the data in the parent
            # is the parent list different for every process?
            print(str(id(self.parentStuffList)))

        def doTheHardWork(self):
            stuffToPassList = ["A", "B", "C", "D", "E", "F", "G", "H"]  # not bothered by you, yet!

            pool = multiprocessing.Pool(4)
            for _ in pool.map(self.parallelMethod, stuffToPassList):
                pass

    if __name__ == '__main__':
        callingClass = CallingClass()

        # Call this multiple times
        for i in range(3):
            callingClass.doTheHardWork()
            print("..............................")

I would be very grateful if you could help me understand:

  1. Why the first iteration's processes yield several different IDs, yet the subsequent iterations all print the same ID. According to a comment, the start method (fork or spawn) changes how uniform the ID values are.
  2. Given that multiprocessing spawns separate Python interpreters, does this mean there will be N instances of the data inherited through the parent class?

I'm struggling to visualise the interplay of OOP and multiprocessing.

140254041045760
140254041560896
140254041561408
140254041045760
140254041560896
140254041045760
140254041560896
140254041561920

..............................
140254040957952
140254040957952
140254040957952
140254040957952
140254040957952
140254040957952
140254040957952
140254040957952
..............................
140254040957952
140254040957952
140254040957952
140254040957952
140254040957952
140254040957952
140254040957952
140254040957952
..............................

Answer 1

Score: 2

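First, in CPython, the id function does not return a "universally unique instance ID"; it returns the address of the object in memory. That value is only guaranteed to be unique among objects that exist at the same time in the same process. Each process has its own address space, so the same number printed by two different processes does not refer to the same object.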
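To make the point about id concrete, here is a minimal single-process sketch. The variable names are invented for this illustration and are not part of the question's code. It only shows the guarantees id gives: two objects that are alive at the same time have different ids, and an object keeps its id for its whole lifetime. Whether a freed address is reused is up to the allocator, so the last line may print either True or False.

```python
first = ["foo", "bar", "oh dear!"]
second = ["foo", "bar", "oh dear!"]

# Equal contents, but both lists are alive at once, so their ids differ.
contents_equal = first == second
ids_differ = id(first) != id(second)
print(contents_equal, ids_differ)

# An object keeps the same id for as long as it lives, even when mutated.
first_id = id(first)
first.append("more")
print(id(first) == first_id)

# After an object is freed, a new object MAY be placed at the same address.
# This is allowed but not guaranteed, so no particular result is promised.
second_id = id(second)
del second
third = ["x", "y", "z"]
print(id(third) == second_id)
```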

Because parallelMethod runs for only a very short time, the processing of one element is finished and its objects are cleaned up before the processing of the next element begins. The next copy of the list may therefore be placed at the same address, which is why your code prints the same ID. Note that IDs are reused in the first iteration as well:

140254041045760 <-- A
140254041560896 <-- B
140254041561408
140254041045760 <-- A
140254041560896 <-- B
140254041045760 <-- A
140254041560896 <-- B
140254041561920

The difference in behavior between the first iteration and the subsequent iterations may be caused by timing differences in memory allocation, caching, and so on.
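Some background on why a new list shows up for each element at all: multiprocessing.Pool sends tasks to its worker processes by pickling them, and a pickled object is rebuilt as a new object on the receiving side. The sketch below imitates that round trip inside a single process with the pickle module. It is a simplified stand-in for what the pool does, not the pool's actual code path, and the name `received` is made up for the example.

```python
import pickle

parentStuffList = ["foo", "bar", "oh dear!"]

# Serialize and rebuild the list, roughly what happens when data is
# shipped to a worker process.
received = pickle.loads(pickle.dumps(parentStuffList))

print(received == parentStuffList)  # True: same contents
print(received is parentStuffList)  # False: a separate object

# Changing the rebuilt copy leaves the original untouched.
received.append("changed in worker")
print(parentStuffList)
```

Since every round trip builds a fresh list and the previous one is released afterwards, the allocator is free to hand out the same address again.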

To test this hypothesis, we can make parallelMethod run long enough, as follows (this needs import time at the top of the file):

    def parallelMethod(self, stuffPassed):
        # let's explore the data in the parent
        # is the parent list different for every process?
        print(str(id(self.parentStuffList)))
        time.sleep(1)  # <-- Simulate a long execution time.

The result is as follows:

2237529526912 <-- A
2829585066624 <-- B
1134667074496 <-- C
1799569073216 <-- D
2829585066624 <-- B
2237529526912 <-- A
1134667074496 <-- C
1799569073216 <-- D
..............................
2009911254528 <-- E
2715173195392 <-- F
1693210775296 <-- G
2660560565632 <-- H
2715173195392 <-- F
2009911254528 <-- E
1693210775296 <-- G
2660560565632 <-- H
..............................
1904427212864 <-- I
2102102948928 <-- J
2602047708288 <-- K
2153079620480 <-- L
1904427212864 <-- I
2102102948928 <-- J
2602047708288 <-- K
2153079620480 <-- L
..............................

As you can see, in each iteration the first 4 IDs are unique, but the last 4 IDs are reused. This is because by the time the processing of the last 4 elements begins, the processing of the first 4 elements has already finished and been cleaned up, which is how a pool is supposed to work. If you change the number of processes in the pool to 8, you will see that all IDs are unique.
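To see which worker produced which ID, one option is to return the process ID together with the list's id instead of printing it. The following is a sketch based on the question's classes; returning a tuple, the shorter 0.2 second sleep, and the `with` block around the pool are changes made for this illustration. The actual numbers depend on the platform and the start method, so no output is shown.

```python
import multiprocessing
import os
import time

class ParentClass:
    def __init__(self, parentStuffList_X):
        self.parentStuffList = parentStuffList_X

class CallingClass(ParentClass):
    def __init__(self):
        ParentClass.__init__(self, ["foo", "bar", "oh dear!"])

    def parallelMethod(self, stuffPassed):
        time.sleep(0.2)  # keep the worker busy so tasks spread over the pool
        # Report which process ran this call and where its copy of the list lives.
        return os.getpid(), id(self.parentStuffList)

    def doTheHardWork(self):
        stuffToPassList = ["A", "B", "C", "D", "E", "F", "G", "H"]
        # The with block shuts the pool down once the results are collected.
        with multiprocessing.Pool(4) as pool:
            return pool.map(self.parallelMethod, stuffToPassList)

if __name__ == "__main__":
    callingClass = CallingClass()
    for pid, list_id in callingClass.doTheHardWork():
        print(f"pid={pid} id={list_id}")
```

Two lines with the same id but different pids describe two separate lists that merely sit at the same virtual address in different processes.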

As for the second question:

> Given that multiprocessing spawns separate Python interpreters, does this mean there will be N instances of the data inherited through the parent class?

Yes, your understanding is correct. As described above, instances can end up at the same address, but never at the same time within one process, so they are different instances. Note, however, that with Pool(N) there will be N+1 instances, including the one in the main process.
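A practical consequence of having separate instances is that changes made inside a worker never reach the instance in the main process. The sketch below assumes a modified parallelMethod that appends to the list, and a helper `run_demo` invented for this example; neither is in the original code.

```python
import multiprocessing

class ParentClass:
    def __init__(self, parentStuffList_X):
        self.parentStuffList = parentStuffList_X

class CallingClass(ParentClass):
    def __init__(self):
        ParentClass.__init__(self, ["foo", "bar", "oh dear!"])

    def parallelMethod(self, stuffPassed):
        # This modifies only the copy that lives in the current process.
        self.parentStuffList.append(stuffPassed)
        return len(self.parentStuffList)

def run_demo():
    """Hypothetical helper: run the method in a pool, then inspect the parent."""
    callingClass = CallingClass()
    with multiprocessing.Pool(4) as pool:
        worker_lengths = pool.map(callingClass.parallelMethod, ["A", "B", "C", "D"])
    return worker_lengths, callingClass.parentStuffList

if __name__ == "__main__":
    worker_lengths, parent_list = run_demo()
    print(worker_lengths)  # lengths seen inside the workers
    print(parent_list)     # the main process's list is unchanged
```

If worker state needs to flow back to the main process, return it from the method, as `worker_lengths` does here, rather than relying on attributes of self.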

huangapple
  • Published on June 13, 2023, 05:46:27
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76460519.html