2023-06-29 19:20:03

Can looping over object instantiations cause a memory leak in Python?

Question

I'm running an agent-based model in Python 3.9 using object-oriented programming. The point of the model is to simulate a predator-prey population in a changing landscape. When I run multiple simulations in a for-loop, the runtime of a single simulation increases with each run. I suspect there is some sort of memory leak, but I'm not able to figure it out.

Here is a sketch of my code:

# Parameters
n_deers = ...
n_wolves = ...
# etc.
# Functions
def some_function(arg):
    pass
# Helper objects
some_dict = ...
# Classes
class Deer:
    pass
class Wolf:
    pass
class Environment:
    def __init__(self):
        self.deers = [Deer(ID = i) for i in range(n_deers)]
        self.wolves = [Wolf(ID = i) for i in range(n_wolves)]

        self.data = pd.DataFrame()
    def simulation(self):
        pass
# Simulations
for i in range(100):
    environment = Environment()
    environment.simulation()
    environment.data.to_csv()

In words: I have global parameters, global functions, and a global dictionary that the class instances use. There is a class for each type of animal, and there is a class for the environment that generates a certain number of each animal inside the environment. The environment tracks these animals in a data frame during one run of simulation, in which the animals move, feed, reproduce, die, etc.

My fear is that the animal instances (around 7000 animals per full-length simulation) are somehow being dragged along in memory. I don't have static class variables, which this article warns about: https://theorangeone.net/posts/static-vars/. But of course, this could be anything.

Do you have an idea what could be causing this? Any help is greatly appreciated.

EDIT

I have been able (it seems) to isolate the problem. It seems to originate from the animal movement. Here is a minimal reproducible example. As explanation: If I have the animals choose their next position at random from the adjacent cells, the problem does not seem to occur. Once I add memory, home ranges, and the function cell_choice(), the simulations take longer over time. On my machine, with this parametrization, the first simulation takes between 3 and 4 seconds, and the last between 10 and 11.

# MINIMAL MOVEMENT MODEL
# IMPORTS
import random as rd
import numpy as np
import time
import psutil
# REPRODUCIBILITY
rd.seed(42)
# PARAMETERS
landscape_size = 11
n_deers = 100
years = 10
length_year = 360
timesteps = years*length_year
n_simulations = 20
# HELPER FUNCTIONS AND OBJECTS
# Landscape for first initialization
mock_landscape = np.zeros((landscape_size,landscape_size))
# Function to return a list of nxn cells around a given cell
def range_finder(matrix, position, radius):
    adj = []
    
    lower = 0 - radius
    upper = 1 + radius
    
    for dx in range(lower, upper):
        for dy in range(lower, upper):
            rangeX = range(0, matrix.shape[0])  # Identifies X bounds
            rangeY = range(0, matrix.shape[1])  # Identifies Y bounds
            
            (newX, newY) = (position[0]+dx, position[1]+dy)  # Identifies adjacent cell
            
            if (newX in rangeX) and (newY in rangeY) and (dx, dy) != (0, 0):
                adj.append((newX, newY))
    
    return adj
# Nested dictionary that contains all sets of neighbors for all possible distances up to half the landscape size
neighbor_dict = {d: {(i,j): range_finder(mock_landscape, (i,j), d)
                     for i in range(landscape_size) for j in range(landscape_size)}
                 for d in range(1,int(landscape_size/2)+1)}
# Function that picks the cell in the home range that was visited longest ago
def cell_choice(position, home_range, memory):
     # These are all the adjacent cells to the current position
     adjacent_cells = neighbor_dict[1][position]
     # This is the subset of cells of the adjacent cells belonging to homerange
     possible_choices = [i for i in adjacent_cells if i in home_range]
     # This yields the "master" indeces of those choices
     indeces = []
     for i in possible_choices:
         indeces.append(home_range.index(i))
     # This picks the index with the maximum value in the memory (ie visited longest ago)
     memory_values = [memory[i] for i in indeces]
     pick_index = indeces[memory_values.index(max(memory_values))]
     # Sets that values memory to zero
     memory[pick_index] = 0
     # # Adds one period to every other index
     other_indeces = [i for i in list(range(len(memory))) if i != pick_index]
     for i in other_indeces:
         memory[i] += 1
     # Returns the picked cell
     return home_range[pick_index]
# CLASS DEFINITIONS
class Deer:
    
    def __init__(self, ID):
        
        self.ID = ID
        self.position = (rd.randint(0,landscape_size-1),rd.randint(0,landscape_size-1))
        # Sets up a counter how long the deer has been in the cell
        self.time_spent_in_cell = 1
        
        # Defines a distance parameter that specifies the radius of the homerange around the base
        self.movement_radius = 1
        
        # Defines an initial home range around the position
        self.home_range = neighbor_dict[self.movement_radius][self.position]
        self.home_range.append(self.position)
        
        # Sets up a list of counters how long ago cells in the home range have been visited
        self.memory = [float('inf')]*len(self.home_range)
        self.memory[self.home_range.index(self.position)] = 0
    def move(self):
        self.position = cell_choice(self.position, self.home_range, self.memory)
class Environment:
    
    def __init__(self):
        self.landscape = np.zeros((landscape_size, landscape_size))
        self.deers = [Deer(ID = i) for i in range(n_deers)]
    def simulation(self):
        for timestep in range(timesteps):
            for deer in self.deers:
                deer.move()
                
# SIMULATIONS
process = psutil.Process()
times = []
memory = []
for i in range(1,n_simulations+1):
    print(i, " out of ",n_simulations)
    start_time = time.time()
    environment = Environment()
    environment.simulation()
    times.append(time.time() - start_time)
    memory.append(process.memory_info().rss)
    
print(times)
print(memory)

Answer 1

Score: 2

These lines in the constructor of Deer will be problematic:

self.home_range = neighbor_dict[self.movement_radius][self.position]
self.home_range.append(self.position)

The first line makes the name self.home_range refer to a list object in an inner dictionary of neighbor_dict (a list object originally returned from calling the range_finder function).

Then the second line mutates that list. This means that subsequent retrievals from neighbor_dict will get the latest version of that mutated list, not the value originally returned by range_finder.
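The aliasing described above can be reproduced in a few lines. This is only a minimal sketch: the dictionary `neighbors` is a hypothetical stand-in for one inner dictionary of neighbor_dict, not the asker's actual data.

```python
# Hypothetical stand-in for one entry of neighbor_dict[1]
neighbors = {(0, 0): [(0, 1), (1, 0), (1, 1)]}

home_range = neighbors[(0, 0)]  # binds a second name to the same list object
home_range.append((0, 0))       # mutates the list stored in the dictionary

print(home_range is neighbors[(0, 0)])  # True: one object, two names
print(len(neighbors[(0, 0)]))           # 4, not the original 3
```

Assignment never copies in Python, so the dictionary and the attribute end up sharing one list.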

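Because every Deer created at a given position appends to the same shared list, these lists keep growing from one simulation to the next. The linear scans in cell_choice (i in home_range, home_range.index(i), and the loop over memory) then take longer on every run, which likely explains the slowdown. The mutation also makes your simulation results incorrect, since the neighbor lists no longer match what range_finder originally returned.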
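One way to confirm the growth is to count the entries stored in the neighbor dictionary after each run. The sketch below uses a hypothetical 3x3 landscape and simplified helpers (`build_neighbors`, `total_entries`, `run`) that are not part of the original code; it only mimics the two lines of the Deer constructor.

```python
import random

random.seed(0)
size = 3  # hypothetical tiny landscape, chosen to keep the numbers small


def build_neighbors():
    """Map each cell to its in-bounds radius-1 neighbors, excluding itself."""
    return {
        (x, y): [(x + dx, y + dy)
                 for dx in (-1, 0, 1) for dy in (-1, 0, 1)
                 if (dx, dy) != (0, 0)
                 and 0 <= x + dx < size and 0 <= y + dy < size]
        for x in range(size) for y in range(size)
    }


def total_entries(neighbors):
    """Total number of cells stored across all neighbor lists."""
    return sum(len(cells) for cells in neighbors.values())


def run(neighbors, n_runs, n_animals, copy_list):
    """Mimic the constructor's two lines and record the total after each run."""
    totals = []
    for _ in range(n_runs):
        for _ in range(n_animals):
            pos = (random.randrange(size), random.randrange(size))
            home_range = neighbors[pos].copy() if copy_list else neighbors[pos]
            home_range.append(pos)
        totals.append(total_entries(neighbors))
    return totals


shared_totals = run(build_neighbors(), n_runs=3, n_animals=10, copy_list=False)
copied_totals = run(build_neighbors(), n_runs=3, n_animals=10, copy_list=True)
print(shared_totals)  # [50, 60, 70]: one extra entry per animal created
print(copied_totals)  # [40, 40, 40]: the dictionary is left untouched
```

In the original script, the same check could be done by summing len() over neighbor_dict[1].values() at the end of each simulation.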

You should be able to fix this by making self.home_range refer to a copy of the list. One way to do that is:

self.home_range = neighbor_dict[self.movement_radius][self.position].copy()

There are some alternative syntactic choices for that if you prefer. See "How do I clone a list so that it doesn't change unexpectedly after assignment?".
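For illustration, here are a few common ways to make a shallow copy of a list. The list `original` is a made-up example; a shallow copy should be enough here because the elements are immutable tuples.

```python
import copy

original = [(0, 1), (1, 0)]

copies = [
    original.copy(),      # list method
    list(original),       # list constructor
    original[:],          # full slice
    copy.copy(original),  # generic shallow copy from the copy module
]

# Mutating each copy leaves the original list untouched
for c in copies:
    c.append((9, 9))

print(original)  # [(0, 1), (1, 0)]
```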

For a summary of how names refer to objects in Python, see also Ned Batchelder's "Facts and myths about Python names and values".
