Can looping over object instantiations cause a memory leak in Python?

Question

I'm running an agent-based model in Python 3.9 using object-oriented programming. The point of the model is to simulate a predator-prey population in a changing landscape. When I run multiple simulations in a for-loop, the runtime of a single simulation increases with each run. I suspect there is some sort of memory leak, but I'm not able to figure it out.

Here is a sketch of my code:

# Parameters
n_deers = ...
n_wolves = ...
# etc.

# Functions
def some_function(arg):
    pass 

# Helper objects
some_dict = ...

# Classes
class Deer:
    pass

class Wolf:
    pass

class Environment:

    def __init__(self):
        self.deers = [Deer(ID = i) for i in range(n_deers)]
        self.wolves = [Wolf(ID = i) for i in range(n_wolves)]

        self.data = pd.DataFrame()

    def simulation(self):
        pass

# Simulations
for i in range(100):
    environment = Environment()
    environment.simulation()
    environment.data.to_csv()

In words: I have global parameters, global functions, and a global dictionary that the class instances use. There is a class for each type of animal, and there is a class for the environment that generates a certain number of each animal inside the environment. The environment tracks these animals in a data frame during one run of simulation, in which the animals move, feed, reproduce, die, etc.

My fear is that somehow the instances of the animals (around 7000 animals per full-length simulation) are being dragged along in memory. I don't have static class variables of the kind this article warns about: https://theorangeone.net/posts/static-vars/. But of course, this could be anything.

Do you have an idea what could be causing this? Any help is greatly appreciated.

EDIT

I have been able (it seems) to isolate the problem. It seems to originate from the animal movement. Here is a minimal reproducible example. By way of explanation: if I have the animals choose their next position at random from the adjacent cells, the problem does not seem to occur. Once I add memory, home ranges, and the function cell_choice(), the simulations take longer over time. On my machine, with this parametrization, the first simulation takes between 3 and 4 seconds, and the last between 10 and 11.

# MINIMAL MOVEMENT MODEL

# IMPORTS
import random as rd
import numpy as np
import time
import psutil

# REPRODUCIBILITY
rd.seed(42)

# PARAMETERS
landscape_size = 11
n_deers = 100
years = 10
length_year = 360
timesteps = years*length_year
n_simulations = 20

# HELPER FUNCTIONS AND OBJECTS
# Landscape for first initialization
mock_landscape = np.zeros((landscape_size,landscape_size))

# Function to return a list of nxn cells around a given cell
def range_finder(matrix, position, radius):
    adj = []
    
    lower = 0 - radius
    upper = 1 + radius
    
    for dx in range(lower, upper):
        for dy in range(lower, upper):
            rangeX = range(0, matrix.shape[0])  # Identifies X bounds
            rangeY = range(0, matrix.shape[1])  # Identifies Y bounds
            
            (newX, newY) = (position[0]+dx, position[1]+dy)  # Identifies adjacent cell
            
            if (newX in rangeX) and (newY in rangeY) and (dx, dy) != (0, 0):
                adj.append((newX, newY))
    
    return adj

# Nested dictionary that contains all sets of neighbors for all possible distances up to half the landscape size
neighbor_dict = {d: {(i,j): range_finder(mock_landscape, (i,j), d)
                     for i in range(landscape_size) for j in range(landscape_size)}
                 for d in range(1,int(landscape_size/2)+1)}

# Function that picks the cell in the home range that was visited longest ago
def cell_choice(position, home_range, memory):
     # These are all the adjacent cells to the current position
     adjacent_cells = neighbor_dict[1][position]
     # This is the subset of cells of the adjacent cells belonging to homerange
     possible_choices = [i for i in adjacent_cells if i in home_range]
     # This yields the "master" indeces of those choices
     indeces = []
     for i in possible_choices:
         indeces.append(home_range.index(i))
     # This picks the index with the maximum value in the memory (ie visited longest ago)
     memory_values = [memory[i] for i in indeces]
     pick_index = indeces[memory_values.index(max(memory_values))]
     # Resets the memory value of the picked cell to zero
     memory[pick_index] = 0
     # Adds one period to every other index
     other_indeces = [i for i in list(range(len(memory))) if i != pick_index]
     for i in other_indeces:
         memory[i] += 1
     # Returns the picked cell
     return home_range[pick_index]

# CLASS DEFINITIONS
class Deer:
    
    def __init__(self, ID):
        
        self.ID = ID
        self.position = (rd.randint(0,landscape_size-1),rd.randint(0,landscape_size-1))
        # Sets up a counter how long the deer has been in the cell
        self.time_spent_in_cell = 1
        
        # Defines a distance parameter that specifies the radius of the homerange around the base
        self.movement_radius = 1
        
        # Defines an initial home range around the position
        self.home_range = neighbor_dict[self.movement_radius][self.position]
        self.home_range.append(self.position)
        
        # Sets up a list of counters how long ago cells in the home range have been visited
        self.memory = [float('inf')]*len(self.home_range)
        self.memory[self.home_range.index(self.position)] = 0

    def move(self):
        self.position = cell_choice(self.position, self.home_range, self.memory)

class Environment:
    
    def __init__(self):
        self.landscape = np.zeros((landscape_size, landscape_size))
        self.deers = [Deer(ID = i) for i in range(n_deers)]

    def simulation(self):
        for timestep in range(timesteps):
            for deer in self.deers:
                deer.move()
                
# SIMULATIONS

process = psutil.Process()

times = []
memory = []

for i in range(1,n_simulations+1):
    print(i, " out of ",n_simulations)
    start_time = time.time()
    environment = Environment()
    environment.simulation()
    times.append(time.time() - start_time)
    memory.append(process.memory_info().rss)
    
print(times)
print(memory)

Answer 1

Score: 2

These lines in the constructor of Deer will be problematic:

self.home_range = neighbor_dict[self.movement_radius][self.position]
self.home_range.append(self.position)

The first line makes the name self.home_range refer to a list object in an inner dictionary of neighbor_dict (a list object originally returned from calling the range_finder function).

Then the second line mutates that list. This means that subsequent retrievals from neighbor_dict will get the latest version of that mutated list, not the value originally returned by range_finder.
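The aliasing described above can be reproduced in a few lines. This is only a sketch: the dictionary `neighbors` and the function `make_home_range` are hypothetical stand-ins for neighbor_dict and the two lines in Deer.__init__, reduced to a single cell.

```python
# Hypothetical miniature of neighbor_dict: one radius, one cell.
neighbors = {1: {(0, 0): [(0, 1), (1, 0), (1, 1)]}}


def make_home_range(position):
    # Mirrors the two problematic lines: no copy is made here.
    home_range = neighbors[1][position]
    home_range.append(position)
    return home_range


first = make_home_range((0, 0))   # "first deer" on cell (0, 0)
second = make_home_range((0, 0))  # "second deer" on the same cell

# Both names refer to the very same list object stored in the dictionary.
shared = first is second is neighbors[1][(0, 0)]
length_after_two_deer = len(neighbors[1][(0, 0)])

print(shared)
print(length_after_two_deer)
```

The list stored in the dictionary started with three entries and now has five, because each call appended the position to the shared object.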

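Because neighbor_dict is global, these lists keep growing across simulations: every deer that starts on a given cell appends that cell to the shared list again. This will likely cause the slowdown, since the `i in home_range` test, home_range.index(i), and the loop over memory in cell_choice all scale with the length of the list. It also makes your simulation results incorrect, because a home range then contains duplicate entries left over from earlier deer and earlier simulations.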
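One way to check for this kind of growth is to sum the lengths of the stored lists after each run. The sketch below assumes a tiny 3x3 landscape and places one deer on every cell per run instead of using random positions. The names `build_neighbors` and `total_length_per_run` are made up for illustration.

```python
from itertools import product

size = 3  # hypothetical tiny landscape
cells = list(product(range(size), repeat=2))


def build_neighbors():
    # Radius-1 neighbours of every cell, excluding the cell itself.
    return {
        (x, y): [
            (x + dx, y + dy)
            for dx in (-1, 0, 1)
            for dy in (-1, 0, 1)
            if (dx, dy) != (0, 0)
            and 0 <= x + dx < size
            and 0 <= y + dy < size
        ]
        for (x, y) in cells
    }


def total_length_per_run(copy_first, runs=3):
    neighbors = build_neighbors()
    totals = []
    for _ in range(runs):
        for position in cells:  # one deer per cell in every run
            home_range = neighbors[position]
            if copy_first:
                home_range = home_range.copy()
            home_range.append(position)
        # Total number of entries stored in the shared dictionary.
        totals.append(sum(len(v) for v in neighbors.values()))
    return totals


leaky = total_length_per_run(copy_first=False)
fixed = total_length_per_run(copy_first=True)
print(leaky)
print(fixed)
```

Without the copy, the total grows by nine entries per run (one per cell). With the copy it stays constant at 40. Applied to the real neighbor_dict, the same kind of sum would show whether the stored lists change between simulations.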

You should be able to fix this by making self.home_range refer to a copy of the list. One way to do that is:

self.home_range = neighbor_dict[self.movement_radius][self.position].copy()

There are some alternative syntactic choices for that if you prefer. See "How do I clone a list so that it doesn't change unexpectedly after assignment?".
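For reference, three common ways to make a shallow copy of a list are sketched below, using a made-up list `original`. A shallow copy is enough in this case because the elements are immutable tuples.

```python
original = [(0, 1), (1, 0)]

by_method = original.copy()      # list.copy()
by_constructor = list(original)  # list() constructor
by_slice = original[:]           # full slice

# Mutating the copies leaves the original untouched.
for clone in (by_method, by_constructor, by_slice):
    clone.append((9, 9))

print(original)
print(by_method)
```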

For a summary of how names refer to objects in Python, see also Ned Batchelder's "Facts and myths about Python names and values".

huangapple
  • Posted on June 29, 2023, 19:20:03
  • Please keep this link when reposting: https://go.coder-hub.com/76580544.html