Can looping over object instantiations cause a memory leak in Python?
Question
I'm running an agent-based model in Python 3.9 using object-oriented programming. The point of the model is to simulate a predator-prey population in a changing landscape. When I run multiple simulations in a for-loop, the runtime of a single simulation increases with each run. I suspect there is some sort of memory leak, but I haven't been able to track it down.
Here is a sketch of my code:
# Parameters
n_deers = ...
n_wolves = ...
# etc.

# Functions
def some_function(arg):
    pass

# Helper objects
some_dict = ...

# Classes
class Deer:
    pass

class Wolf:
    pass

class Environment:
    def __init__(self):
        self.deers = [Deer(ID = i) for i in range(n_deers)]
        self.wolves = [Wolf(ID = i) for i in range(n_wolves)]
        self.data = pd.DataFrame()

    def simulation(self):
        pass

# Simulations
for i in range(100):
    environment = Environment()
    environment.simulation()
    environment.data.to_csv()
In words: I have global parameters, global functions, and a global dictionary that the class instances use. There is a class for each type of animal, and there is a class for the environment that generates a certain number of each animal inside the environment. The environment tracks these animals in a data frame during one run of simulation, in which the animals move, feed, reproduce, die, etc.
My fear is that the animal instances (around 7000 animals per full-length simulation) are somehow being dragged along in memory. I don't have static class variables of the kind this article warns about: https://theorangeone.net/posts/static-vars/. But of course, this could be anything.
Do you have an idea what could be causing this? Any help is greatly appreciated.
EDIT
I have been able (it seems) to isolate the problem. It seems to originate from the animal movement. Here is a minimal reproducible example. As explanation: If I have the animals choose their next position at random from the adjacent cells, the problem does not seem to occur. Once I add memory, home ranges, and the function cell_choice(), the simulations take longer over time. On my machine, with this parametrization, the first simulation takes between 3 and 4 seconds, and the last between 10 and 11.
# MINIMAL MOVEMENT MODEL

# IMPORTS
import random as rd
import numpy as np
import time
import psutil

# REPRODUCIBILITY
rd.seed(42)

# PARAMETERS
landscape_size = 11
n_deers = 100
years = 10
length_year = 360
timesteps = years*length_year
n_simulations = 20

# HELPER FUNCTIONS AND OBJECTS
# Landscape for first initialization
mock_landscape = np.zeros((landscape_size,landscape_size))

# Function to return a list of nxn cells around a given cell
def range_finder(matrix, position, radius):
    adj = []
    lower = 0 - radius
    upper = 1 + radius
    for dx in range(lower, upper):
        for dy in range(lower, upper):
            rangeX = range(0, matrix.shape[0])  # Identifies X bounds
            rangeY = range(0, matrix.shape[1])  # Identifies Y bounds
            (newX, newY) = (position[0]+dx, position[1]+dy)  # Identifies adjacent cell
            if (newX in rangeX) and (newY in rangeY) and (dx, dy) != (0, 0):
                adj.append((newX, newY))
    return adj

# Nested dictionary that contains all sets of neighbors for all possible distances up to half the landscape size
neighbor_dict = {d: {(i,j): range_finder(mock_landscape, (i,j), d)
                     for i in range(landscape_size) for j in range(landscape_size)}
                 for d in range(1,int(landscape_size/2)+1)}

# Function that picks the cell in the home range that was visited longest ago
def cell_choice(position, home_range, memory):
    # These are all the adjacent cells to the current position
    adjacent_cells = neighbor_dict[1][position]
    # This is the subset of cells of the adjacent cells belonging to homerange
    possible_choices = [i for i in adjacent_cells if i in home_range]
    # This yields the "master" indeces of those choices
    indeces = []
    for i in possible_choices:
        indeces.append(home_range.index(i))
    # This picks the index with the maximum value in the memory (ie visited longest ago)
    memory_values = [memory[i] for i in indeces]
    pick_index = indeces[memory_values.index(max(memory_values))]
    # Sets that values memory to zero
    memory[pick_index] = 0
    # Adds one period to every other index
    other_indeces = [i for i in list(range(len(memory))) if i != pick_index]
    for i in other_indeces:
        memory[i] += 1
    # Returns the picked cell
    return home_range[pick_index]

# CLASS DEFINITIONS
class Deer:
    def __init__(self, ID):
        self.ID = ID
        self.position = (rd.randint(0,landscape_size-1),rd.randint(0,landscape_size-1))
        # Sets up a counter how long the deer has been in the cell
        self.time_spent_in_cell = 1
        # Defines a distance parameter that specifies the radius of the homerange around the base
        self.movement_radius = 1
        # Defines an initial home range around the position
        self.home_range = neighbor_dict[self.movement_radius][self.position]
        self.home_range.append(self.position)
        # Sets up a list of counters how long ago cells in the home range have been visited
        self.memory = [float('inf')]*len(self.home_range)
        self.memory[self.home_range.index(self.position)] = 0

    def move(self):
        self.position = cell_choice(self.position, self.home_range, self.memory)

class Environment:
    def __init__(self):
        self.landscape = np.zeros((landscape_size, landscape_size))
        self.deers = [Deer(ID = i) for i in range(n_deers)]

    def simulation(self):
        for timestep in range(timesteps):
            for deer in self.deers:
                deer.move()

# SIMULATIONS
process = psutil.Process()
times = []
memory = []
for i in range(1,n_simulations+1):
    print(i, " out of ",n_simulations)
    start_time = time.time()
    environment = Environment()
    environment.simulation()
    times.append(time.time() - start_time)
    memory.append(process.memory_info().rss)
print(times)
print(memory)
Answer 1
Score: 2
These lines in the constructor of Deer will be problematic:

self.home_range = neighbor_dict[self.movement_radius][self.position]
self.home_range.append(self.position)
The first line makes the name self.home_range refer to a list object in an inner dictionary of neighbor_dict (a list object originally returned from calling the range_finder function).
Then the second line mutates that list. This means that subsequent retrievals from neighbor_dict will get the latest version of that mutated list, not the value originally returned by range_finder.
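Because these lists grow with every new Deer, the linear scans over them in cell_choice (the membership tests and the calls to index) will likely get slower, and the extra entries also make your simulation results incorrect.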
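The aliasing described above can be reproduced in a few lines. This is only a minimal sketch: lookup and Animal are hypothetical stand-ins for neighbor_dict and Deer, reduced to a single cell, and are not taken from the question's code.

```python
# Hypothetical miniature of the lookup table: one cell with two neighbors.
lookup = {(0, 0): [(0, 1), (1, 0)]}


class Animal:
    def __init__(self, position):
        self.position = position
        # No copy is made: this name refers to the very list stored in `lookup`.
        self.home_range = lookup[position]
        # Appending therefore mutates the shared list inside `lookup`.
        self.home_range.append(position)


sizes = []
for _ in range(3):
    animal = Animal((0, 0))
    # Record how long the shared list is after each instantiation.
    sizes.append(len(lookup[(0, 0)]))

print(sizes)                                # [3, 4, 5]
print(animal.home_range is lookup[(0, 0)])  # True
```

Each new instance leaves one more entry behind in the shared table, which matches the pattern of every simulation run being slower than the one before.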
You should be able to fix this by making self.home_range refer to a copy of the list. One way to do that is:

self.home_range = neighbor_dict[self.movement_radius][self.position].copy()
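To check that a copy really leaves the shared table untouched, here is a small sketch under the same assumptions as before (lookup and Animal are hypothetical stand-ins for neighbor_dict and Deer, not the question's actual code):

```python
# Hypothetical miniature of the lookup table: one cell with two neighbors.
lookup = {(0, 0): [(0, 1), (1, 0)]}


class Animal:
    def __init__(self, position):
        self.position = position
        # A shallow copy gives each animal its own list object.
        self.home_range = lookup[position].copy()
        # This append only affects the animal's private copy.
        self.home_range.append(position)


animals = [Animal((0, 0)) for _ in range(3)]

print(len(lookup[(0, 0)]))                   # 2
print([len(a.home_range) for a in animals])  # [3, 3, 3]
```

A shallow copy should be enough here because the list elements are tuples, which are immutable.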
There are some alternative syntactic choices for that if you prefer. See How do I clone a list so that it doesn't change unexpectedly after assignment?.
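As a rough illustration of those alternatives, the sketch below makes four shallow copies of a list and shows that mutating them leaves the original alone. The variable names are made up for the example.

```python
import copy

original = [(0, 1), (1, 0)]

# Four equivalent ways to make a shallow copy of a list.
clones = [
    original.copy(),      # list method
    list(original),       # list constructor
    original[:],          # full slice
    copy.copy(original),  # generic shallow copy from the standard library
]

# Mutating each clone does not touch the original list.
for clone in clones:
    clone.append((9, 9))

print(original)                  # [(0, 1), (1, 0)]
print([len(c) for c in clones])  # [3, 3, 3, 3]
```

All four are shallow copies, so they would not protect nested mutable elements; for that case copy.deepcopy exists.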
For a summary of how names refer to objects in Python, see also Ned Batchelder's "Facts and myths about Python names and values".