pandas series to_json memory leak
Question
My production service's memory was constantly increasing, and I think the root cause is pandas.Series.to_json.
import pandas as pd
import gc

for i in range(0, 10):
    series = pd.Series([0.008, 0.002])
    json_string = series.to_json(orient="records")
    _ = gc.collect()
    print("gc_count={}".format(len(gc.get_objects())))
output:
gc_count=46619
gc_count=46619
gc_count=46620
gc_count=46621
gc_count=46622
gc_count=46623
gc_count=46624
gc_count=46625
gc_count=46626
gc_count=46627
What's interesting is that the first and second calls always report the same GC count, and then it increases by one on each iteration.
Has anyone faced this before? Are there ways to avoid the memory leak?
[Python versions tried: 3.8 and 3.9]
Update: This seems to be related to https://github.com/pandas-dev/pandas/issues/24889, and using to_dict and converting the result with json seems to be a workaround.
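For reference, here is a minimal sketch of that to_dict-plus-json workaround (my own illustration, not taken from the linked issue; for a Series of plain numbers it should produce the same JSON array as orient="records"):

import json
import pandas as pd

series = pd.Series([0.008, 0.002])
# to_dict() returns {index: value}; dumping the values as a list mirrors
# the orient="records" output for a numeric Series.
json_string = json.dumps(list(series.to_dict().values()))
print(json_string)  # [0.008, 0.002]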
Answer 1

Score: 1
The bug seems to be fixed in the latest version of pandas. It is present in pandas 1.1.3, where it can be reproduced consistently.
Possible solutions
- Upgrade to the latest version of Pandas.
- If you have to use an older version of Pandas, a workaround like the following can be used:
Instead of
series.to_json(orient="records")
We can do
str(list(series.to_dict().values()))
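As a rough sanity check (my own sketch, not part of the answer above), the repro loop from the question can be rerun with this workaround in place of to_json; the expectation is that the object count reported by gc.get_objects() stops growing on an affected version such as pandas 1.1.3:

import gc
import pandas as pd

for i in range(0, 10):
    series = pd.Series([0.008, 0.002])
    # Build the string from to_dict() instead of calling Series.to_json.
    json_string = str(list(series.to_dict().values()))
    _ = gc.collect()
    print("gc_count={}".format(len(gc.get_objects())))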