Writing DataFrames to disk periodically instead of letting memory usage grow indefinitely

Question

I'm collecting a bunch of data from an ever-increasing pool of sites in dataframes, processing that data, combining it, and ultimately saving it to disk.

This has worked well so far, but I'm now coming to the end of my server's memory capacity. Although the memory consumption is far greater than the size of my DFs (and there's definitely optimization that can be done there), to be able to scale this process at some point I will need to look at writing to disk and not having everything stored in memory. I might as well do this properly now, instead of optimizing something I will have to re-do at some point in the future anyway.

How would you structure a re-factor of this sort? My idea is:

  1. Continue to use a dataframe per site
  2. Process dataframe
  3. Save to disk
  4. Combine all saved DFs from disk into a single output file at the end of the run

Thanks


Answer 1

Score: 1

I faced this same issue last week. The solution I came up with was batching.

I experimented on my machine and found that 50,000 rows per dataframe is a good number. I keep writing to a dataframe until it reaches this size, then I save it, add an index to the file name, and move on to a new, clean dataframe. I prefer to overwrite the old dataframe with the new one to avoid any issues with the garbage collector.
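
The batching idea might be sketched roughly as follows. This is only an illustration under assumptions: it assumes pandas and CSV output, and the `BatchWriter` class, the `batch_` file prefix, and the tiny demo batch size are hypothetical names and values, not anything from the answer. It also buffers rows in a plain list and builds the dataframe at flush time, instead of growing a dataframe row by row, because that is usually cheaper.

```python
import tempfile
from pathlib import Path

import pandas as pd


class BatchWriter:
    """Buffer rows and flush them to a numbered CSV file every `batch_size` rows."""

    def __init__(self, out_dir, batch_size=50_000, prefix="batch"):
        self.out_dir = Path(out_dir)
        self.out_dir.mkdir(parents=True, exist_ok=True)
        self.batch_size = batch_size
        self.prefix = prefix
        self.rows = []   # rows (dicts) of the batch currently being filled
        self.paths = []  # batch files written so far

    def add(self, row):
        self.rows.append(row)
        if len(self.rows) >= self.batch_size:
            self.flush()

    def flush(self):
        if not self.rows:
            return
        # The running count of written files becomes the index in the file name.
        path = self.out_dir / f"{self.prefix}_{len(self.paths):05d}.csv"
        pd.DataFrame(self.rows).to_csv(path, index=False)
        self.paths.append(path)
        self.rows = []  # rebind to a fresh list so the old batch can be freed


# Demo with a tiny batch size so that several files are produced.
writer = BatchWriter(tempfile.mkdtemp(), batch_size=3)
for i in range(7):
    writer.add({"site": f"site{i % 2}", "value": i})
writer.flush()  # write the final, partially filled batch
```

Calling `flush()` once at the end matters: without it, the last rows that never reached the batch size would stay in memory and never be saved.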

Combining the dataframes can be done in several ways that are light on memory. For instance, rather than loading all of them into memory and concatenating them into one very large dataframe, you can simply seek to the end of the target file, write the next dataframe directly to it, and repeat.

This does not cause any memory problems, because at any point in time only one dataframe is loaded in memory.
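
A minimal sketch of that append-style combine, assuming pandas, CSV batch files, and that every batch has the same columns in the same order. The function name `combine_csv_batches` and the file names are hypothetical. Opening the target in append mode positions each write at the end of the file, and the header is written only for the first batch.

```python
import tempfile
from pathlib import Path

import pandas as pd


def combine_csv_batches(batch_paths, target):
    """Append each batch file to `target`, holding one batch in memory at a time."""
    target = Path(target)
    if target.exists():
        target.unlink()  # start from a clean output file
    for i, path in enumerate(batch_paths):
        df = pd.read_csv(path)
        # mode="a" appends at the end of the file; emit the header only once.
        df.to_csv(target, mode="a", header=(i == 0), index=False)
    return target


# Demo: create two small batch files, then combine them.
tmp = Path(tempfile.mkdtemp())
parts = []
for n, start in enumerate([0, 3]):
    part = tmp / f"batch_{n:05d}.csv"
    pd.DataFrame(
        {"site": ["a", "b", "c"], "value": list(range(start, start + 3))}
    ).to_csv(part, index=False)
    parts.append(part)

combined = pd.read_csv(combine_csv_batches(parts, tmp / "combined.csv"))
```

The final `pd.read_csv` of the combined file is only there to check the result in the demo; in a real run the combined file would normally not be loaded back in full.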


Answer 2

Score: 0

Write the dataframes to SQL. Query only the required, filtered data when needed.
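
As a rough illustration of this approach, the sketch below assumes pandas with SQLite through Python's built-in `sqlite3` module; the answer does not name a database, and the `readings` table, its columns, and the `sites.db` file are hypothetical. Each per-site dataframe is appended to the same table, and only the filtered rows are read back.

```python
import sqlite3
import tempfile
from pathlib import Path

import pandas as pd

# A database file on disk, so the accumulated data does not live in memory.
db_path = Path(tempfile.mkdtemp()) / "sites.db"
conn = sqlite3.connect(db_path)

# Append one dataframe per site to the same table.
for site in ["a", "b"]:
    df = pd.DataFrame({"site": [site] * 3, "value": [1, 2, 3]})
    df.to_sql("readings", conn, if_exists="append", index=False)

# Read back only the rows that are needed, using query parameters.
subset = pd.read_sql_query(
    "SELECT site, value FROM readings "
    "WHERE site = ? AND value >= ? ORDER BY value",
    conn,
    params=("b", 2),
)
conn.close()
```

With this layout there is no separate combine step at the end of the run: the table is the combined output, and an export to a single file can be done with one query if it is still needed.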

  • Posted by huangapple on March 4, 2023, 00:52:25
  • When reposting, please keep the link to this article: https://go.coder-hub.com/75629829.html