June 1, 2023 11:11:20

Pandas merge is exploding in memory

Question

I am trying to merge two DataFrames, and for some reason memory use blows out of proportion.
The two DataFrames are relatively large but nothing out of the ordinary. I am using a strong machine with 128 GB of RAM.

x = b.merge(a, left_on='id1', right_on='id', how='left')

This is the output:

MemoryError: Unable to allocate 1.58 TiB for an array with shape (216639452968,) and data type int64

I am probably doing something wrong here, but I can't understand why it ends up requiring 1.6 TB of memory.

Here is some info on the dataframes:

print(a.info())
print(a.memory_usage(deep=True))
print(b.info())
print(b.memory_usage(deep=True))


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10092079 entries, 0 to 10092078
Data columns (total 2 columns):
 #   Column  Dtype         
---  ------  -----         
 0   id      object        
 1   date    datetime64[ns]
dtypes: datetime64[ns](1), object(1)
memory usage: 154.0+ MB
None
Index          128
id       654665935
date      80736632
dtype: int64
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 15000000 entries, 0 to 14999999
Data columns (total 2 columns):
 #   Column  Dtype 
---  ------  ----- 
 0   id1     object
 1   id2     object
dtypes: object(2)
memory usage: 228.9+ MB
None
Index          128
id1      965676606
id2      718661312
dtype: int64

Answer 1

Score: 1

A possible explanation is that each dataset (a and b) contains multiple rows with the same id.
If the id "123" appears 10 times in "a" and 5 times in "b", the resulting DataFrame will have 50 rows with id "123".
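
To make the multiplication concrete, here is a minimal sketch with made-up data. The frames `a_demo` and `b_demo` and the helper `expected_left_merge_rows` are hypothetical names, not from the question. The helper estimates how many rows a left merge would produce before running it, assuming the key columns contain no missing values.

```python
import pandas as pd

# Hypothetical toy data: id "123" appears 10 times in a_demo and 5 times in b_demo.
a_demo = pd.DataFrame({"id": ["123"] * 10,
                       "date": pd.date_range("2023-01-01", periods=10)})
b_demo = pd.DataFrame({"id1": ["123"] * 5, "id2": list("vwxyz")})

merged = b_demo.merge(a_demo, left_on="id1", right_on="id", how="left")
print(len(merged))  # 50 rows: every one of the 5 left rows pairs with all 10 right rows


def expected_left_merge_rows(left_keys: pd.Series, right_keys: pd.Series) -> int:
    """Estimate the row count of a left merge without performing it.

    Assumes neither key column contains missing values.
    """
    left_counts = left_keys.value_counts()
    right_counts = right_keys.value_counts()
    # A key missing on the right still yields one output row per left row.
    matches = right_counts.reindex(left_counts.index).fillna(1).astype("int64")
    return int((left_counts * matches).sum())


print(expected_left_merge_rows(b_demo["id1"], a_demo["id"]))
```

Running the same helper on the real `b["id1"]` and `a["id"]` should show whether the merge is expected to produce hundreds of billions of rows, as the error message suggests.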

Make sure you don't have duplicate ids in the datasets. If you do, make sure you really need the duplicates (or use groupby(id).agg(...) to collapse them in whatever way suits your data).
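
One way to act on this, sketched with hypothetical toy frames (`a_demo`, `b_demo`): list the duplicated ids, collapse them with `groupby(...).agg(...)`, and pass `validate="many_to_one"` so that pandas raises a `MergeError` instead of silently multiplying rows. Keeping the latest date per id is only an assumption for illustration; pick whatever rule fits your data.

```python
import pandas as pd

# Hypothetical toy data standing in for the real frames.
a_demo = pd.DataFrame({
    "id": ["x", "x", "y"],
    "date": pd.to_datetime(["2023-01-01", "2023-03-01", "2023-02-01"]),
})
b_demo = pd.DataFrame({"id1": ["x", "y", "z"], "id2": ["p", "q", "r"]})

# Rows of a_demo whose id occurs more than once.
dupes = a_demo[a_demo["id"].duplicated(keep=False)]
print(dupes)

# Collapse duplicates to one row per id, keeping the most recent date (an assumption).
a_unique = a_demo.groupby("id", as_index=False).agg(date=("date", "max"))

# validate="many_to_one" makes pandas check that the right-hand keys are unique.
merged = b_demo.merge(a_unique, left_on="id1", right_on="id",
                      how="left", validate="many_to_one")
print(merged)
```

With the right-hand keys unique, a left merge returns exactly one row per row of the left frame.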

If this doesn't help, please add more information about the number of unique values in the id column of each dataset.
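
The numbers asked for here could be collected roughly as follows. The toy frames are hypothetical stand-ins for `a` and `b`.

```python
import pandas as pd

# Hypothetical toy data standing in for the real frames.
a_demo = pd.DataFrame({"id": ["x", "x", "y"]})
b_demo = pd.DataFrame({"id1": ["x", "x", "x", "z"]})

summary = pd.DataFrame(
    {
        "rows": [len(a_demo), len(b_demo)],
        "unique_ids": [a_demo["id"].nunique(), b_demo["id1"].nunique()],
        # How often the most frequent key repeats in each frame.
        "max_repeats": [a_demo["id"].value_counts().max(),
                        b_demo["id1"].value_counts().max()],
    },
    index=["a.id", "b.id1"],
)
print(summary)
```

If `unique_ids` is far smaller than `rows` in both frames, heavy key duplication is the likely cause of the blow-up.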
