July 27, 2020 14:28:29

Creating dictionary from large Pyspark dataframe showing OutOfMemoryError: Java heap space

Question

I have seen and tried many [existing][1] StackOverflow posts regarding this issue, but none of them work. I guess my Java heap space is not large enough for my dataset. **My dataset contains 6.5M rows. My Linux instance has 64 GB of RAM and 4 cores.** As per this [suggestion][1] I need to fix my code, but I think making a dictionary from a PySpark dataframe should not be very costly. Please advise if there is another way to compute this.

I just want to make a Python dictionary from my PySpark dataframe. This is the content of the dataframe; `property_sql_df.show()` shows:
+--------------+------------+--------------------+--------------------+
|            id|country_code|                name|    hash_of_cc_pn_li|
+--------------+------------+--------------------+--------------------+
|  BOND-9129450|          US|Scotron Home w/Ga...|90cb0946cf4139e12...|
|  BOND-1742850|          US|Sited in the Mead...|d5c301f00e9966483...|
|  BOND-3211356|          US|NEW LISTING - Com...|811fa26e240d726ec...|
|  BOND-7630290|          US|EC277- 9 Bedroom ...|d5c301f00e9966483...|
|  BOND-7175508|          US|East Hampton Retr...|90cb0946cf4139e12...|
+--------------+------------+--------------------+--------------------+
What I want is a dictionary with `hash_of_cc_pn_li` as the **key** and the matching `id` values as **a list**.

**Expected Output**

{
  "90cb0946cf4139e12": ["BOND-9129450", "BOND-7175508"],
  "d5c301f00e9966483": ["BOND-1742850", "BOND-7630290"]
}
**What I have tried so far:**

*Way 1:* causes `java.lang.OutOfMemoryError: Java heap space`

%%time
duplicate_property_list = {}
# collect() pulls every row of the dataframe to the driver at once
for ind in property_sql_df.collect():
    hashed_value = ind.hash_of_cc_pn_li
    property_id = ind.id
    if hashed_value in duplicate_property_list:
        duplicate_property_list[hashed_value].append(property_id)
    else:
        duplicate_property_list[hashed_value] = [property_id]

*Way 2:* not working because of the missing native OFFSET in PySpark

%%time
i = 0
limit = 1000000
for offset in range(0, total_record, limit):
    i = i + 1
    if i != 1:
        offset = offset + 1

    duplicate_property_list = {}
    duplicate_properties = {}

    # Preparing dataframe
    url = '''select id, hash_of_cc_pn_li from properties_df LIMIT {} OFFSET {}'''.format(limit, offset)
    properties_sql_df = spark.sql(url)

    # Grouping dataset
    rows = properties_sql_df.groupBy("hash_of_cc_pn_li").agg(F.collect_set("id").alias("ids")).collect()
    duplicate_property_list = {row.hash_of_cc_pn_li: row.ids for row in rows}

    # Filter the dictionary to keep only elements with a duplicate count of 2 or more
    duplicate_properties = filterTheDict(duplicate_property_list, lambda elem: len(elem[1]) >= 2)

    # Writing to file
    with open('duplicate_detected/duplicate_property_list_all_' + str(i) + '.json', 'w') as fp:
        json.dump(duplicate_property_list, fp)

**What I get now on the console:**

> java.lang.OutOfMemoryError: Java heap space

and this error is shown in the **Jupyter notebook output**:

ERROR:py4j.java_gateway:An error occurred while trying to connect to the Java server (127.0.0.1:33097)

  [1]: https://stackoverflow.com/questions/37335/how-to-deal-with-java-lang-outofmemoryerror-java-heap-space-error

**This is the follow-up question that I asked here:** https://stackoverflow.com/questions/63103302/creating-dictionary-from-pyspark-dataframe-showing-outofmemoryerror-java-heap-s

Answer 1

Score: 1

Why not keep as much of the data and processing as possible on the executors, rather than collecting everything to the driver? If I understand this correctly, you could use PySpark transformations and aggregations and save the result directly to JSON, leveraging the executors, then load that JSON output (likely split into several part files) back into Python as a dictionary. Admittedly, this introduces I/O overhead, but it should let you get around the OOM heap space errors. Step by step:

import pyspark.sql.functions as f
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
data = [
    ("BOND-9129450", "90cb"),
    ("BOND-1742850", "d5c3"),
    ("BOND-3211356", "811f"),
    ("BOND-7630290", "d5c3"),
    ("BOND-7175508", "90cb"),
]
df = spark.createDataFrame(data, ["id", "hash_of_cc_pn_li"])
# Group on the executors and write one JSON object per group; nothing is collected to the driver
df.groupBy(
    f.col("hash_of_cc_pn_li"),
).agg(
    f.collect_set("id").alias("id")  # use f.collect_list() here if you're not interested in deduplication of BOND-XXXXX values
).write.json("./test.json")
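
To make the shape of that output concrete, here is a plain-Python sketch (no Spark involved) of what the grouping computes on the five sample rows and how each group ends up as one compact JSON object per line, which is the layout Spark's JSON writer uses for its part files. The names `rows`, `groups`, and `json_lines` are only illustrative, and the ids are sorted here purely to make the output deterministic; `collect_set` itself gives no ordering guarantee.

```python
import json
from collections import defaultdict

# Sample rows from the answer above: (id, hash_of_cc_pn_li)
rows = [
    ("BOND-9129450", "90cb"),
    ("BOND-1742850", "d5c3"),
    ("BOND-3211356", "811f"),
    ("BOND-7630290", "d5c3"),
    ("BOND-7175508", "90cb"),
]

# Plain-Python stand-in for groupBy("hash_of_cc_pn_li").agg(collect_set("id"))
groups = defaultdict(set)
for prop_id, hashed in rows:
    groups[hashed].add(prop_id)

# One compact JSON object per line, mirroring the JSON Lines layout of a part file
json_lines = [
    json.dumps({"hash_of_cc_pn_li": key, "id": sorted(ids)}, separators=(",", ":"))
    for key, ids in sorted(groups.items())
]
print("\n".join(json_lines))
```

With real data each part file may hold many such lines, so any loader has to parse the files line by line rather than as a single JSON document.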

Inspecting the output path:

ls -l ./test.json
-rw-r--r-- 1 jovyan users  0 Jul 27 08:29 part-00000-1fb900a1-c624-4379-a652-8e5b9dee8651-c000.json
-rw-r--r-- 1 jovyan users 50 Jul 27 08:29 part-00039-1fb900a1-c624-4379-a652-8e5b9dee8651-c000.json
-rw-r--r-- 1 jovyan users 65 Jul 27 08:29 part-00043-1fb900a1-c624-4379-a652-8e5b9dee8651-c000.json
-rw-r--r-- 1 jovyan users 65 Jul 27 08:29 part-00159-1fb900a1-c624-4379-a652-8e5b9dee8651-c000.json
-rw-r--r-- 1 jovyan users  0 Jul 27 08:29 _SUCCESS

Loading into Python as a dict:

import json
from glob import glob

data = []
for file_name in glob('./test.json/*.json'):
    # named `fp` so the `f` alias for pyspark.sql.functions is not shadowed
    with open(file_name) as fp:
        # Spark writes one JSON object per line, so parse line by line;
        # empty partitions produce empty files, which simply yield no lines
        for line in fp:
            if line.strip():
                data.append(json.loads(line))

Finally:

{item['hash_of_cc_pn_li']: item['id'] for item in data}
{'d5c3': ['BOND-7630290', 'BOND-1742850'],
 '811f': ['BOND-3211356'],
 '90cb': ['BOND-9129450', 'BOND-7175508']}
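
If only the duplicated hashes are of interest, as in the question's Way 2, the filter can be applied while loading so that single-id groups never enter the dictionary. The following is a minimal sketch under a few assumptions: the part files are in JSON Lines format, `load_duplicates` and `min_ids` are hypothetical names introduced here, and a temporary directory with hand-written, simplified part file names stands in for real Spark output.

```python
import json
import os
import tempfile
from glob import glob


def load_duplicates(output_dir, min_ids=2):
    """Build {hash: [ids]} from JSON Lines part files, keeping hashes with >= min_ids ids."""
    result = {}
    for file_name in sorted(glob(os.path.join(output_dir, "*.json"))):
        with open(file_name) as fp:
            for line in fp:
                if not line.strip():
                    continue  # skip blank lines; empty part files yield nothing at all
                record = json.loads(line)
                if len(record["id"]) >= min_ids:
                    result[record["hash_of_cc_pn_li"]] = record["id"]
    return result


# Demo with fake part files (one of them empty, one holding several lines)
with tempfile.TemporaryDirectory() as tmp:
    parts = {
        "part-00000.json": "",
        "part-00001.json": '{"hash_of_cc_pn_li":"811f","id":["BOND-3211356"]}\n',
        "part-00002.json": '{"hash_of_cc_pn_li":"d5c3","id":["BOND-7630290","BOND-1742850"]}\n'
                           '{"hash_of_cc_pn_li":"90cb","id":["BOND-9129450","BOND-7175508"]}\n',
    }
    for name, content in parts.items():
        with open(os.path.join(tmp, name), "w") as fp:
            fp.write(content)
    duplicates = load_duplicates(tmp)

print(duplicates)
```

Filtering at load time keeps the driver-side dictionary limited to the groups that actually contain duplicates; the same condition could instead be applied in Spark before writing, which would also shrink the files.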

I hope this helps! Thank you for the good question!
