AWS Lambda with spark library gives OutOfMemoryError

Question

I am trying to use the following Spark libraries in my AWS Lambda:

    implementation "org.apache.spark:spark-core_2.12:2.4.6"
    implementation "org.apache.spark:spark-sql_2.12:2.4.6"

I ran the Lambda initially with 576 MB of memory and then with 1024 MB. Both times it failed with:

    Metaspace: java.lang.OutOfMemoryError java.lang.OutOfMemoryError: Metaspace
    Exception in thread "main" java.lang.Error: java.lang.OutOfMemoryError: Metaspace
        at lambdainternal.AWSLambda.<clinit>(AWSLambda.java:65)
        at java.lang.Class.forName0(Native Method)
        at java.lang.Class.forName(Class.java:348)
        at lambdainternal.LambdaRTEntry.main(LambdaRTEntry.java:150)
    Caused by: java.lang.OutOfMemoryError: Metaspace
    Exception in thread "Thread-3" java.lang.OutOfMemoryError: Metaspace

It ran successfully with a memory size of 2048 MB.

I would like to know what memory size is actually needed to use Spark in AWS Lambda. Is there a lighter version of the library? I am using it to create a Parquet file and upload it to S3.
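
A minimal sketch of what such a handler might look like (illustrative only: the class name and S3 path are placeholders, and it assumes aws-lambda-java-core plus an s3a/hadoop-aws connector on the classpath):

    import com.amazonaws.services.lambda.runtime.Context;
    import com.amazonaws.services.lambda.runtime.RequestHandler;
    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Encoders;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SparkSession;

    import java.util.Arrays;

    public class ParquetWriterHandler implements RequestHandler<Object, String> {

        @Override
        public String handleRequest(Object input, Context context) {
            // Building the SparkSession is where the JVM loads a very large number
            // of Spark/Hadoop classes -- this is what fills up Metaspace.
            SparkSession spark = SparkSession.builder()
                    .appName("lambda-parquet")
                    .master("local[*]")   // no cluster inside Lambda, so local mode
                    .getOrCreate();

            Dataset<Row> df = spark
                    .createDataset(Arrays.asList("foo", "boo"), Encoders.STRING())
                    .toDF("value");

            // Write Parquet straight to S3 via the s3a connector (placeholder path).
            df.write().mode("overwrite").parquet("s3a://bucket/dataset/");

            spark.stop();
            return "done";
        }
    }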

Thanks.

Answer 1

Score: 1

The amount of memory you allocate to your Java Lambda function is shared by the heap, metaspace, and reserved code memory.

You can consider increasing only -XX:MaxMetaspaceSize, because as per your exception log (java.lang.OutOfMemoryError: Metaspace) the issue is related to metaspace.

You can custom-tune by increasing only the metaspace without changing heap and buffer space (note: Spark is likely loading a lot of classes and using up metaspace). Please also consider running your Spark app in cluster mode.

You can check this thread for more info about heap memory, metaspace, and reserved code memory.
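
One rough way to act on this, sketched below: log the Metaspace pool from inside the handler (standard java.lang.management APIs, available on the Lambda Java runtimes) to see how much Spark's class loading actually needs, then set -XX:MaxMetaspaceSize accordingly. On the managed Java runtimes the usual place for such flags is the JAVA_TOOL_OPTIONS environment variable, e.g. -XX:MaxMetaspaceSize=256m, assuming your runtime version honors it. The helper class name below is made up.

    import java.lang.management.ManagementFactory;
    import java.lang.management.MemoryPoolMXBean;
    import java.lang.management.MemoryUsage;

    public final class MetaspaceProbe {

        private MetaspaceProbe() {}

        /** Logs Metaspace usage; call it e.g. at the end of a handler invocation. */
        public static void logMetaspace() {
            for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
                if ("Metaspace".equals(pool.getName())) {
                    MemoryUsage usage = pool.getUsage();
                    // max is -1 when no -XX:MaxMetaspaceSize limit is set
                    System.out.printf("Metaspace used=%d MB committed=%d MB max=%d MB%n",
                            usage.getUsed() >> 20,
                            usage.getCommitted() >> 20,
                            usage.getMax() >> 20);
                }
            }
        }
    }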

Answer 2

Score: 1

You definitely don't want to include Spark as a dependency in a Lambda function. Spark is way too heavy for Lambda; it is meant to run on a cluster, and Lambda isn't a cluster.

If you want to run serverless Spark code, check out AWS Glue... or don't, because AWS Glue is relatively complicated to use.

If your file is small enough to be converted to Parquet in a Lambda function, check out AWS Data Wrangler. Its releases contain pre-built layers, so you don't need to worry about all the low-level details of building layers (figuring out numpy & PyArrow is really annoying - just use the lib).

Here's the code that writes out a Parquet file:

    import awswrangler as wr
    import pandas as pd

    df = pd.DataFrame({"id": [1, 2], "value": ["foo", "boo"]})

    # Storing data on Data Lake
    wr.s3.to_parquet(
        df=df,
        path="s3://bucket/dataset/",
        dataset=True,
        database="my_db",
        table="my_table"
    )
