AWS Lambda with spark library gives OutOfMemoryError

Question

I am trying to use the following Spark libraries in my AWS Lambda:

    implementation "org.apache.spark:spark-core_2.12:2.4.6"
    implementation "org.apache.spark:spark-sql_2.12:2.4.6"

I ran the Lambda initially with 576 MB of memory and then with 1024 MB. Both times it failed with:

    Metaspace: java.lang.OutOfMemoryError java.lang.OutOfMemoryError: Metaspace
    Exception in thread "main" java.lang.Error: java.lang.OutOfMemoryError: Metaspace
        at lambdainternal.AWSLambda.<clinit>(AWSLambda.java:65)
        at java.lang.Class.forName0(Native Method)
        at java.lang.Class.forName(Class.java:348)
        at lambdainternal.LambdaRTEntry.main(LambdaRTEntry.java:150)
    Caused by: java.lang.OutOfMemoryError: Metaspace
    Exception in thread "Thread-3" java.lang.OutOfMemoryError: Metaspace

It ran successfully with a memory size of 2048 MB.

I would like to know what memory size is actually needed to use Spark in AWS Lambda. Is there a lighter version of the library? I am using it to create a Parquet file and upload it to S3.
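
A minimal sketch of what such a handler might look like (illustrative only: the class name and S3 path are placeholders, and it assumes aws-lambda-java-core plus an s3a/hadoop-aws connector on the classpath):

    import com.amazonaws.services.lambda.runtime.Context;
    import com.amazonaws.services.lambda.runtime.RequestHandler;
    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Encoders;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SparkSession;

    import java.util.Arrays;

    public class ParquetWriterHandler implements RequestHandler<Object, String> {

        @Override
        public String handleRequest(Object input, Context context) {
            // Building the SparkSession is where the JVM loads a very large number
            // of Spark/Hadoop classes -- this is what fills up Metaspace.
            SparkSession spark = SparkSession.builder()
                    .appName("lambda-parquet")
                    .master("local[*]")   // no cluster inside Lambda, so local mode
                    .getOrCreate();

            Dataset<Row> df = spark
                    .createDataset(Arrays.asList("foo", "boo"), Encoders.STRING())
                    .toDF("value");

            // Write Parquet straight to S3 via the s3a connector (placeholder path).
            df.write().mode("overwrite").parquet("s3a://bucket/dataset/");

            spark.stop();
            return "done";
        }
    }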

Thanks.

Answer 1

Score: 1

The amount of memory you allocate to your Java Lambda function is shared by the heap, metaspace, and reserved code memory.

You can consider increasing only -XX:MaxMetaspaceSize, because as per your exception log (java.lang.OutOfMemoryError: Metaspace) the issue is related to metaspace.

You can custom-tune by increasing only the metaspace without changing heap and buffer space (note: Spark is likely loading a lot of classes and using up metaspace). Please also consider running your Spark app in cluster mode.

You can check this thread for more info about heap memory, metaspace, and reserved code memory.
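
One rough way to act on this, sketched below: log the Metaspace pool from inside the handler (standard java.lang.management APIs, available on the Lambda Java runtimes) to see how much Spark's class loading actually needs, then set -XX:MaxMetaspaceSize accordingly. On the managed Java runtimes the usual place for such flags is the JAVA_TOOL_OPTIONS environment variable, e.g. -XX:MaxMetaspaceSize=256m, assuming your runtime version honors it. The helper class name below is made up.

    import java.lang.management.ManagementFactory;
    import java.lang.management.MemoryPoolMXBean;
    import java.lang.management.MemoryUsage;

    public final class MetaspaceProbe {

        private MetaspaceProbe() {}

        /** Logs Metaspace usage; call it e.g. at the end of a handler invocation. */
        public static void logMetaspace() {
            for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
                if ("Metaspace".equals(pool.getName())) {
                    MemoryUsage usage = pool.getUsage();
                    // max is -1 when no -XX:MaxMetaspaceSize limit is set
                    System.out.printf("Metaspace used=%d MB committed=%d MB max=%d MB%n",
                            usage.getUsed() >> 20,
                            usage.getCommitted() >> 20,
                            usage.getMax() >> 20);
                }
            }
        }
    }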

Answer 2

Score: 1

You definitely don't want to include Spark as a dependency in a Lambda function. Spark is way too heavy for Lambda; it is meant to run on a cluster, and Lambda isn't a cluster.

If you want to run serverless Spark code, check out AWS Glue... or don't, because AWS Glue is relatively complicated to use.

If your file is small enough to be converted to Parquet in a Lambda function, check out AWS Data Wrangler. Its releases contain pre-built layers, so you don't need to worry about all the low-level details of building layers (figuring out numpy & PyArrow is really annoying - just use the lib).

Here's the code that writes out a Parquet file:

    import awswrangler as wr
    import pandas as pd

    df = pd.DataFrame({"id": [1, 2], "value": ["foo", "boo"]})

    # Storing data on Data Lake
    wr.s3.to_parquet(
        df=df,
        path="s3://bucket/dataset/",
        dataset=True,
        database="my_db",
        table="my_table"
    )
