August 10, 2023 18:29:07

Write to Iceberg/Glue table from local PySpark session

Question

I want to be able to read from and write to an Iceberg table hosted on AWS Glue, from my local machine, using Python.

I have already:

Created an Iceberg table and registered it on AWS Glue
Populated the Iceberg table with limited data using Athena

I can access (read-only) the remote Iceberg table from my local laptop using PyIceberg, and now I want to write data to it. The problem is that Athena imposes some strict limits on write operations, and at the end of the day I’d like to write to the Iceberg table using a dataframe-like interface from Python, and the only option seems to be PySpark for now.

So I'm trying to do this by running a local PySpark session on my laptop, using the configuration I found in these refs:

The setup code seems to run fine, with the prints very similar to the reference video:

from pyspark.sql import SparkSession
from pyspark.sql.types import DoubleType, FloatType, LongType, StructType, StructField, StringType
import pyspark
import os

conf = (
    pyspark.SparkConf()
        .setAppName('luiz-session')
        # packages
        .set('spark.jars.packages', 'org.apache.iceberg:iceberg-spark-runtime-3.3_2.12:1.3.1,software.amazon.awssdk:bundle:2.20.18,software.amazon.awssdk:url-connection-client:2.20.18,org.apache.spark:spark-hadoop-cloud_2.12:3.2.0')
        # SQL extensions
        .set('spark.sql.extensions', 'org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions')
        # Configuring the catalog
        .set('spark.sql.catalog.glue', 'org.apache.iceberg.spark.SparkCatalog')
        .set('spark.sql.catalog.glue.catalog-impl', 'org.apache.iceberg.aws.glue.GlueCatalog')
        .set('spark.sql.catalog.glue.warehouse', "s3://my-bucket/iceberg-data")
        .set('spark.sql.catalog.glue.io-impl', 'org.apache.iceberg.aws.s3.S3FileIO')
        # AWS credentials
        .set('spark.hadoop.fs.s3a.access.key', os.environ.get("AWS_ACCESS_KEY_ID"))
        .set('spark.hadoop.fs.s3a.secret.key', os.environ.get("AWS_SECRET_ACCESS_KEY"))
)

# Start the Spark session
spark = SparkSession.builder.config(conf=conf).getOrCreate()
print("Spark Running")

Now, when I try to run a query using this:

spark.sql("SELECT * FROM glue.iceberg_table LIMIT 10;").show()

I get the following error:

IllegalArgumentException: Cannot initialize Catalog implementation org.apache.iceberg.aws.glue.GlueCatalog: Cannot find constructor for interface org.apache.iceberg.catalog.Catalog
	Missing org.apache.iceberg.aws.glue.GlueCatalog [java.lang.NoClassDefFoundError: software/amazon/awssdk/services/glue/model/InvalidInputException]

I've been trying to fix this by changing the conf and copying the Iceberg jar releases to the Spark home folder, but no luck so far. Overall it has been a difficult and frustrating experience with Spark/Iceberg/Glue.
I hope someone can help me.
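
One thing that may be worth ruling out (this is not a confirmed cause of the error above) is a version mismatch among the coordinates in spark.jars.packages: the Iceberg runtime artifact is built for a specific Spark minor version, and the list above mixes a 3.3 runtime with spark-hadoop-cloud 3.2.0. A minimal sketch of such a check follows; the helper name check_packages and the Spark version passed to it are assumptions for illustration, not part of the original post.

```python
# Hypothetical helper: flags Maven coordinates that do not line up with the
# Spark version in use. The asker's real Spark version is unknown, so it is
# passed in explicitly (in a live session it could come from pyspark.__version__).

RUNTIME_PREFIX = "iceberg-spark-runtime-"


def check_packages(packages, spark_version):
    """Return a list of warnings for coordinates that mismatch spark_version."""
    spark_minor = ".".join(spark_version.split(".")[:2])  # "3.3.2" -> "3.3"
    warnings = []
    for coord in packages.split(","):
        group, artifact, version = coord.strip().split(":")
        if artifact.startswith(RUNTIME_PREFIX):
            # Artifact looks like iceberg-spark-runtime-3.3_2.12
            built_for = artifact[len(RUNTIME_PREFIX):].split("_")[0]
            if built_for != spark_minor:
                warnings.append(
                    f"{artifact} targets Spark {built_for}, session is {spark_minor}"
                )
        elif group == "org.apache.spark" and version != spark_version:
            warnings.append(
                f"{artifact} is version {version}, session is {spark_version}"
            )
    return warnings


if __name__ == "__main__":
    packages = (
        "org.apache.iceberg:iceberg-spark-runtime-3.3_2.12:1.3.1,"
        "software.amazon.awssdk:bundle:2.20.18,"
        "software.amazon.awssdk:url-connection-client:2.20.18,"
        "org.apache.spark:spark-hadoop-cloud_2.12:3.2.0"
    )
    # "3.3.2" is an assumed Spark version, chosen only for the example.
    for warning in check_packages(packages, "3.3.2"):
        print(warning)
```

The check only compares version strings; it does not verify that a jar was actually downloaded or that a class is present on the classpath.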

Answer 1

Score: 0

In the end, the only way I found to develop locally with Glue and Iceberg was to use the amazon/aws-glue-libs:glue_libs_4.0.0_image_01 Docker image with DATALAKE_FORMATS=iceberg, and to remove the spark.jars.packages setting from the Spark configuration.
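
As a rough sketch of what the session setup might look like inside that container, the snippet below keeps the catalog settings from the question and drops spark.jars.packages. It assumes, as the answer implies, that the image already provides the Iceberg and AWS SDK jars when DATALAKE_FORMATS=iceberg is set. The function names glue_iceberg_settings and build_session are hypothetical, and the warehouse path is the placeholder from the question.

```python
def glue_iceberg_settings(warehouse, catalog="glue"):
    """Spark settings for an Iceberg catalog backed by AWS Glue.

    Deliberately omits spark.jars.packages: the jars are assumed to be
    supplied by the environment (here, the Glue Docker image).
    """
    prefix = f"spark.sql.catalog.{catalog}"
    return {
        "spark.sql.extensions":
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions",
        prefix: "org.apache.iceberg.spark.SparkCatalog",
        f"{prefix}.catalog-impl": "org.apache.iceberg.aws.glue.GlueCatalog",
        f"{prefix}.warehouse": warehouse,
        f"{prefix}.io-impl": "org.apache.iceberg.aws.s3.S3FileIO",
    }


def build_session(app_name, settings):
    """Create a SparkSession from a dict of settings."""
    # Imported lazily so the settings helper can be used without PySpark.
    from pyspark.sql import SparkSession

    builder = SparkSession.builder.appName(app_name)
    for key, value in settings.items():
        builder = builder.config(key, value)
    return builder.getOrCreate()


if __name__ == "__main__":
    settings = glue_iceberg_settings("s3://my-bucket/iceberg-data")
    spark = build_session("luiz-session", settings)
    print("Spark Running")
```

AWS credentials are left out of the settings on purpose; they would need to reach the container some other way, for example through environment variables or a mounted credentials file.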

Refs:
