Reference 'unit' is ambiguous, could be: unit, unit
Question
I'm trying to load all incoming parquet files from an S3 bucket and process them with Delta Lake, but I'm getting an exception.
val df = spark.readStream().parquet("s3a://$bucketName/")

df.select("unit") // filter data!
    .writeStream()
    .format("delta")
    .outputMode("append")
    .option("checkpointLocation", checkpointFolder)
    .start(bucketProcessed) // output goes in another bucket
    .awaitTermination()
It throws an exception because "unit" is ambiguous.
I've tried debugging it, and for some reason it finds "unit" twice.
What is going on here? Could it be an encoding issue?
Edit:
This is how I create the Spark session:
val spark = SparkSession.builder()
    .appName("streaming")
    .master("local")
    .config("spark.hadoop.fs.s3a.endpoint", endpoint)
    .config("spark.hadoop.fs.s3a.access.key", accessKey)
    .config("spark.hadoop.fs.s3a.secret.key", secretKey)
    .config("spark.hadoop.fs.s3a.path.style.access", true)
    .config("spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version", 2)
    .config("spark.hadoop.mapreduce.fileoutputcommitter.cleanup-failures.ignored", true)
    .config("spark.sql.caseSensitive", true)
    .config("spark.sql.streaming.schemaInference", true)
    .config("spark.sql.parquet.mergeSchema", true)
    .orCreate
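As an aside, and not something from the original post: spark.sql.streaming.schemaInference is what lets readStream().parquet(...) above run without a schema; with it off, the file stream needs one supplied up front. A hypothetical sketch of that alternative, where only the unit column is known from the question and the rest of the schema would have to come from the real data:

import org.apache.spark.sql.types.DataTypes
import org.apache.spark.sql.types.StructType

// Declare the schema explicitly instead of relying on streaming schema inference.
// `unit` is the only column named in the question; any other fields are unknown here.
val knownSchema = StructType()
    .add("unit", DataTypes.StringType)

val dfWithSchema = spark.readStream()
    .schema(knownSchema)            // DataStreamReader.schema(...) bypasses inference
    .parquet("s3a://$bucketName/")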
Edit 2:
Output from df.printSchema():
2020-10-21 13:15:33,962 [main] WARN org.apache.spark.sql.execution.datasources.DataSource - Found duplicate column(s) in the data schema and the partition schema: `unit`;
root
|-- unit: string (nullable = true)
|-- unit: string (nullable = true)
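The WARN line is the important clue: besides the unit column inside the parquet files, Spark's partition discovery has inferred a second unit column, which can only come from a path segment of the form unit=<value> somewhere under the bucket root. A minimal, hypothetical local reproduction of that collision (invented paths and values, nothing from the actual bucket):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions

fun main() {
    val spark = SparkSession.builder()
        .appName("dup-unit-demo")
        .master("local[*]")
        .config("spark.sql.streaming.schemaInference", true)
        .getOrCreate()

    // A parquet file that itself contains a `unit` column, written under a directory
    // whose name follows the partition-layout convention `column=value`.
    spark.range(3)
        .withColumn("unit", functions.lit("A"))
        .write().mode("overwrite").parquet("/tmp/dup_unit_demo/unit=A")

    // Reading the parent directory triggers partition discovery: the path segment
    // `unit=A` is inferred as a partition column named `unit`, colliding with the
    // column already stored in the files. With streaming schema inference this logs
    // the same WARN as above and printSchema() lists `unit` twice, so select("unit")
    // becomes ambiguous. (A batch spark.read() of the same directory typically fails
    // outright with the same "duplicate column(s)" message.)
    spark.readStream().parquet("/tmp/dup_unit_demo").printSchema()
}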
Answer 1
Score: 0
Reading the same data like this...
val df = spark.readStream().parquet("s3a://$bucketName/*")
...solves the issue. For whatever reason. I would love to know why...
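A likely explanation, inferred from the WARN in edit 2 rather than confirmed anywhere in this thread: with parquet("s3a://$bucketName/") the bucket root is the base path for partition discovery, so a directory named unit=<value> below it is turned into a partition column unit that collides with the unit column stored in the files. With the /* glob, each matched path becomes its own base path, so those directory names are no longer parsed as partitions and only the file column remains. A quick way to check this, reusing spark and bucketName from the snippets above:

// Compare the inferred schemas of the two path forms; the expected schema for the
// glob form is an assumption based on the explanation above, not output from the thread.
val direct = spark.readStream().parquet("s3a://$bucketName/")
direct.printSchema()   // per edit 2: `unit` appears twice (file column + inferred partition)

val globbed = spark.readStream().parquet("s3a://$bucketName/*")
globbed.printSchema()  // expected: a single `unit` column, so select("unit") is unambiguous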
Comments