Streaming data into Delta Lake, reading filtered results
Question
My goal is to continuously put incoming Parquet files into Delta Lake, run queries on them, and get the results into a REST API.
All files are in S3 buckets.
//listen for changes
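// note: a streaming file source normally needs an explicit schema (via .schema(...)) unless spark.sql.streaming.schemaInference is enabled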
val df = spark.readStream.parquet("s3a://myBucket/folder")
//write changes to delta lake
df.writeStream
.format("delta")
.option("checkpointLocation", "s3a://myBucket-processed/checkpoint")
.start("s3a://myBucket-processed/")
.awaitTermination() //this call lives in another thread (because it's blocking)
//this is a bad example
val query = df.select(convertedColumnNames)
query.show()
//another bad example:
spark.readStream.format("delta").load("s3a://myBucket-processed/").select(convertedColumnNames).show()
//org.apache.spark.sql.AnalysisException: Queries with streaming sources must be executed with writeStream.start();;
How can I get the filtered data out from delta lake?
Answer 1
Score: 1
Did you try using foreachBatch?
It brings all batch-like features to streaming, and it also lets you somewhat control the number of files you write into Delta Lake.
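A minimal sketch of what that could look like, reusing the bucket paths from the question. The app name, the placeholder schema fields (id, value) and the coalesce(4) value are assumptions for illustration; replace them with your real schema and tuning:

import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.types.{LongType, StringType, StructType}

val spark = SparkSession.builder().appName("parquet-to-delta").getOrCreate()

// placeholder schema -- replace with the real schema of the incoming Parquet files
val schema = new StructType().add("id", LongType).add("value", StringType)

val df = spark.readStream
  .schema(schema) // streaming file sources need an explicit schema
  .parquet("s3a://myBucket/folder")

df.writeStream
  .foreachBatch { (batchDf: DataFrame, batchId: Long) =>
    // inside foreachBatch each micro-batch is a plain batch DataFrame,
    // so select/filter/show work without the streaming restriction
    batchDf
      .coalesce(4) // example value: caps the number of files written per micro-batch
      .write
      .format("delta")
      .mode("append")
      .save("s3a://myBucket-processed/")
  }
  .option("checkpointLocation", "s3a://myBucket-processed/checkpoint")
  .start()
  .awaitTermination() // still blocking, so keep it on its own thread as in the question

The Delta output can then be queried for the REST API as an ordinary batch read, e.g. spark.read.format("delta").load("s3a://myBucket-processed/").select(convertedColumnNames), which avoids the AnalysisException that show() raises on a streaming DataFrame.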
Comments