Copy of Incremental source table with Spark

Question


A source table in an SQL database grows (new rows are added) every second.

I want to run some Spark code (maybe with Structured Streaming?) once per day (it is okay if the copy is at most one day out of date) to append the new rows added since the last time the code ran. The copy would be a Delta table on Databricks.

I'm not sure spark.readStream will work, since the source table is not a Delta table but a JDBC (SQL) table.

Answer 1

Score: 1


Structured Streaming doesn't support a JDBC source: link

If you have a strictly increasing column in your source table, you can read it in batch mode and store your progress in the userMetadata of your target Delta table: link.
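
Below is a minimal sketch of that approach, assuming a Databricks/PySpark session (`spark`) and a strictly increasing numeric column named `id`; the JDBC URL, credentials, table name, and target path are all hypothetical placeholders:

```python
# Sketch: batch-read new rows over JDBC using a strictly increasing column,
# and track progress in the Delta table's commit userMetadata.
# Connection details and names are hypothetical.
from delta.tables import DeltaTable
from pyspark.sql import functions as F

target_path = "/mnt/delta/source_copy"  # hypothetical Delta target

# Recover the high-water mark written by the previous run (0 on the first run).
last_id = 0
if DeltaTable.isDeltaTable(spark, target_path):
    meta = (DeltaTable.forPath(spark, target_path)
            .history(1)                     # most recent commit
            .select("userMetadata")
            .collect()[0][0])
    if meta is not None:
        last_id = int(meta)

# Read only the rows added since the last run; the WHERE clause is part of
# the subquery, so it runs in the source database.
new_rows = (spark.read.format("jdbc")
    .option("url", "jdbc:postgresql://host:5432/db")  # hypothetical URL
    .option("dbtable", f"(SELECT * FROM source_table WHERE id > {last_id}) AS t")
    .option("user", "user")
    .option("password", "password")
    .load())

# Append and record the new high-water mark in this commit's userMetadata.
max_id = new_rows.agg(F.max("id")).collect()[0][0]
if max_id is not None:  # skip the write when there is nothing new
    (new_rows.write.format("delta")
        .mode("append")
        .option("userMetadata", str(max_id))
        .save(target_path))
```

Scheduling this as a once-per-day job gives the at-most-one-day-stale copy the question asks for.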

Answer 2

Score: 0


You can use spark.readStream.format("delta").
You have to define a checkpoint location, which stores all the metadata related to the streaming pipeline. Say the stream reads up to version 2 of the source table on its first run. When you restart the pipeline the next day, even if the source table is now at version 10, the stream resumes from version 3.
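
A minimal sketch of that checkpointed restart behaviour, assuming (as this answer does) that the source is itself a Delta table, which the question's JDBC source is not; paths are hypothetical, and the availableNow trigger requires Spark 3.3+:

```python
# Sketch: incrementally stream a Delta source into a Delta copy, once per run.
source_path = "/mnt/delta/source_table"            # hypothetical Delta source
target_path = "/mnt/delta/source_copy"             # hypothetical Delta target
checkpoint_path = "/mnt/delta/_checkpoints/source_copy"

(spark.readStream.format("delta")
    .load(source_path)
    .writeStream.format("delta")
    .option("checkpointLocation", checkpoint_path)  # stream progress lives here
    .trigger(availableNow=True)  # process everything new since the last run, then stop
    .outputMode("append")
    .start(target_path)
    .awaitTermination())
```

Because progress is persisted in the checkpoint, each daily run picks up exactly where the previous one stopped.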
