Copy of Incremental source table with Spark

Question


A source table in an SQL database grows (new rows are added) every second.

I want to run some Spark code (maybe with Structured Streaming?) once per day (it is okay if the copy is at most one day out of date) to append the new rows added since the last time the code ran. The copy would be a Delta table on Databricks.

I'm not sure spark.readStream will work, since the source table is not a Delta table but a JDBC (SQL) table.

Answer 1

Score: 1


Structured Streaming doesn't support a JDBC source: link

If you have a strictly increasing column in your source table, you can read it in batch mode and store your progress in the userMetadata of your target Delta table: link.
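
Below is a minimal sketch of that approach, assuming a Databricks/PySpark session (`spark`) and a strictly increasing numeric column named `id`; the JDBC URL, credentials, table name, and target path are all hypothetical placeholders:

```python
# Sketch: batch-read new rows over JDBC using a strictly increasing column,
# and track progress in the Delta table's commit userMetadata.
# Connection details and names are hypothetical.
from delta.tables import DeltaTable
from pyspark.sql import functions as F

target_path = "/mnt/delta/source_copy"  # hypothetical Delta target

# Recover the high-water mark written by the previous run (0 on the first run).
last_id = 0
if DeltaTable.isDeltaTable(spark, target_path):
    meta = (DeltaTable.forPath(spark, target_path)
            .history(1)                     # most recent commit
            .select("userMetadata")
            .collect()[0][0])
    if meta is not None:
        last_id = int(meta)

# Read only the rows added since the last run; the WHERE clause is part of
# the subquery, so it runs in the source database.
new_rows = (spark.read.format("jdbc")
    .option("url", "jdbc:postgresql://host:5432/db")  # hypothetical URL
    .option("dbtable", f"(SELECT * FROM source_table WHERE id > {last_id}) AS t")
    .option("user", "user")
    .option("password", "password")
    .load())

# Append and record the new high-water mark in this commit's userMetadata.
max_id = new_rows.agg(F.max("id")).collect()[0][0]
if max_id is not None:  # skip the write when there is nothing new
    (new_rows.write.format("delta")
        .mode("append")
        .option("userMetadata", str(max_id))
        .save(target_path))
```

Scheduling this as a once-per-day job gives the at-most-one-day-stale copy the question asks for.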

Answer 2

Score: 0


You can use spark.readStream.format("delta").
You have to define a checkpoint location, which stores all the metadata related to the streaming pipeline. Say the stream reads up to version 2 of the source table on its first run. When you restart the pipeline the next day, even if the source table is now at version 10, the stream resumes from version 3.
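
A minimal sketch of that checkpointed restart behaviour, assuming (as this answer does) that the source is itself a Delta table, which the question's JDBC source is not; paths are hypothetical, and the availableNow trigger requires Spark 3.3+:

```python
# Sketch: incrementally stream a Delta source into a Delta copy, once per run.
source_path = "/mnt/delta/source_table"            # hypothetical Delta source
target_path = "/mnt/delta/source_copy"             # hypothetical Delta target
checkpoint_path = "/mnt/delta/_checkpoints/source_copy"

(spark.readStream.format("delta")
    .load(source_path)
    .writeStream.format("delta")
    .option("checkpointLocation", checkpoint_path)  # stream progress lives here
    .trigger(availableNow=True)  # process everything new since the last run, then stop
    .outputMode("append")
    .start(target_path)
    .awaitTermination())
```

Because progress is persisted in the checkpoint, each daily run picks up exactly where the previous one stopped.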
