Copy of Incremental source table with Spark
Question
A source table in a SQL database receives new rows every second.

I want to run some Spark code (maybe with Structured Streaming?) once per day (it is okay if the copy is at most one day out of date) to append the rows added since the last time the code ran. The copy would be a Delta table on Databricks.

I'm not sure spark.readStream will work, since the source table is not a Delta table but a JDBC (SQL) table.
Answer 1

Score: 1
Structured Streaming doesn't support a JDBC source: link

If you have a strictly increasing column in your source table, you can read it in batch mode and store your progress in the userMetadata of your target Delta table (link).
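A minimal sketch of this batch approach could look like the following; the JDBC URL, the table name source_table, the increasing column id, the credentials, and the target path are all hypothetical placeholders, and it assumes every commit to the target sets userMetadata:

```python
from delta.tables import DeltaTable
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

TARGET = "/mnt/delta/source_copy"  # hypothetical Delta target path
JDBC_URL = "jdbc:sqlserver://host:1433;databaseName=mydb"  # hypothetical

# Recover the last processed value of the increasing column from the most
# recent commit's userMetadata (DESCRIBE HISTORY returns newest commits first).
last_id = 0
if DeltaTable.isDeltaTable(spark, TARGET):
    meta = (spark.sql(f"DESCRIBE HISTORY delta.`{TARGET}`")
                 .select("userMetadata").first()["userMetadata"])
    if meta:
        last_id = int(meta)

# Batch-read only the rows added since the last run; the WHERE clause is
# pushed down to the SQL database through the JDBC "query" option.
new_rows = (spark.read.format("jdbc")
            .option("url", JDBC_URL)
            .option("query", f"SELECT * FROM source_table WHERE id > {last_id}")
            .option("user", "user")
            .option("password", "password")
            .load()
            .cache())  # cached: scanned once for the max and once for the write

max_id = new_rows.agg(F.max("id")).first()[0]
if max_id is not None:
    # Append the new rows and record progress in this commit's userMetadata,
    # so the next daily run resumes from here.
    (new_rows.write.format("delta")
             .mode("append")
             .option("userMetadata", str(max_id))
             .save(TARGET))
```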
Answer 2

Score: 0
You can perform spark.readStream.format("delta").

You have to define a checkpoint, which stores all the metadata related to the streaming pipeline. Say you streamed up to version 2 on your first run. When you restart the pipeline the next day, even if your source table is at version 10, the stream will resume from version 3.
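For illustration, a minimal sketch of this answer's approach, assuming the source is itself a Delta table (which, per the question, it is not here) and using hypothetical paths; the availableNow trigger (Spark 3.3+) makes a daily run consume everything new and then stop:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Stream from the source Delta table. The checkpoint records which source
# versions have already been processed, so a daily restart picks up exactly
# where the previous run left off.
(spark.readStream
      .format("delta")
      .load("/mnt/delta/source_table")                        # hypothetical source
      .writeStream
      .format("delta")
      .option("checkpointLocation", "/mnt/delta/_ckpt/copy")  # hypothetical
      .trigger(availableNow=True)  # process all new data, then stop
      .start("/mnt/delta/source_copy"))                       # hypothetical target
```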