Truncate existing BigQuery table before Dataflow job runs

Question

I have a GCP Dataflow pipeline configured with a SQL SELECT query that picks specific rows from a Postgres table and inserts them automatically into a table in a BigQuery dataset. The pipeline is scheduled to run daily at 12:00 AM UTC.

When the pipeline initiates a job, it runs successfully and copies the desired rows. However, when the next job runs, it copies the same set of rows again into the BigQuery table, hence resulting in data duplication.

Is there a way to truncate the BigQuery table before the pipeline runs? This seems like a common problem, so I am looking for an easy solution that does not require a custom Dataflow template.

Answer 1

Score: 3

BigQueryIO has an option called WriteDisposition, where you can use WRITE_TRUNCATE.

The Apache Beam documentation for WriteDisposition describes WRITE_TRUNCATE as follows:

> Specifies that write should replace a table.
>
> The replacement may occur in multiple steps - for instance by first removing the existing table, then creating a replacement, then filling it in. This is not an atomic operation, and external programs may see the table in any of these intermediate steps.
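
As a rough sketch of what setting this option can look like, the snippet below uses the Beam Python SDK, where the counterpart of BigQueryIO's WriteDisposition is the `write_disposition` argument of `beam.io.WriteToBigQuery`. The table name and schema are hypothetical placeholders, and `beam.Create` stands in for the Postgres read. Running `run()` assumes `apache-beam[gcp]` is installed, GCP credentials are available, and the pipeline options include a temp location for load jobs.

```python
# Sketch only: the table and schema below are hypothetical placeholders.
TABLE = "my-project:my_dataset.my_table"
SCHEMA = "id:INTEGER,name:STRING"


def write_options(table=TABLE, schema=SCHEMA):
    """Keyword arguments for beam.io.WriteToBigQuery that replace the table."""
    return {
        "table": table,
        "schema": schema,
        # Same value as beam.io.BigQueryDisposition.WRITE_TRUNCATE:
        # replace the existing contents instead of appending on each run.
        "write_disposition": "WRITE_TRUNCATE",
        "create_disposition": "CREATE_IF_NEEDED",
    }


def run(rows, pipeline_args=None):
    """Write `rows` (an iterable of dicts) to BigQuery, replacing the table."""
    # Imported lazily so the helper above can be used without Beam installed.
    import apache_beam as beam
    from apache_beam.options.pipeline_options import PipelineOptions

    with beam.Pipeline(options=PipelineOptions(pipeline_args)) as p:
        (
            p
            | "Rows" >> beam.Create(list(rows))  # stand-in for the Postgres read
            | "Write" >> beam.io.WriteToBigQuery(**write_options())
        )
```

With this disposition, each daily run leaves the table holding only the rows from that run, which avoids the duplication described in the question.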

If your use case cannot tolerate the table being unavailable or partially filled during the operation, a common pattern is to write the data to a secondary (staging) table and then use an atomic BigQuery operation to replace the original table (e.g., CREATE OR REPLACE TABLE).
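
The staging pattern can be sketched as follows, assuming the pipeline has already loaded the fresh rows into a staging table. The function names and table IDs are hypothetical; `swap_tables` assumes the `google-cloud-bigquery` client library and credentials are available.

```python
def swap_statement(target, staging):
    """Build a statement that replaces `target` with the rows in `staging`."""
    return (
        f"CREATE OR REPLACE TABLE `{target}` AS\n"
        f"SELECT * FROM `{staging}`"
    )


def swap_tables(target, staging):
    """Run the swap in BigQuery and wait for it to finish."""
    # Imported lazily so swap_statement works without the client installed.
    from google.cloud import bigquery

    client = bigquery.Client()
    client.query(swap_statement(target, staging)).result()
```

For example, `swap_tables("my-project.my_dataset.my_table", "my-project.my_dataset.my_table_staging")` would be run after the Dataflow job completes, so readers of the original table see either the old contents or the new ones. Note that the replaced table takes its schema from the query result.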

huangapple
  • Posted on February 8, 2023 at 20:51:14
  • When reposting, please keep the link to this article: https://go.coder-hub.com/75386067.html