Question


I have a BigQuery table called cdn_daily_user_playback_requests_1MONTH. It holds a large amount of data built from daily records, so it contains full months such as 2023-07, 2023-08, and so on. Suppose I regenerate the data for 2023-07 and want to write it into that table, which already has records for 2023-07. How do I replace the existing data for that month with the new data in my Apache Beam code in Java?

My pipeline code is here:

pipeline
    .apply("Read from cdn_requests BigQuery", BigQueryIO
        .read(new CdnMediaRequestLogEntity.FromSchemaAndRecord())
        .fromQuery(cdnRequestsQueryString)
        .usingStandardSql())
    .apply("Validate and Filter Cdn Media Request Log Objects", Filter.by(new CdnMediaRequestValidator()))
    .apply("Convert Cdn Logs To Key Value Pairs", ParDo.of(new CdnMediaRequestResponseSizeKeyValuePairConverter()))
    .apply("Sum the Response Sizes By Key", Sum.longsPerKey())
    .apply("Convert To New Daily Requests Objects", ParDo.of(new CdnDailyRequestConverter(projectId, kind)))
    .apply("Convert Cdn Media Request Entities to Big Query Objects", ParDo.of(new BigQueryCdnDailyRequestRowConverter()))
    .apply("Write Data To BigQuery", BigQueryIO.writeTableRows()
        .to(writeCdnMediaRequestTable)
        .withSchema(cdnDailyRequestSchema)
        .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_IF_NEEDED)
        .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_TRUNCATE));

I tried BigQueryIO.Write.WriteDisposition.WRITE_TRUNCATE, but as I understand it, this removes all the data in the table and then writes the newly created rows. I only want to remove the data for 2023-07, not everything.

Answer 1

Score: 2


Solution:
I found a solution that worked: a SerializableFunction that uses the partition key to pick the destination when writing to BigQuery (my table is partitioned on the Date column, which has the DATE data type). As a result, the write only replaces the partitions of the table that the new rows fall into, rather than the whole table.

This is my sample code for the solution:

import com.google.api.services.bigquery.model.TableReference;
import com.google.api.services.bigquery.model.TableRow;
import com.google.api.services.bigquery.model.TimePartitioning;
import org.apache.beam.sdk.io.gcp.bigquery.TableDestination;
import org.apache.beam.sdk.transforms.SerializableFunction;
import org.apache.beam.sdk.values.ValueInSingleWindow;
import org.joda.time.DateTime;
import org.joda.time.DateTimeZone;
import org.joda.time.format.DateTimeFormat;
import org.joda.time.format.DateTimeFormatter;

// Routes each row to the daily partition (table$yyyyMMdd) that matches its Date column.
public class BigQueryDayPartitionDestinations implements SerializableFunction<ValueInSingleWindow<TableRow>, TableDestination> {

    private final String projectId;
    private final String datasetId;
    private final String pattern;
    private final String table;

    public static BigQueryDayPartitionDestinations writePartitionsPerDay(String projectId, String datasetId, String tablePrefix) {
        // The trailing "$" starts the partition decorator, e.g. my_table$20230715.
        return new BigQueryDayPartitionDestinations(projectId, datasetId, "yyyyMMdd", tablePrefix + "$");
    }

    private BigQueryDayPartitionDestinations(String projectId, String datasetId, String pattern, String table) {
        this.projectId = projectId;
        this.datasetId = datasetId;
        this.pattern = pattern;
        this.table = table;
    }

    @Override
    public TableDestination apply(ValueInSingleWindow<TableRow> input) {
        DateTimeFormatter partition = DateTimeFormat.forPattern(pattern).withZone(DateTimeZone.forID("Asia/Tokyo"));
        DateTimeFormatter formatter = DateTimeFormat.forPattern("yyyy-MM-dd").withZone(DateTimeZone.forID("Asia/Tokyo"));

        TableReference reference = new TableReference();
        reference.setProjectId(this.projectId);
        reference.setDatasetId(this.datasetId);

        // The row's Date value (yyyy-MM-dd) decides which partition it lands in.
        String date = input.getValue().get("Date").toString();
        DateTime dateTime = formatter.parseDateTime(date);

        String tableId = table + dateTime.toInstant().toString(partition);

        reference.setTableId(tableId);
        return new TableDestination(reference, null, new TimePartitioning().setType("DAY").setField("Date"));
    }
}
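
The table ID that the function above builds is the table name followed by "$" and the day in yyyyMMdd form (a partition decorator). The following is a minimal sketch of just that string logic, so it can be checked without running a pipeline. It is not part of the original answer: the class name PartitionDecorators and the methods forDay and forMonth are hypothetical, and it uses java.time from the standard library instead of Joda-Time.

```java
import java.time.LocalDate;
import java.time.YearMonth;
import java.time.format.DateTimeFormatter;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration only; not from the original answer.
public class PartitionDecorators {

    // BASIC_ISO_DATE formats a LocalDate in the compact yyyyMMdd form.
    private static final DateTimeFormatter DECORATOR = DateTimeFormatter.BASIC_ISO_DATE;

    /** Builds "table$yyyyMMdd" from a row's Date value given as yyyy-MM-dd. */
    public static String forDay(String table, String isoDate) {
        LocalDate day = LocalDate.parse(isoDate); // parses ISO yyyy-MM-dd
        return table + "$" + day.format(DECORATOR);
    }

    /** Lists the daily partition of every day in one month, given as yyyy-MM. */
    public static List<String> forMonth(String table, String yearMonth) {
        YearMonth month = YearMonth.parse(yearMonth);
        List<String> partitions = new ArrayList<>();
        for (int d = 1; d <= month.lengthOfMonth(); d++) {
            partitions.add(table + "$" + month.atDay(d).format(DECORATOR));
        }
        return partitions;
    }

    public static void main(String[] args) {
        String table = "cdn_daily_user_playback_requests_1MONTH";
        // prints cdn_daily_user_playback_requests_1MONTH$20230715
        System.out.println(forDay(table, "2023-07-15"));
        // prints 31, the number of daily partitions a full July rerun touches
        System.out.println(forMonth(table, "2023-07").size());
    }
}
```

In the pipeline from the question, an instance of BigQueryDayPartitionDestinations would presumably be passed to BigQueryIO.writeTableRows().to(...) in place of writeCdnMediaRequestTable, with WRITE_TRUNCATE kept as the write disposition, so that only the partitions receiving rows are overwritten. Days in the month that produce no new rows would keep their old data.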
