BigQuery Java读取大型查询的可靠性

huangapple go评论69阅读模式
英文:

BigQuery Java read reliability for larger queries

问题

我正在使用Java从BigQuery中读取超过20,000行的数据,并使用TableResult.iterateAll()来迭代这些数据。我发现其中一些行会被迭代器返回两次,而有些行根本不会被返回。但我得到了正确的总行数。

输入的行包括两个STRUCT数组,因此我怀疑我可能遇到了大小限制。我该如何验证这一点,或者如果不是这个原因,为什么会发生这种情况?

当我对输入查询进行排序时,问题就消失了,这让我怀疑BQ在达到某个限制后会在内部重复查询,并以略有不同的顺序获取行,除非查询明确排序。

QueryJobConfiguration queryConfig = QueryJobConfiguration.newBuilder(query).setUseLegacySql(false).build();
JobId jobId = JobId.of(UUID.randomUUID().toString());
Job queryJob = bigQueryService.create(JobInfo.newBuilder(queryConfig).setJobId(jobId).build());
TableResult queryResults = queryJob.getQueryResults();

for (FieldValueList row : queryResult.iterateAll()) {
    // 处理行
}
英文:

I am reading 20,000+ rows of data from BigQuery in Java and using TableResult.iterateAll() to iterate through these. I am finding that a handful of the rows are returned twice and some are not returned at all by the iterator. I get the correct total number of rows.

The input rows include two arrays of STRUCTs so I wonder if I am hitting a size constraint. How can I verify this, or if not, why is this happening?

When I order my input query, the problem goes away, leading me to suspect that BQ is repeating the query internally after it hits some limit and getting the rows back in a slightly different order, unless the query is explicitly ordered.

QueryJobConfiguration queryConfig = QueryJobConfiguration.newBuilder(query).setUseLegacySql(false).build();
JobId jobId = JobId.of(UUID.randomUUID().toString());
Job queryJob = bigQueryService.create(JobInfo.newBuilder(queryConfig).setJobId(jobId).build());
TableResult queryResults = queryJob.getQueryResults();

for (FieldValueList row : queryResult.iterateAll()) {
    // Process row
}

答案1

得分: 2

你在使用TableResult.iterateAll()从BigQuery读取数据时遇到的重复和丢失行的问题可能与分页处理方式以及查询结果的排序方式有关。正如你提到的,你能够获得正确的总行数,这表明问题不在于查询本身,而很可能是分页处理的问题。

TableResult.iterateAll()在内部处理分页以从查询结果中检索所有行。当结果集较大时,BigQuery可能会将结果分成多个页面。当你遍历这些页面时,如果存在重叠页面或者分页机制遇到任何问题,可能会出现一些行出现多次或被跳过的情况。

你还可以修改你的查询,包括一个基于唯一标识符或数据集中的稳定列的ORDER BY子句。这将确保结果的一致排序,并有助于消除重复的问题。在查询中明确指定排序应该提供更一致和可靠的结果。

英文:

The issue you are encountering with duplicate and missing rows when reading data from BigQuery using TableResult.iterateAll() is likely related to the way pagination is handled and how the query results are ordered. As you mentioned that you get the correct total number of rows, it indicates that the issue is not with the query itself but likely with the pagination.

TableResult.iterateAll() internally handles pagination to retrieve all the rows from the query results. When the result set is large, BigQuery may split the results into multiple pages. As you iterate through these pages, it's possible for some rows to appear more than once or be skipped altogether if there are overlapping pages or if the pagination mechanism encounters any issues.

You can also modify your query to include an ORDER BY clause based on a unique identifier or a stable column in your dataset. This will ensure consistent ordering of the results and may help eliminate the issue of duplicates. Using explicit ordering in your query should provide a more consistent and reliable result.

huangapple
  • 本文由 发表于 2023年7月13日 17:12:59
  • 转载请务必保留本文链接:https://go.coder-hub.com/76677737.html
匿名

发表评论

匿名网友

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen:

确定