问题

我正在使用Java从BigQuery中读取超过20,000行的数据，并使用TableResult.iterateAll()来迭代这些数据。我发现其中一些行会被迭代器返回两次，而有些行根本不会被返回。但我得到了正确的总行数。

输入的行包括两个STRUCT数组，因此我怀疑我可能遇到了大小限制。我该如何验证这一点，或者如果不是这个原因，为什么会发生这种情况？

当我对输入查询进行排序时，问题就消失了，这让我怀疑BQ在达到某个限制后会在内部重复查询，并以略有不同的顺序获取行，除非查询明确排序。

QueryJobConfiguration queryConfig = QueryJobConfiguration.newBuilder(query).setUseLegacySql(false).build();
JobId jobId = JobId.of(UUID.randomUUID().toString());
Job queryJob = bigQueryService.create(JobInfo.newBuilder(queryConfig).setJobId(jobId).build());
TableResult queryResults = queryJob.getQueryResults();

for (FieldValueList row : queryResult.iterateAll()) {
    // 处理行
}

英文:

I am reading 20,000+ rows of data from BigQuery in Java and using TableResult.iterateAll() to iterate through these. I am finding that a handful of the rows are returned twice and some are not returned at all by the iterator. I get the correct total number of rows.

The input rows include two arrays of STRUCTs so I wonder if I am hitting a size constraint. How can I verify this, or if not, why is this happening?

When I order my input query, the problem goes away, leading me to suspect that BQ is repeating the query internally after it hits some limit and getting the rows back in a slightly different order, unless the query is explicitly ordered.

QueryJobConfiguration queryConfig = QueryJobConfiguration.newBuilder(query).setUseLegacySql(false).build();
JobId jobId = JobId.of(UUID.randomUUID().toString());
Job queryJob = bigQueryService.create(JobInfo.newBuilder(queryConfig).setJobId(jobId).build());
TableResult queryResults = queryJob.getQueryResults();

for (FieldValueList row : queryResult.iterateAll()) {
    // Process row
}

答案1

得分: 2

你在使用TableResult.iterateAll()从BigQuery读取数据时遇到的重复和丢失行的问题可能与分页处理方式以及查询结果的排序方式有关。正如你提到的，你能够获得正确的总行数，这表明问题不在于查询本身，而很可能是分页处理的问题。

TableResult.iterateAll()在内部处理分页以从查询结果中检索所有行。当结果集较大时，BigQuery可能会将结果分成多个页面。当你遍历这些页面时，如果存在重叠页面或者分页机制遇到任何问题，可能会出现一些行出现多次或被跳过的情况。

你还可以修改你的查询，包括一个基于唯一标识符或数据集中的稳定列的ORDER BY子句。这将确保结果的一致排序，并有助于消除重复的问题。在查询中明确指定排序应该提供更一致和可靠的结果。

英文:

The issue you are encountering with duplicate and missing rows when reading data from BigQuery using TableResult.iterateAll() is likely related to the way pagination is handled and how the query results are ordered. As you mentioned that you get the correct total number of rows, it indicates that the issue is not with the query itself but likely with the pagination.

TableResult.iterateAll() internally handles pagination to retrieve all the rows from the query results. When the result set is large, BigQuery may split the results into multiple pages. As you iterate through these pages, it's possible for some rows to appear more than once or be skipped altogether if there are overlapping pages or if the pagination mechanism encounters any issues.

You can also modify your query to include an ORDER BY clause based on a unique identifier or a stable column in your dataset. This will ensure consistent ordering of the results and may help eliminate the issue of duplicates. Using explicit ordering in your query should provide a more consistent and reliable result.

通过集体智慧和协作来改善编程学习和解决问题的方式。致力于成为全球开发者共同参与的知识库，让每个人都能够通过互相帮助和分享经验来进步。

BigQuery Java读取大型查询的可靠性

问题

答案1

TarsosDSP在Android上应用低通滤波器并保存为wav会产生不流畅的结果。

随机达特程序

Java Swing 的 DrawRect：创建新矩形会替换旧的尺寸。

Showing java.lang.ArrayIndexOutOfBoundsException: Index 2 out of bounds for length 2 in some test cases?

What's the correct way to type hint an empty list as a literal in python?

如何在Highcharts Gantt中更改本地化的星期名称

如何在同一个流中使用多个过滤器和映射函数？

如何使用Map/Set来将代码优化到O(n)？

.NET MAUI Android在GitHub Actions上构建失败，错误代码为1。

如何在Playwright视觉比较中屏蔽多个定位器？

在C++中，可以使用可变模板参数来检索类型的内部类型。

selenium.common.exceptions.StaleElementReferenceException: Message: stale element reference: stale element not found

Creating and opening a URL to log in to Website via Basic Auth with Robot Framework/Selenium (Python)

AG Grid 在上下文菜单中以大文本形式打开

发表评论