May 17, 2023 18:43:21

Question about table partitioning in BigQuery

Question

I am facing a problem with my partitioned table, and any help would be appreciated. Let's assume I have a table called A that is partitioned by the date field A_date, and this table contains billions of rows. Additionally, I have another table called B, which has a date field B_date and only a few hundred rows. For the purpose of this example, let's say all the values in B_date are "2023-05-01".

If I perform the following query:

SELECT * FROM A

BigQuery (BQ) states that it will process more than 1TB of data, which is expected given the large number of rows.

If I perform this query instead:

SELECT * FROM A WHERE A_date >= "2023-05-01"

BQ states that it will process less than 1TB of data. This is also expected since I'm using the A_date partitioning and filtering by a specific date.

However, if I execute this query:

SELECT * FROM A WHERE A_date >= (SELECT B_date FROM B LIMIT 1)

BQ states that the query will process the same amount of data as if I weren't using a WHERE condition, even though the result of "SELECT B_date FROM B LIMIT 1" is, in my case, the same as "2023-05-01".

Initially, I thought it might be an estimation issue with BQ. However, I ran both queries (the last two) and checked the "query results" tab, which also showed a difference in the bytes processed.

Could someone help me with this issue? I'm trying to reduce the costs of my queries, but I'm unable to solve this problem.

Answer 1

Score: 2

This comes down to partition pruning.
When you filter on a literal date, BigQuery knows before the query runs which partitions can match, so it reads only those. When the filter value comes from a subquery, it is not a constant expression: the value is only known at execution time, so BigQuery scans every partition.

This is described in the documentation:
https://cloud.google.com/bigquery/docs/querying-partitioned-tables#use_a_constant_filter_expression
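As a rough illustration of that distinction, here are three filters on the table from the question. This is only a sketch: it assumes A_date and B_date are DATE columns, and the 7-day window and the MAX aggregate are made up for the example.

```sql
-- Constant filter: the bound is a literal, so the matching partitions
-- can be determined before any data is read.
SELECT * FROM A WHERE A_date >= DATE '2023-05-01';

-- Still a constant expression: it is built only from literals and
-- functions that are evaluated once for the whole query.
SELECT * FROM A WHERE A_date >= DATE_SUB(CURRENT_DATE(), INTERVAL 7 DAY);

-- Dynamic filter: the bound depends on the contents of another table,
-- so it is not known up front and every partition of A is scanned.
SELECT * FROM A WHERE A_date >= (SELECT MAX(B_date) FROM B);
```

Comparing the bytes processed for each statement in the console, as done in the question, should show whether pruning applied.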

If you first store the date in a variable and then filter on that variable, the query is pruned in the same way as

SELECT * FROM A WHERE A_date >= "2023-05-01"

To take advantage of partition pruning, you can do the following:

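-- The variable type must match the type of A_date and B_date;
-- declare it as DATE instead if they are DATE columns.
DECLARE dateB TIMESTAMP;
SET dateB = (SELECT B_date FROM B LIMIT 1);
SELECT * FROM A WHERE A_date >= dateB;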
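
A slightly more compact variant of the same idea is sketched below. It assumes A_date and B_date are DATE columns, and it swaps LIMIT 1 for MIN, which is an illustrative choice rather than part of the original answer. LIMIT 1 without ORDER BY returns an arbitrary row. That is harmless in the question because every B_date is identical, but an aggregate makes the result deterministic if the values ever differ.

```sql
-- Assumption: A_date and B_date are DATE columns.
-- DECLARE ... DEFAULT sets the variable in a single statement.
DECLARE dateB DATE DEFAULT (SELECT MIN(B_date) FROM B);

-- By the time this statement runs, dateB already holds a fixed value,
-- so the filter can be treated like a constant for pruning.
SELECT * FROM A WHERE A_date >= dateB;
```

Both statements need to be submitted together as one multi-statement query so that the variable is still in scope for the SELECT.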
