Question about table partitioning in BigQuery

Question

I am facing a problem with my partitioned table, and any help would be appreciated. Let's assume I have a table called A that is partitioned by the date field A_date and contains billions of rows. I also have a table called B, which has a date field B_date and only a few hundred rows. For the purpose of this example, let's say all the values in B_date are "2023-05-01".

If I perform the following query:

SELECT * FROM A

BigQuery (BQ) states that it will process more than 1TB of data, which is expected given the large number of rows.

If I perform this query instead:

SELECT * FROM A WHERE A_date >= "2023-05-01"

BQ states that it will process less than 1TB of data. This is also expected since I'm using the A_date partitioning and filtering by a specific date.

However, if I execute this query:

SELECT * FROM A WHERE A_date >= (SELECT B_date FROM B LIMIT 1)

BQ states that the query will process the same amount of data as if there were no WHERE condition, even though the result of "SELECT B_date FROM B LIMIT 1" is, in my case, "2023-05-01".

Initially, I thought it might be an estimation issue with BQ. However, I ran the last two queries and checked the "query results" tab, which also showed a difference in the bytes processed.

Could someone help me with this issue? I'm trying to reduce the costs of my queries, but I'm unable to solve this problem.

Answer 1

Score: 2

This has to do with partition pruning.
When you filter on a literal date, BigQuery knows before the query runs which partitions it needs, so it scans only those. When the filter value comes from a subquery, the value is not known until the query executes, so BigQuery treats the filter as dynamic and scans every partition.

This is described in the documentation:
https://cloud.google.com/bigquery/docs/querying-partitioned-tables#use_a_constant_filter_expression
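To illustrate the distinction that page draws, here is a rough sketch using the tables from the question. The DATE_SUB variant is my own addition and assumes A_date is a DATE column. As far as I understand the documentation, expressions built from literals and functions such as CURRENT_DATE() count as constant, while a subquery does not.

```sql
-- Constant filter (a literal): BigQuery can prune partitions before running.
SELECT * FROM A WHERE A_date >= DATE '2023-05-01';

-- Constant expression (assumed example, not from the question):
-- the value is fixed for the whole query, so pruning should still apply.
SELECT * FROM A WHERE A_date >= DATE_SUB(CURRENT_DATE(), INTERVAL 7 DAY);

-- Dynamic filter (a subquery): the value is only known at execution time,
-- so all partitions of A are scanned.
SELECT * FROM A WHERE A_date >= (SELECT B_date FROM B LIMIT 1);
```

Comparing the bytes-processed figure for each of these is a quick way to confirm which filters actually prune on your table.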

Instead, you can declare a variable, assign the date to it, and filter on that variable. This should give you results similar to

SELECT * FROM A WHERE A_date >= "2023-05-01"

To make use of the partitioning, run the following as a single script:

-- DATE assumes A_date and B_date are DATE columns; use their actual type otherwise.
DECLARE dateB DATE;
SET dateB = (SELECT B_date FROM B LIMIT 1);
SELECT * FROM A WHERE A_date >= dateB;
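
A slightly more compact variant of the same idea, offered as a sketch rather than a tested solution: DECLARE ... DEFAULT infers the variable's type from the expression, which sidesteps having to name the type by hand. Replacing LIMIT 1 with MIN(B_date) is my assumption about the intent; LIMIT 1 without ORDER BY returns an arbitrary row, which only matters when B_date holds more than one distinct value.

```sql
-- Table and column names (A, B, A_date, B_date) are taken from the question.
-- The variable's type is inferred from B_date, so no explicit type is needed.
-- MIN() is an assumed replacement for LIMIT 1 to make the result deterministic.
DECLARE dateB DEFAULT (SELECT MIN(B_date) FROM B);

-- The variable is resolved before this statement runs, so the filter
-- behaves like a constant and partition pruning can apply.
SELECT *
FROM A
WHERE A_date >= dateB;
```

Both statements need to be submitted together as one script, since the variable only exists for the duration of that script.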

huangapple
  • Posted on May 17, 2023 at 18:43:21
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76271226.html