连接两个大小不同的表格需要数小时才能执行。

huangapple go评论94阅读模式
英文:

Joining two tables with different sizes takes hours to execute

问题

以下是翻译的内容:

这是情景:
我们正在使用Azure Synapse Analytics,具有100个DTU,我们有多个表。
所有表都是循环分布和分簇列存储,因为数据每天加载一次,我们希望它能够快速运行。
所有这些表都很大,有50多个列和数百万条记录(约30万条)。
有一个名为load_table的表,其中包含有关所有表的最后加载的信息。

SQL查询1:

SELECT load_time FROM load_table WHERE table = 't2';

当我在表上执行此查询时,它会为该表给出一个时间戳作为输出。格式如下:"YYYY-MM-DD TT"。此查询通常在5-10秒内正常运行。

SQL查询2:

SELECT top 100 * FROM t2;

t2是包含数百万条记录的大型事实表,具有50多列。当我运行此查询时,需要2-3分钟。

问题出现在我将这两个表连接在一起时,执行时间需要2-3小时。
我希望结果的第一列是加载时间,然后是第二个事实表的所有列。

连接方式如下:

SELECT t1.load_time, t2.* FROM
(SELECT load_time FROM load_table WHERE table = 't2') as t1 INNER JOIN t2
on 1 = 1; 

输出如下所示。
连接两个大小不同的表格需要数小时才能执行。

我应该如何以另一种方式实现这个结果?

我期望以下结果
连接两个大小不同的表格需要数小时才能执行。

我已检查事实表的统计信息,它们已更新。
我已检查偏斜,一切正常。

英文:

Here is the scenario:
we are using Azure Synapse analytics with 100 DTU, we have several tables.
All tables are round robin distributed and clustered columnstore because data is loaded once in a day and we want it to be quick.
All these tables are big having 50+ columns and millions of records (around 30m)
There is one load_table table which contains information about last load of all the tables.

SQL Query1:

SELECT load_time FROM load_table WHERE table = 't2';

when I execute this query on table, this gives me one timestamp as output for that table. format is like this "YYYY-MM-DD TT". This query normally runs fine within 5-10 secs.

SQL Query2:

SELECT top 100 * FROM t2;

t2 is big fact table containing millions of records having 50+ cols. When I run this query it takes 2-3 mins.

Problem occurs when I join these two tables and it takes 2-3 hrs to execute.
I want the result like loadtime as first column and all columns from second fact table.

joining is like below


SELECT t1.load_time, t2.* FROM
(SELECT load_time FROM load_table WHERE table = 't2') as t1 INNER JOIN t2
on 1 = 1; 

Output look like below.
连接两个大小不同的表格需要数小时才能执行。

How can I achieve this result another way?

I am expecting below results
连接两个大小不同的表格需要数小时才能执行。

I have checked statistics of fact table those are updated
I have checked skew and it is fine

答案1

得分: 1

你的连接条件有点特殊。
如果load_table每个表只返回一个值('t2'),你可以尝试这样做:

SELECT TOP 100 
t1.load_time,
t2.[fact.col1],[fact.col2],...,[fact.col8]
FROM load_table as t1 
CROSS APPLY t2 
WHERE t1.table='t2'

如果你使用存储过程,你可以先将值保存到变量中:

DECLARE @TableTimestamp datetime;
SELECT TOP 1 @TableTimestamp = load_time FROM load_table;
SELECT TOP 100
@TableTimestamp AS LoadTime,
t2.[fact.col1],[fact.col2],...,[fact.col8]
FROM t2
英文:

your join-condition is a bit...special.
If your load_table only returns one value per table ('t2'), then you can just try this:

SELECT TOP 100 
t1.load_time,
t2.[fact.col1],[fact.col2],...,[fact.col8]
FROM load_table as t1 
CROSS APPLY t2 
WHERE t1.table='t2'

and if you use a stored procedure, you can first save the value to a variable:

DECLARE @TableTimestamp datetime;
SELECT TOP 1 @TableTimestamp = load_time from load_table
;
SELECT TOP 100
@TableTimeStamp as LoadTime
,t2.[fact.col1],[fact.col2],...,[fact.col8]
FROM t2

</details>



# 答案2
**得分**: 0

当我在表上执行此查询时,它会给我一个时间戳作为输出结果。

当它只返回一行一列时,为什么不将其存储在变量中。
声明 @load_time datetime

SELECT @load_time=load_time FROM load_table WHERE table = 't2';

现在,

SELECT @load_time as load_time, t2.* FROM t2 (NOLOCK)

i) 不要使用 `*` 并明确提及您在输出中需要的所有列名。

ii) 只有在不存在 `READ UNCommitted` 的可能性或者不重要的情况下才使用 `nolock`。

iii) 为什么要使用数百万条数据,为什么不使用分页或 `top` 子句来限制结果集。

iv) 当然,我们至少需要:表和索引的定义以进行进一步的调查。同时分享您的实际查询。

如果您简单地使用

    select * from t2

其中包含数百万行,它将总是需要时间。
此外,
`select top 100 * from t2` 也需要时间,因为它没有 order by 子句,SQL 优化器不确定按哪个列排序以获取这些 100 行。

因此,您必须共享表结构以及数据类型。另外,如果表上有任何索引或唯一键定义,也请一并分享。

然后,我们可以根据需要进行分页以获取所选行。

<details>
<summary>英文:</summary>

&gt; when I execute this query on table, this gives me one timestamp as output for that table

when it return only one row and one column then why not take this in variable.
Declare @load_time datetime

SELECT @load_time=load_time FROM load_table WHERE table = &#39;t2&#39;;

Now,


SELECT @load_time as load_time, t2.* FROM t2 (NOLOCK)

i) DO not use `*` and  mention all the column name which you need in your output.

ii) Use `nolock` only when there is no chance of `READ UNCommitted` or it do not matter.

iii) Why you will use million of data,why not limit your resultset by using Pagination or `top` clause.

iv) Of course , we need at a minimum: the table and index definition for further investigation.also share your real query


 if you simply use 

    select * from t2

 
which has millions of rows it will always take time.
Also,
`select top 100 * from t2` will also take time because it is without order by clause,sql optimizer is not sure to order on which column to get those 100 rows.

So you have to share table scheme along with data type.Also if there is any index on table or unique key define.

Then accordingly we can do Paging to get only selected rows.



</details>



huangapple
  • 本文由 发表于 2023年5月8日 00:03:37
  • 转载请务必保留本文链接:https://go.coder-hub.com/76194957.html
匿名

发表评论

匿名网友

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen:

确定