2023年5月8日 00:03:37go评论122阅读模式

英文:

Joining two tables with different sizes takes hours to execute

问题

以下是翻译的内容：

这是情景：
我们正在使用Azure Synapse Analytics，具有100个DTU，我们有多个表。
所有表都是循环分布和分簇列存储，因为数据每天加载一次，我们希望它能够快速运行。
所有这些表都很大，有50多个列和数百万条记录（约30万条）。
有一个名为load_table的表，其中包含有关所有表的最后加载的信息。

SQL查询1：

SELECT load_time FROM load_table WHERE table = 't2';

当我在表上执行此查询时，它会为该表给出一个时间戳作为输出。格式如下："YYYY-MM-DD TT"。此查询通常在5-10秒内正常运行。

SQL查询2：

SELECT top 100 * FROM t2;

t2是包含数百万条记录的大型事实表，具有50多列。当我运行此查询时，需要2-3分钟。

问题出现在我将这两个表连接在一起时，执行时间需要2-3小时。
我希望结果的第一列是加载时间，然后是第二个事实表的所有列。

连接方式如下：

SELECT t1.load_time, t2.* FROM
(SELECT load_time FROM load_table WHERE table = 't2') as t1 INNER JOIN t2
on 1 = 1;

输出如下所示。

我应该如何以另一种方式实现这个结果？

我期望以下结果

我已检查事实表的统计信息，它们已更新。
我已检查偏斜，一切正常。

英文:

Here is the scenario:
we are using Azure Synapse analytics with 100 DTU, we have several tables.
All tables are round robin distributed and clustered columnstore because data is loaded once in a day and we want it to be quick.
All these tables are big having 50+ columns and millions of records (around 30m)
There is one load_table table which contains information about last load of all the tables.

SQL Query1:

SELECT load_time FROM load_table WHERE table = &#39;t2&#39;;

when I execute this query on table, this gives me one timestamp as output for that table. format is like this "YYYY-MM-DD TT". This query normally runs fine within 5-10 secs.

SQL Query2:

SELECT top 100 * FROM t2;

t2 is big fact table containing millions of records having 50+ cols. When I run this query it takes 2-3 mins.

Problem occurs when I join these two tables and it takes 2-3 hrs to execute.
I want the result like loadtime as first column and all columns from second fact table.

joining is like below


SELECT t1.load_time, t2.* FROM
(SELECT load_time FROM load_table WHERE table = &#39;t2&#39;) as t1 INNER JOIN t2
on 1 = 1;

Output look like below.

How can I achieve this result another way?

I am expecting below results

I have checked statistics of fact table those are updated
I have checked skew and it is fine

答案1

得分: 1

你的连接条件有点特殊。
如果load_table每个表只返回一个值（'t2'），你可以尝试这样做：

SELECT TOP 100 
t1.load_time,
t2.[fact.col1],[fact.col2],...,[fact.col8]
FROM load_table as t1 
CROSS APPLY t2 
WHERE t1.table='t2'

如果你使用存储过程，你可以先将值保存到变量中：

DECLARE @TableTimestamp datetime;
SELECT TOP 1 @TableTimestamp = load_time FROM load_table;
SELECT TOP 100
@TableTimestamp AS LoadTime,
t2.[fact.col1],[fact.col2],...,[fact.col8]
FROM t2

英文:

your join-condition is a bit...special.
If your load_table only returns one value per table ('t2'), then you can just try this:

SELECT TOP 100 
t1.load_time,
t2.[fact.col1],[fact.col2],...,[fact.col8]
FROM load_table as t1 
CROSS APPLY t2 
WHERE t1.table=&#39;t2&#39;

and if you use a stored procedure, you can first save the value to a variable:

DECLARE @TableTimestamp datetime;
SELECT TOP 1 @TableTimestamp = load_time from load_table
;
SELECT TOP 100
@TableTimeStamp as LoadTime
,t2.[fact.col1],[fact.col2],...,[fact.col8]
FROM t2
</details>
# 答案2
**得分**: 0
当我在表上执行此查询时，它会给我一个时间戳作为输出结果。
当它只返回一行一列时，为什么不将其存储在变量中。
声明 @load_time datetime
SELECT @load_time=load_time FROM load_table WHERE table = 't2';
现在，
SELECT @load_time as load_time, t2.* FROM t2 (NOLOCK)
i) 不要使用 `*` 并明确提及您在输出中需要的所有列名。
ii) 只有在不存在 `READ UNCommitted` 的可能性或者不重要的情况下才使用 `nolock`。
iii) 为什么要使用数百万条数据，为什么不使用分页或 `top` 子句来限制结果集。
iv) 当然，我们至少需要：表和索引的定义以进行进一步的调查。同时分享您的实际查询。
如果您简单地使用
    select * from t2
其中包含数百万行，它将总是需要时间。
此外，
`select top 100 * from t2` 也需要时间，因为它没有 order by 子句，SQL 优化器不确定按哪个列排序以获取这些 100 行。
因此，您必须共享表结构以及数据类型。另外，如果表上有任何索引或唯一键定义，也请一并分享。
然后，我们可以根据需要进行分页以获取所选行。
<details>
<summary>英文:</summary>
&gt; when I execute this query on table, this gives me one timestamp as output for that table
when it return only one row and one column then why not take this in variable.
Declare @load_time datetime
SELECT @load_time=load_time FROM load_table WHERE table = &#39;t2&#39;;
Now,
SELECT @load_time as load_time, t2.* FROM t2 (NOLOCK)
i) DO not use `*` and  mention all the column name which you need in your output.
ii) Use `nolock` only when there is no chance of `READ UNCommitted` or it do not matter.
iii) Why you will use million of data,why not limit your resultset by using Pagination or `top` clause.
iv) Of course , we need at a minimum: the table and index definition for further investigation.also share your real query
 if you simply use 
    select * from t2
 
which has millions of rows it will always take time.
Also,
`select top 100 * from t2` will also take time because it is without order by clause,sql optimizer is not sure to order on which column to get those 100 rows.
So you have to share table scheme along with data type.Also if there is any index on table or unique key define.
Then accordingly we can do Paging to get only selected rows.
</details>

通过集体智慧和协作来改善编程学习和解决问题的方式。致力于成为全球开发者共同参与的知识库，让每个人都能够通过互相帮助和分享经验来进步。

连接两个大小不同的表格需要数小时才能执行。

问题

答案1

SQL Server – 为大表创建多列索引？还是为每个报告创建一个索引？

无法在vsCode中使用Azure Developer CLI（azd），即使安装了扩展也不行。

.NET 7 Azure Functions：主机尚未启动

需要在Azure Data Factory中生成具有动态键的JSON文件。

如何在Playwright视觉比较中屏蔽多个定位器？

在C++中，可以使用可变模板参数来检索类型的内部类型。

selenium.common.exceptions.StaleElementReferenceException: Message: stale element reference: stale element not found

Creating and opening a URL to log in to Website via Basic Auth with Robot Framework/Selenium (Python)

AG Grid 在上下文菜单中以大文本形式打开

What's the correct way to type hint an empty list as a literal in python?

如何在Highcharts Gantt中更改本地化的星期名称

如何在同一个流中使用多个过滤器和映射函数？

如何使用Map/Set来将代码优化到O(n)？

.NET MAUI Android在GitHub Actions上构建失败，错误代码为1。