收集大表的统计信息

huangapple go评论53阅读模式
英文:

Gather stats on Large table

问题

我们有一个包含近800亿条记录的非常大的表,只有一个哈希分区。有没有关于如何在这个表上收集统计信息的建议,因为它需要一周的时间才能完成,或者在过程中出现"快照太旧"错误而失败。

我正在使用以下参数来收集统计信息:

BEGIN dbms_stats.gather_table_stats(ownname => 'ABC' tabname =>'table1', estimate_percent => DBMS_STATS.AUTO_SAMPLE_SIZE, cascade => DBMS_STATS.AUTO_CASCADE, method_opt => 'FOR ALL COLUMNS SIZE AUTO', degree => DBMS_STATS.DEFAULT_DEGREE); END;

Oracle版本:Oracle Database 19c Enterprise Edition Release 19.0.0.0.0 - Production Version 19.17.0.0.0

英文:

We have very large table with almost 80 Billion records with one hash paritition.
any suggestions how to gather stats on this table as it is taking a week to get completed or failed in between with snapshot too old error.

I am using below parameter to gather stats:

BEGIN  dbms_stats.gather_table_stats(ownname => ABC'  tabname =>'table1',  estimate_percent => 
DBMS_STATS.AUTO_SAMPLE_SIZE, 
cascade => DBMS_STATS.AUTO_CASCADE,  method_opt => 'FOR ALL COLUMNS SIZE AUTO',  degree => 
DBMS_STATS.DEFAULT_DEGREE); 
END;

Oracle version:
Oracle Database 19c Enterprise Edition Release 19.0.0.0.0 - Production
Version 19.17.0.0.0

答案1

得分: 1

有代表性的统计数据的原因是为了获得良好的基数估计,从而获得良好的计划。由于总会存在一些特殊情况,因此很难进行一般性的概括。但需要注意以下几点:

  • 在最近的版本中,统计信息收集的质量和性能都有一些改进。然而,您需要使用AUTO SAMPLE SIZE而不是estimate_percent来使用它们。
  • 如果您有足够的资源支持,16个并行度才会有用。
  • 单个哈希分区没有意义。哈希分区的数量应该是2的幂,并且应该在具有大量不同值的列上进行,最好是主键列。但首先应该回答“为什么要进行哈希分区”的问题。
  • 如果您的表或可以以其他方式进行分区(例如按范围或列表)并且可以受益于您的查询,则还可以使用增量统计信息。这将消除在整个表上(重新)收集统计信息的需要,只需在已更改的分区上进行。全局统计信息将然后从这些分区计算出。
英文:

The reason for having representitive statistics is to get good cardinality estimates, and subsequently good plans. It's also hard to make generalizations as there are always corner cases. But a few things to note:

  • There have been a number of improvements in both quality and performance of stats gathering in recent releases. However you need to use AUTO SAMPLE SIZE as opposed to estimate_percent to use them.
  • A parallel degree of 16 is only going to be useful if you have the resources to support this.
  • A single hash partition makes no sense. The number of hash partition should be a power of two, and be on a column with a large number of distinct values; preferably a primary key. But the question of "why hash partitioning" should be answered first
  • If your table is or can be partition in some other way (such as range or list) on something that could benefit your queries, then you can also then use incremental statistics. This would negate the need to (re)gather stats on the whole table, but only on the partition(s) that have changed. Global stats would then be computed from those.

答案2

得分: 0

有多种可供您使用的速度调整选项:(1) 降低估算百分比,(2) 明确使用并行处理,(3) 请求块抽样,(4) 跳过索引,(5) 通过将所有列的大小设置为1来禁用直方图:

BEGIN
  dbms_stats.gather_table_stats(ownname => 'ABC',
                                tabname => 'TABLE1',
                                estimate_percent => 0.1,
                                block_sample => true,
                                cascade => false,
                                method_opt => 'FOR ALL COLUMNS SIZE 1',
                                degree => 16);
END;

显然,您必须咨询您自己的数据库硬件配置,以确定适当的并行度。一些系统可以处理更多,并行度,而其他系统则较少。

我发现在大型表上,百分比可以设置得远低于1%,例如0.1%,效果非常好,比1快10倍,比默认值(100)快100倍。块抽样可能会在数据分布不均匀且使用极小百分比(如0.001%)时导致一些不正确的结果,但在许多情况下,将其设置为0.1%或更高的水平是可以的,它通过随机选择块而不是行来提高速度,减少了需要读取的块数。最后,直方图成本较高,通常情况下是不必要的。如果后来发现您需要在支持特定重要应用的SQL上使用某列的直方图,因为缺少直方图而导致出现问题,那么您可以明确请求在该列上设置SIZE AUTO,但将其他列保持为1。直方图的收集需要很长时间,因此最好在需要的情况下将其最小化。

英文:

There are various speed tweaks available to you: (1) lower the estimate percent, (2) explicitly use parallelism, (3) request block sampling, (4) bypass indexes, and (5) disallow histograms by setting SIZE 1 on all columns:

BEGIN  
  dbms_stats.gather_table_stats(ownname => ABC'  
                                tabname =>'TABLE1',  
                                estimate_percent => 0.1, 
                                block_sample => true,
                                cascade => false,  
                                method_opt => 'FOR ALL COLUMNS SIZE 1',  
                                degree => 16); 
END;

Obviously you have to consult your own database hardware sizing to determine an appropriate degree of parallelism. Some systems can handle more, others less.

I have found that on massive tables a percent well below 1%, something like 0.1% works just fine, and goes 10x faster than 1, which goes 100x faster than the default (100). Block sampling can lead to some incorrect results if your data is skewed badly and you use a tiny percent (like 0.001%), but in many cases it's fine at a 0.1% percent level or higher, and it does speed things up by randomly picking blocks rather than rows, which reduces the # of blocks that need to be read. Lastly, histograms are expensive, and often they are unneeded. If you find out later that you do need histograms on a particular column to support a particular important application SQL that is getting messed up because of the lack a histogram, then you can explicitly request SIZE AUTO on just that column, but keep all the others at 1. Histograms take a long time to gather, so they are best minimized to on an as-needed basis.

huangapple
  • 本文由 发表于 2023年3月3日 20:48:49
  • 转载请务必保留本文链接:https://go.coder-hub.com/75627292.html
匿名

发表评论

匿名网友

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen:

确定