问题

我有一个进程，每天手动在一组Redshift表上运行“VACUUM”以保持查询性能一致。但有时，对一个表进行清理需要大约2小时。这正常吗？我在考虑清理运行时间对并发查询的影响。

英文:

I have a process that runs VACUUM manually on a list of redshift tables on a daily basis to maintain consistent query performance. But sometimes, vacuuming one table takes about 2 hours. Is this normal? I was thinking of the effect of the vacuum runtime on the concurrent queries.

答案1

得分: 1

为了回答这个问题，我需要解释一些基本的Redshift知识 - 块、表、切片、排序、VACUUM。

我写了一份介绍性文件，因为在写了许多相同的信息n次之后，你会开始想“也许我应该制作一个PDF，这样我就可以直接给人们”。

https://www.redshiftresearchproject.org/white_papers/downloads/introduction_to_the_fundamentals_of_amazon_redshift.html

有一个关于VACUUM的特定部分，但你真的需要先阅读前面的内容才能理解它。

https://www.redshiftresearchproject.org/white_papers/downloads/introduction_to_the_fundamentals_of_amazon_redshift.html#vacuum

英文:

To answer this, I need to explain a bunch of foundational Redshift knowledge - blocks, tables, slices, sorting, VACUUM.

I've written an introductory document, because after having written out much the same information n hundred times, you start to think "maybe I should produce a PDF of this so I can just give this to people".

https://www.redshiftresearchproject.org/white_papers/downloads/introduction_to_the_fundamentals_of_amazon_redshift.html

There's a particular section about VACUUM, but you really need to read the stuff before that to understand it.

https://www.redshiftresearchproject.org/white_papers/downloads/introduction_to_the_fundamentals_of_amazon_redshift.html#vacuum

答案2

得分: 1

这可能是正常的，但需要更多信息才能确定。Redshift的vacuum会对表进行排序并回收空间。排序过程受到阈值百分比的限制，该百分比确定是否需要进行排序。默认阈值为95%，所以如果表中95%或更多的块被标记为已排序，则排序将被跳过。如果跳过排序，vacuum将运行得更快。

如果这是一个大表，在超过5%的块发生更改后进行排序可能需要很多工作，可能需要几个小时甚至更长时间。由于您定期运行vacuum，您可能希望每次运行时都对表进行排序，以免工作积压。您可以通过设置阈值来实现这一点。如果将其设置为100%，则Redshift将在每次运行vacuum时重新对表进行排序。

VACUUM

to 100 percent;

英文:

This is likely normal but some more info will be needed to be sure. Redshift vacuum sorts the table and reclaims space. The sort process is limited by the threshold percentage which determines if sorting needs to be done. The default for this is 95% so if 95% or more of the blocks in the table are marked as sorted the sorting will be skipped. If skipped the vacuum will run much faster.

If this is a large table sorting after more than 5% of blocks have been changed may be a lot of work and take a few hours or more. Since you are running vacuum regularly you likely want it to sort the table each time so that the work doesn’t pile up. You can do this by setting the threshold. If you set it to 100 percent then Redshift will resort the table every time vacuum runs.