Composite Indexes, the “Include” Keyword, and How They Work
Question
In SQL Server (and most other relational databases), a "composite index" is an index with multiple key columns. Let's say we have this query that gets run a lot, and we want to create a covering index to speed it up:
SELECT a, b FROM MyTable WHERE c = @val1 AND d = @val2
These composite indexes would all cover this query:
CREATE INDEX ix1 ON MyTable (c, d, a, b)
CREATE INDEX ix2 ON MyTable (c, d) INCLUDE (a, b)
CREATE INDEX ix3 ON MyTable (d) INCLUDE (a, b, c)
CREATE INDEX ix4 ON MyTable (c) INCLUDE (a, b, d)
But apparently they don't perform equally. According to Erland Sommarskog (Microsoft MVP), the first two are faster than the 3rd and 4th, and the 4th is faster than the 3rd.
He goes on to explain:
> ix2 is the "best" index, because a and b will not take up space in the higher levels of the index tree. Also, if a or b are updated, in ix2 there can be no page splits or similar as the index tree is unaffected.
However, I am having a hard time grasping what exactly is going on. I have general knowledge of B-tree indexes and how they work, but I don't understand the logic behind composite keys. For example:
CREATE INDEX ix1 ON MyTable (c, d, a, b)
Does the order of the columns here matter? If so, why? Also:
CREATE INDEX ix2 ON MyTable (c, d) INCLUDE (a, b)
What is the difference between this composite key and the one above? I don't understand what difference "INCLUDE" makes.
Note: I know there are a lot of posts on Composite Keys, but I believe my last two questions are specific enough to not be a duplicate.
Answer 1
Score: 1
CREATE INDEX ix1 ON MyTable (c, d, a, b)
> Does the order of the columns here matter? If so, why? Also;
Yes, the order is very important when creating an index, because the index entries are sorted by the first column, then by the second within each value of the first, and so on from left to right. For the optimizer to seek on this index, the query therefore has to filter on c, which is the leading column (the "opener") of this set.
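To make the ordering argument concrete, here is a minimal sketch in Python rather than T-SQL. It is a toy model, not how SQL Server stores pages: the index is assumed to be nothing more than a sorted list of `(c, d)` tuples, and the data and the helper names `positions` and `is_contiguous` are made up for illustration.

```python
# Toy model (an assumption, not SQL Server internals):
# an index on (c, d) is the rows sorted by c first, then by d.
rows = [(c, d) for c in range(3) for d in range(3)]  # hypothetical data
index_cd = sorted(rows)  # key order (c, d)

def positions(index, predicate):
    """Return the positions in the sorted index whose key satisfies predicate."""
    return [i for i, key in enumerate(index) if predicate(key)]

def is_contiguous(pos):
    """True when the positions form one unbroken range (a single seek)."""
    return pos == list(range(pos[0], pos[-1] + 1))

by_c = positions(index_cd, lambda k: k[0] == 1)  # WHERE c = 1
by_d = positions(index_cd, lambda k: k[1] == 1)  # WHERE d = 1

print(by_c, is_contiguous(by_c))  # [3, 4, 5] True
print(by_d, is_contiguous(by_d))  # [1, 4, 7] False
```

Rows matching the leading column sit next to each other, so one seek finds them all. Rows matching only the second column are scattered through the whole index, so finding them means reading all of it.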
CREATE INDEX ix2 ON MyTable (c, d) INCLUDE (a, b)
> What is the difference between this composite key and the one above? I don't understand what difference "INCLUDE" makes.
Keep in mind that every additional key column makes the index less efficient. So if you know that more than 80% of your queries will seek only by c and d and not by a and b, but you still need a and b in your SELECT list (not in the WHERE clause), you should INCLUDE them so that they are stored only in the leaf, at the last level of the index.
There are better explanations than mine so feel free to look at them:
https://stackoverflow.com/questions/5108651/include-equivalent-in-oracle -> INCLUDE
https://stackoverflow.com/questions/2292662/how-important-is-the-order-of-columns-in-indexes -> ORDER in INDEX set
Answer 2
Score: 1
> Does the order of the columns here matter?
Considering only the query in your question with 2 equality predicates, the order of the composite index key columns doesn't matter as long as both are the leftmost key columns of the composite index. Any of the covering indexes below will optimize this query:
CREATE INDEX ix1 ON MyTable (c, d, a, b);
CREATE INDEX ix2 ON MyTable (c, d) INCLUDE (a, b);
CREATE INDEX ix3 ON MyTable (d, c, a, b);
CREATE INDEX ix4 ON MyTable (d, c, b, a);
CREATE INDEX ix5 ON MyTable (d, c) INCLUDE (a, b);
That said, the stats histogram contains only the leftmost index key column so the general guidance is to specify the most selective column first to improve row count estimates and execution plan quality. This consideration is more important for non-trivial queries where the optimizer has many choices and row count estimates are an important factor in choosing the best plan.
Another consideration for key order, which may conflict with the above general guidance, is when the index supports different queries and only some of the key columns are specified (e.g. SELECT a, b FROM MyTable WHERE d = @val2;). In that case, it would be better to specify d as the leftmost column regardless of selectivity in order to allow a single index to optimize multiple queries instead of creating a separate index to optimize the second query.
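As a rough illustration of one index serving both queries, the sketch below models an index keyed on `(d, c)` as a sorted Python list and does a prefix seek with `bisect`. The row values and the `seek` helper are hypothetical, and a real B-tree would navigate its pages rather than build a separate list of keys.

```python
from bisect import bisect_left, bisect_right

# Hypothetical rows stored as (d, c, a, b), kept sorted by the key (d, c).
index_dc = sorted([
    (1, 10, "a1", "b1"),
    (1, 20, "a2", "b2"),
    (2, 10, "a3", "b3"),
    (2, 20, "a4", "b4"),
])

def seek(index, prefix):
    """Return rows whose leading key columns equal prefix, via binary search."""
    n = len(prefix)
    # Leading columns only; still sorted. Built here for clarity only.
    keys = [row[:n] for row in index]
    lo = bisect_left(keys, prefix)
    hi = bisect_right(keys, prefix)
    return index[lo:hi]

# WHERE d = 2 (uses only the leftmost key column)
only_d = seek(index_dc, (2,))
# WHERE c = 10 AND d = 2 (uses both key columns)
both = seek(index_dc, (2, 10))

print(only_d)  # [(2, 10, 'a3', 'b3'), (2, 20, 'a4', 'b4')]
print(both)    # [(2, 10, 'a3', 'b3')]
```

With d as the leading key, both predicates resolve to one contiguous range. If the key were `(c, d)` instead, the d-only query could not be answered by a prefix seek.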
> What is the difference between this composite key and the one above? I
> don't understand what difference "INCLUDE" makes.
Included columns are not key columns. Key columns are maintained in logical order at every level throughout the b-tree whereas included columns are present only in the b-tree leaf nodes and not ordered. Consequently, the specified order of included columns does not matter. The only purpose of included columns is to help cover queries without adding them as key columns and incurring the associated overhead.
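The difference between key and included columns can be sketched with a toy two-level model in Python. This is an assumption-laden simplification: `build_index` is a made-up helper, `upper` stands in for the navigation levels of the b-tree, `leaf` stands in for the leaf level, and real details such as pages and row locators are ignored.

```python
def build_index(rows, key_cols, include_cols=()):
    """Return (upper, leaf) for a toy index over a list of dict rows."""
    entries = [
        (tuple(r[k] for k in key_cols), tuple(r[i] for i in include_cols))
        for r in rows
    ]
    leaf = sorted(entries, key=lambda e: e[0])  # ordered by key columns only
    upper = [key for key, _ in leaf]            # navigation levels hold keys only
    return upper, leaf

# Hypothetical table contents.
rows = [
    {"c": 1, "d": 1, "a": 50, "b": 7},
    {"c": 1, "d": 2, "a": 10, "b": 8},
    {"c": 2, "d": 1, "a": 30, "b": 9},
]

upper1, leaf1 = build_index(rows, ("c", "d", "a", "b"))      # like ix1
upper2, leaf2 = build_index(rows, ("c", "d"), ("a", "b"))    # like ix2

rows[0]["a"] = 99  # UPDATE ... SET a = 99 for one row

upper1_new, _ = build_index(rows, ("c", "d", "a", "b"))
upper2_new, leaf2_new = build_index(rows, ("c", "d"), ("a", "b"))

print(upper1 == upper1_new)  # False: a is part of the key, so the key changed
print(upper2 == upper2_new)  # True: only the leaf payload changed
```

In this model the ix2-style index carries two values per entry in its upper level instead of four, and updating a touches only the leaf entry. That is the intuition behind the quoted claim that a and b take no space in the higher levels and cannot cause the tree to be reorganized.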