英文:
Non-Clustered Indexes: Why is a B-Tree even needed?
问题
我理解为什么对于聚簇索引(维护排序顺序)而言会这样。对于非聚簇索引,你可以应用哈希函数来找到包含该行的页面。假设你有:
- 1519 行
- 2100 字节每行
- 64 千字节每页
那么页面的数量是 1519 * 2100 / (64 * 1024) = 48 [+ 余数]。
你可以使用 % 49 来找到包含该行的页面。
(是的,我知道这有点复杂,因为页面可能会溢出,因为哈希不会完全均匀,我们有时需要分割页面,类似于在哈希环中的现有节点之间添加新节点一半那样。不过,这似乎比B-Tree更好。)
英文:
I understand why for Clustered Indexes (maintain sorted order).
For Non-Clustered, though, couldn't you apply a hash function be applied to find the page containing the row? Let's say you have
- 1519 rows
- 2100 bytes per row
- 64 kilobytes per page
Then the number of pages is 1519 * 2100 / (64 * 1024) = 48 [+ remainder].
You can just % 49 the row id to find the page containing the row.
(Yes, I know it's a little more complicated because pages could overflow because hashing will not be perfectly uniform, that we will need to sometimes split pages similar to adding new node halfway between existing nodes in a hash ring, etc. Still, it seems better than B-Tree.)
答案1
得分: 1
为什么需要 B 树?
它不是必需的 - 但对于 SQL Server 来说,这是实现行存储磁盘表上的非聚集索引的方式。潜在地,他们本可以在这里提供 一个选项 用于哈希索引,但他们没有这样做。
Postgres 确实拥有 哈希索引。
SQL Server 中的 内存 OLTP 实现 也支持哈希索引或范围索引(BW 树)。
哈希索引仅适用于整个键上的等值操作。
B 树远不止这种简单的键/值类型用例,还可以支持范围搜索,或者对复合键的前导列进行搜索,或者以特定排序顺序返回结果。
英文:
> Why is a B-Tree even needed?
It isn't - but for SQL Server that is how non clustered indexes on rowstore on-disc tables is implemented. Potentially they could have provided an option for hash indexes here but they haven't.
Postgres does have hash indexes
The in memory OLTP implementation in SQL Server also supports hash indexes or range indexes (BW tree).
Hash indexes are only good for equality type operations on the whole key.
B trees are significantly more versatile than just this simple key/value type use case and can also support range searches or searches on leading columns of a composite key or bringing back results in a particular sort order.
答案2
得分: -1
索引很少被开发人员充分理解。这里是对索引历史和性质的一般解释...
在中世纪,地址的概念(例如到达特定地点,比如一座房子)非常模糊。农民不会自己移动,贵族会旅行,但只会去其他贵族,其住所是众所周知的。只有村庄有名称。这已经是一种索引的第一形式:村庄的名称加上地图可以更快地找到所需的位置!
然而,在17世纪的巴黎,人们确信有必要给大城市的街道命名以更好地辨认自己。1728年,人们在房屋上放置了牌匾,以指示每条街道的名称。地点的索引随后变得更加精确。
1805年,巴黎决定对房屋进行编号。不是随机的,而是按照特定顺序。奇数号在左边,偶数号在右边,从最靠近塞纳河的街道尽头开始。这个算法使得快速找到住所的确切位置成为可能。它仍然在使用中...
1964年添加邮政编码后,邮件递送变得更加迅速...
所有这些信息都无法用来找到特定位置。它们只是为了更快地找到它们。这就是索引的全部目的...
请注意,索引不止一种,而是有十几种。
在我们的地址案例中,带有纬度和经度的地理坐标也是一种地址形式,对于人类来说不太容易解释,但对于GPS来说非常容易操作!
同样的情况也发生在电话上。起初,人们称之为总机,并由当地操作员将您与您的通话对象所在的城市中央连接起来。然后远程操作员会询问您想与哪位订户通话...
随着电话的普及,号码出现了。操作员无法知道所有订户的姓名是不可能的。然后操作员被电机控制单元取代,然后被电子交换机取代... 今天一切都是在电子索引表中完成的...
实际上,我们被索引包围,而我们并不知道。您的地址、电子邮件、电话号码、社会保障代码都是索引形式。您绝对不需要它们,但它们只是为了节省时间而不可或缺。
在计算机科学中,对索引概念的严格定义如下:
特定的数据结构,包含冗余信息,特别组织用于加速某些搜索
因此,有许多类型的索引,更或多或少适用于我们希望快速找到的信息。
主要在数据库中使用的最常见算法有:
- B树
- 哈希
- 位图
B树 是原子数据的完美索引。此索引类型可以与许多运算符一起使用,例如:
- =(相等)
- <,<=,>,>=(不等式)
- BETWEEN(范围)
- DISTINCT / GROUP BY / PARTITION BY
- ORDER BY
- LIKE '模式%'
但对于以下运算符,它不是有效的:
- <>
- LIKE '%模式'
- LIKE '%模式%'
它可以用于一组列... 但它占用了大量空间,因为数据是原始数据的严格副本。
哈希 索引只适用于:
- =(相等)
- DISTINCT / GROUP BY / PARTITION BY
因为它将数据转换为不保留相同顺序的其他数据。但由于哈希替换了原始值,当原始数据有点长时,它可能占用更少的空间。只能索引一列。
位图 索引将不同值的集合替换为二进制字中的位。它可以像哈希索引一样使用:
- =(相等)
- DISTINCT / GROUP BY / PARTITION BY
如果不同值的数量很少,它甚至需要更少的空间。这种索引类型的另一个优点是,它可以轻松并行化(查询可以使用多个线程在位图索引中进行筛选)。同样只能索引一列。
您必须了解,任何索引都由特殊的数据结构和特定的算法组成,以使用这种特殊结构,并快速找到您正在查找的数据...
因此,每个索引都需要一个数据结构,而CLUSTERED或NONCLUSTERED SQL Server索引都使用了B树+,+是B树索引的一种改进,它在每个级别都添加了一个索引页的链接列表。
英文:
Indexes are rarely well understood by developers. Here is a general explanation of the history of indexes and their nature...
In the Middle Ages, the notion of address (to get to a specific place, a house for example) was very vague. The peasants did not move, alone, the aristocracy traveled, but only went to other aristocrats whose residence was well known. Only the villages had a name. It's already a first form of index: name of the village plus map provides quicker access to the desired location!
Nevertheless, in the 17th century in Paris, people were convinced that it was necessary to give a name to the streets of the big cities to better identify themselves. In 1728 plaques were placed on the houses to indicate the name of each street. The indexing of places then became more precise.
In 1805, it was decided in Paris to number the houses. Not randomly, in a specific order. The odd numbers on the left, the even numbers on the right and starting at the end of the street closest to the Seine. This algorithm then made it possible to quickly find the exact location of a dwelling. It is still used today...
By adding the postal code in 1964, the mail became faster to deliver...
All this information is useless to find a specific place. They are just essential to find it quickly. That's all the point of the index...
Note that there is not one type of index, but tens.
In the case of our addresses, a geographic coordinate with its latitude and longitude is also a form of address, less easy to interpret for humans, but very easy to manipulate for GPS!
The same thing happened on the phone. At first, a switchboard was called and a local operator connected you with the central of the city in which the recipient of your call was located. The remote operator then asked you for the name of the subscriber with whom you wanted to talk...
With the democratization of the telephone, numbers arrived. It was impossible for the operators to know all the names of the subscribers. Then the operators were replaced by electromechanical control units, then by electronic switches... Today everything is done electronically with index tables...
In fact we are surrounded by indexes without knowing it. Your address, your email, your phone numbers, your social security code are forms of index. You don't absolutely need them, but they are just essential so you don't waste time.
In computer science, a strict definition of the notion of index is as follows:
Specific data structure, containing redundant information, specially organized to speed up certain searches
There are therefore many types of index more or less adapted to the information that we want to find quickly.
Mainly the most common algorithms in databases are:
- BTree
- Hash
- Bitmap
BTree is a perfect index for atomic data. This index type can be used with many operators like :
- = (equality)
- <, <=, >, >= (inequality)
- BETWEEN (range)
- DISTINCT / GROUP BY / PARTITION BY
- ORDER BY
- LIKE 'pattern%'
but it is not effective on the following operators:
- <>
- LIKE '%pattern'
- LIKE '%pattern%'
And it can be used in a set of columns...
But it take a lot of place, because the data is a strict copy of the original one.
Hash indexes are only suitable for :
- = (equality)
- DISTINCT / GROUP BY / PARTITION BY
Because it transforms data into another data that does not preserve the same ordering
But because a hash replace the original value, it can take less space when the original data is a little bit long. Only one column can be indexed.
Bitmap indexes replace the collections of the differents values by a bit into a binary word. It can be used the same ways as hash indexes :
- = (equality)
- DISTINCT / GROUP BY / PARTITION BY
And takes even less space if the number of different values is small. One more advantage of this index type, is that it can easiliy be parallelized (the query can use multiple threads to filter in a bitmap index). ALso only one column can be indexed.
You must understand that any indexes is compound of a special data structure and a specific algorithm, to use this special structure, and find quickly the data you are searching...
So every index needs a data structure and CLUSTERED or NONCLUSTERED SQL Server indexes both uses a BTree+, + is a refinement of BTree index that add a linked list of index pages at evry level.
通过集体智慧和协作来改善编程学习和解决问题的方式。致力于成为全球开发者共同参与的知识库,让每个人都能够通过互相帮助和分享经验来进步。
评论