Same IDs with different _routing values
Question
According to the Elasticsearch documentation, documents with the same _id can be indexed with different _routing values. The documentation therefore states that uniqueness of _id is not guaranteed, because such documents can end up on different shards (which appears to be a feature rather than a bug).
What about the scenario where two documents with the same _id, indexed with different routing values, end up on the same shard? Consider the requests below:
PUT test-index
{
  "settings": {
    "index": {
      "number_of_shards": 2
    }
  }
}

PUT test-index/_doc/1?routing=user1
{
  "title": "This is document number with routing=user1"
}

PUT test-index/_doc/1?routing=user2
{
  "title": "This is document number with routing=user2"
}

GET test-index/_search
The search query returns the following result:
{
  "took": 2,
  "timed_out": false,
  "_shards": {
    "total": 2,
    "successful": 2,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 1,
      "relation": "eq"
    },
    "max_score": 1,
    "hits": [
      {
        "_index": "test-index",
        "_id": "1",
        "_score": 1,
        "_routing": "user2",
        "_source": {
          "title": "This is document number with routing=user2"
        }
      }
    ]
  }
}
Why does the search response show only the document indexed with routing user2, even though the index has 2 shards? Both documents evidently ended up on the same shard, because according to the routing formula:
shard_num = (hash(_routing) % num_routing_shards) / routing_factor
where routing_factor = num_routing_shards / num_primary_shards
Assuming routing_factor is 1 in my case (i.e. 2 routing shards / 2 primary shards), the shard ID is simply the hash of the _routing value mod 2.
Using these routing values, we get the following shard IDs (the murmur3 hash can be tried out online):
murmur3("user1") % 2 = 3305849917 % 2 = shard 1
murmur3("user2") % 2 = 4180509323 % 2 = shard 1
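The arithmetic above can be reproduced with a minimal Python sketch. It applies the routing formula to the two hash values quoted in the question, which are taken as given rather than recomputed. The function name `shard_num` is hypothetical, and the routing-shard count of 2 is the question's own assumption, not something read from a live cluster.

```python
def shard_num(routing_hash: int, num_routing_shards: int, num_primary_shards: int) -> int:
    """Apply the routing formula quoted in the question to a precomputed hash."""
    routing_factor = num_routing_shards // num_primary_shards
    return (routing_hash % num_routing_shards) // routing_factor


# Hash values copied from the question (assumed correct, not recomputed here).
routing_hashes = {"user1": 3305849917, "user2": 4180509323}

# Assumption from the question: 2 routing shards and 2 primary shards.
shard_by_routing = {
    routing: shard_num(value, num_routing_shards=2, num_primary_shards=2)
    for routing, value in routing_hashes.items()
}

print(shard_by_routing)  # {'user1': 1, 'user2': 1}
```

Both hash values are odd, so under these assumptions both routing values map to shard 1.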
However, if both documents with the same _id but different _routing values end up on the same shard, why is only one of them returned?
Answer 1
Score: 1
TL;DR
Because they share the same ID on the same shard, the second request is not an insert, it is an update.
Evidence:
If you run the following commands in order:
PUT 76349386
{
  "settings": {
    "index": {
      "number_of_shards": 2
    }
  }
}
Then
PUT 76349386/_doc/1?routing=user1
{
  "title": "This is document number with routing=user1"
}
will give you:
{
  "_index": "76349386",
  "_id": "1",
  "_version": 1,
  "result": "created", <= Here it says the result of the operation was a creation
  "_shards": {
    "total": 2,
    "successful": 1,
    "failed": 0
  },
  "_seq_no": 0,
  "_primary_term": 1
}
But when you then run the second command:
PUT 76349386/_doc/1?routing=user2
{
  "title": "This is document number with routing=user2"
}
The response will look a little bit different:
{
  "_index": "76349386",
  "_id": "1",
  "_version": 2,
  "result": "updated", <= it is an update.
  "_shards": {
    "total": 2,
    "successful": 1,
    "failed": 0
  },
  "_seq_no": 1,
  "_primary_term": 1
}
The document with _id 1 has been updated.
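The behaviour described in this answer can be illustrated with a toy Python model. This is only a sketch under stated assumptions, not how Elasticsearch is implemented: each shard is modelled as a plain dictionary keyed by _id, the class `ToyIndex` is hypothetical, and shard selection uses the hash values quoted in the question with a simple mod 2.

```python
# Hash values quoted in the question; treated as given, not recomputed.
ROUTING_HASHES = {"user1": 3305849917, "user2": 4180509323}


class ToyIndex:
    """Toy model: every shard is a dict keyed by _id only (an assumption)."""

    def __init__(self, num_shards: int) -> None:
        self.shards = [dict() for _ in range(num_shards)]

    def index(self, doc_id: str, routing: str, source: dict) -> dict:
        # Pick the shard from the routing value, as in the question's arithmetic.
        shard = self.shards[ROUTING_HASHES[routing] % len(self.shards)]
        previous = shard.get(doc_id)
        version = 1 if previous is None else previous["_version"] + 1
        # Same _id on the same shard: the earlier entry is overwritten.
        shard[doc_id] = {"_id": doc_id, "_routing": routing,
                         "_version": version, "_source": source}
        return {"_id": doc_id, "_version": version,
                "result": "created" if previous is None else "updated"}

    def search(self) -> list:
        # Collect the documents from all shards, like a match-all search.
        return [doc for shard in self.shards for doc in shard.values()]


toy = ToyIndex(num_shards=2)
first = toy.index("1", "user1", {"title": "This is document number with routing=user1"})
second = toy.index("1", "user2", {"title": "This is document number with routing=user2"})

print(first["result"], second["result"])  # created updated
print([doc["_routing"] for doc in toy.search()])  # ['user2']
```

In this model the second write lands on the same shard under the same _id, so it replaces the first document and bumps the version, which matches the "updated" result and the single search hit shown above.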