Same IDs with different _routing values

Question

According to the Elasticsearch documentation, it is possible to index docs with the same _id under different _routing values. Hence, the documentation states that uniqueness of _id is not guaranteed, because these docs can end up on different shards (which appears to be a feature rather than a bug).

What about the scenario where two docs with the same _id, indexed with different routing values, end up on the same shard? Consider the requests below:

PUT test-index
{
  "settings": {
    "index": {
      "number_of_shards": 2
    }
  }
}

PUT test-index/_doc/1?routing=user1
{
  "title": "This is document number with routing=user1"
}

PUT test-index/_doc/1?routing=user2
{
  "title": "This is document number with routing=user2"
}

GET test-index/_search

The search query returns the following result:

{
  "took": 2,
  "timed_out": false,
  "_shards": {
    "total": 2,
    "successful": 2,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 1,
      "relation": "eq"
    },
    "max_score": 1,
    "hits": [
      {
        "_index": "test-index",
        "_id": "1",
        "_score": 1,
        "_routing": "user2",
        "_source": {
          "title": "This is document number with routing=user2"
        }
      }
    ]
  }
}

Why does the search response show only the doc routed with user2, even though the index has 2 shards? Both docs must have ended up on the same shard, because according to the routing formula:

shard_num = (hash(_routing) % num_routing_shards) / routing_factor
where routing_factor = num_routing_shards / num_primary_shards

In my case routing_factor is 1 (i.e. 2 routing shards / 2 primary shards), so the shard ID is basically the hash of the _routing value mod 2.

Using these routing values, we get the following shard IDs (the murmur3 hashes can be checked with an online murmur3 tool):

murmur3("user1") % 2 = 3305849917 % 2 = shard 1
murmur3("user2") % 2 = 4180509323 % 2 = shard 1
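
The arithmetic above can be sketched in a few lines of Python. This is only an illustration of the formula as quoted in the question: the function name `shard_num` is hypothetical, the hash values are the ones given above rather than recomputed, and `num_routing_shards = 2` is the question's own assumption, not a value read from a real cluster.

```python
def shard_num(routing_hash: int, num_routing_shards: int, num_primary_shards: int) -> int:
    """Apply the routing formula quoted in the question to a precomputed hash."""
    # routing_factor = num_routing_shards / num_primary_shards
    routing_factor = num_routing_shards // num_primary_shards
    # shard_num = (hash(_routing) % num_routing_shards) / routing_factor
    return (routing_hash % num_routing_shards) // routing_factor


# Hash values as stated in the question (taken on trust, not recomputed here).
HASHES = {"user1": 3305849917, "user2": 4180509323}

# Assumption from the question: 2 routing shards and 2 primary shards.
shards = {routing: shard_num(h, 2, 2) for routing, h in HASHES.items()}
print(shards)
```

Under these assumptions both routing values map to the same shard number, which is the situation the question asks about.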

However, if both docs with the same _id but different _routing values end up on the same shard, why is only one doc shown?

Answer 1

Score: 1

TL;DR

Because both docs share the same _id on the same shard, the second request is not an insert but an update.

Evidence:

Run the following commands in order:

PUT 76349386
{
  "settings": {
    "index": {
      "number_of_shards": 2
    }
  }
}

Then

PUT 76349386/_doc/1?routing=user1
{
  "title": "This is document number with routing=user1"
}

This returns:

{
  "_index": "76349386",
  "_id": "1",
  "_version": 1,
  "result": "created",  <= the result of the operation was a creation
  "_shards": {
    "total": 2,
    "successful": 1,
    "failed": 0
  },
  "_seq_no": 0,
  "_primary_term": 1
}

But when you run the second command:

PUT 76349386/_doc/1?routing=user2
{
  "title": "This is document number with routing=user2"
}

The response looks slightly different:

{
  "_index": "76349386",
  "_id": "1",
  "_version": 2,
  "result": "updated",  <= it is an update
  "_shards": {
    "total": 2,
    "successful": 1,
    "failed": 0
  },
  "_seq_no": 1,
  "_primary_term": 1
}

The document with _id 1 has been updated.
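
The behaviour shown in the two responses can be mimicked with a toy model. The sketch below is not Elasticsearch code: `ToyShard` is a hypothetical class that assumes each shard keeps its documents in a map keyed by _id alone, so the routing value is stored with the document but plays no part in deciding whether a write is a create or an update.

```python
from dataclasses import dataclass, field


@dataclass
class ToyShard:
    """Hypothetical stand-in for one shard: documents are keyed by _id only."""
    docs: dict = field(default_factory=dict)

    def index(self, doc_id: str, routing: str, source: dict) -> dict:
        previous = self.docs.get(doc_id)
        version = 1 if previous is None else previous["_version"] + 1
        # The routing value is stored with the document but is not part of the key.
        self.docs[doc_id] = {"_routing": routing, "_source": source, "_version": version}
        result = "created" if previous is None else "updated"
        return {"_id": doc_id, "_version": version, "result": result}


# Assumption taken from the question: both routing values resolve to shard 1.
shards = [ToyShard(), ToyShard()]
first = shards[1].index("1", "user1", {"title": "routing=user1"})
second = shards[1].index("1", "user2", {"title": "routing=user2"})

print(first["result"], second["result"])
print(len(shards[1].docs), shards[1].docs["1"]["_routing"])
```

In this model the second write replaces the first because the key is identical, which is consistent with the search returning a single hit carrying _routing user2.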

huangapple
  • Published on 2023-05-28 07:15:15
  • Original link: https://go.coder-hub.com/76349386.html