Elasticsearch Multi-term aggregations to retrieve duplicates

Question

In my Elasticsearch index I have duplicate documents in which some supposedly "unique" fields share the same values.

To fix them, I first have to find them, so I'm using an aggregation query with min_doc_count=2. The problem is that I can only get it to run with a single key, not with a pair of keys. This version works:

GET /my_index/_search
{
   "size": 0,
   "aggs": {
      "receipts": {
         "terms": {
            "field": "key1",
            "min_doc_count": 2,
            "size": 1000000
          }
      }
  }
}

I'd like to match on two fields simultaneously, but how do I add a second field, key2?

Any ideas?

I tried a multi-terms aggregation, like this (I don't know whether the syntax is correct):

GET /my_index/_search
{
   "size": 0,
   "aggs": {
      "receipts": {
          "multi_terms": {
            "terms": [
              {
                "field": "key1" 
              }, 
              {
                "field": "key2"
              }
            ],
            "min_doc_count": 2,
            "size": 1000000
       }
   }
  }
}

But I get this error:

{
  "error" : {
    "root_cause" : [
      {
        "type" : "parsing_exception",
        "reason" : "Unknown aggregation type [multi_terms] did you mean [rare_terms]?",
        "line" : 5,
        "col" : 26
      }
    ],
    "type" : "parsing_exception",
    "reason" : "Unknown aggregation type [multi_terms] did you mean [rare_terms]?",
    "line" : 5,
    "col" : 26,
    "caused_by" : {
      "type" : "named_object_not_found_exception",
      "reason" : "[5:26] unknown field [multi_terms]"
    }
  },
  "status" : 400
}

Answer 1

Score: 1

A sub-aggregation can solve this: nest a second terms aggregation on key2 inside the terms aggregation on key1.

GET /my_index/_search
{
  "size": 0,
  "aggs": {
    "receipts": {
      "terms": {
        "field": "key1",
        "min_doc_count": 2,
        "size": 1000000
      },
      "aggs": {
        "NAME": {
          "terms": {
            "field": "key2",
            "min_doc_count": 2,
            "size": 1000000
          }
        }
      }
    }
  }
}
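
The response to this nested query holds one outer bucket per duplicated key1 value, each containing inner buckets for key2. As a minimal Python sketch of how the pairs could be read out, assuming the response has already been parsed into a dict and the inner aggregation keeps the placeholder name NAME from the query above (duplicate_pairs and the sample response are hypothetical, written by hand for illustration):

```python
def duplicate_pairs(response, outer="receipts", inner="NAME"):
    """Return (key1, key2, doc_count) for every pair seen at least twice."""
    pairs = []
    for outer_bucket in response["aggregations"][outer]["buckets"]:
        for inner_bucket in outer_bucket[inner]["buckets"]:
            # min_doc_count on the inner terms aggregation should already
            # drop pairs that occur only once; re-check defensively.
            if inner_bucket["doc_count"] >= 2:
                pairs.append(
                    (outer_bucket["key"], inner_bucket["key"], inner_bucket["doc_count"])
                )
    return pairs


# Hand-written sample shaped like a terms/sub-terms response (not real output).
sample_response = {
    "aggregations": {
        "receipts": {
            "buckets": [
                {
                    "key": 1,
                    "doc_count": 2,
                    "NAME": {"buckets": [{"key": 2, "doc_count": 2}]},
                }
            ]
        }
    }
}

print(duplicate_pairs(sample_response))
```

Each tuple identifies one (key1, key2) combination that appears in more than one document.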

Answer 2

Score: 1

You can also do this with a script:

GET /docs/_search
{
  "size": 0,
  "aggs": {
    "receipts": {
      "terms": {
        "script": "doc['key1'].value + '_' + doc['key2'].value",
        "min_doc_count": 2,
        "size": 1000000
      }
    }
  }
}

Be aware, though, that a script-based aggregation can be slower than a terms aggregation on a plain field.

Here are some sample documents:

POST docs/_doc
{
  "key1": 1,
  "key2": 2
}
POST docs/_doc
{
  "key1": 1,
  "key2": 2
}
POST docs/_doc
{
  "key1": 2,
  "key2": 1
}

And the result of the query above:

{
  "took": 6,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 3,
      "relation": "eq"
    },
    "max_score": null,
    "hits": []
  },
  "aggregations": {
    "receipts": {
      "doc_count_error_upper_bound": 0,
      "sum_other_doc_count": 0,
      "buckets": [
        {
          "key": "1_2",
          "doc_count": 2
        }
      ]
    }
  }
}
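
To see why only the 1_2 bucket comes back, the grouping can be reproduced locally. This is a sketch in Python, assuming the three sample documents above; composite_key is a hypothetical helper that mirrors the script's string concatenation and is not part of any Elasticsearch client:

```python
from collections import Counter

# The three sample documents indexed above.
docs = [
    {"key1": 1, "key2": 2},
    {"key1": 1, "key2": 2},
    {"key1": 2, "key2": 1},
]


def composite_key(doc, sep="_"):
    # Same idea as doc['key1'].value + '_' + doc['key2'].value in the script.
    return f"{doc['key1']}{sep}{doc['key2']}"


counts = Counter(composite_key(d) for d in docs)
# Keep only keys seen at least twice, like "min_doc_count": 2.
duplicates = {key: n for key, n in counts.items() if n >= 2}
print(duplicates)  # {'1_2': 2}
```

One caveat of the concatenation approach: if field values can themselves contain the separator, different pairs may collapse into the same key (for example "a_b" with "c" and "a" with "b_c"), so it is worth choosing a separator that cannot occur in the data.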

huangapple
  • Posted on March 3, 2023, 19:57:11
  • When reposting, please keep the link to this article: https://go.coder-hub.com/75626774.html