Elasticsearch Multi-term aggregations to retrieve duplicates
Question
In my Elasticsearch index I have duplicate docs, where some "unique" fields have the same values.
To fix them I first have to find them, so I'm using an aggregation query with min_doc_count=2. The problem is that I can only run it with one key, not with a pair of keys. This works:
GET /my_index/_search
{
  "size": 0,
  "aggs": {
    "receipts": {
      "terms": {
        "field": "key1",
        "min_doc_count": 2,
        "size": 1000000
      }
    }
  }
}
I'd like to match on two terms simultaneously, but how do I add a second field, key2?
Any ideas?
I tried with multi-terms aggregations, like this (I don't know if the syntax is correct):
GET /my_index/_search
{
  "size": 0,
  "aggs": {
    "receipts": {
      "multi_terms": {
        "terms": [
          {
            "field": "key1"
          },
          {
            "field": "key2"
          }
        ],
        "min_doc_count": 2,
        "size": 1000000
      }
    }
  }
}
but I get this error:
{
  "error" : {
    "root_cause" : [
      {
        "type" : "parsing_exception",
        "reason" : "Unknown aggregation type [multi_terms] did you mean [rare_terms]?",
        "line" : 5,
        "col" : 26
      }
    ],
    "type" : "parsing_exception",
    "reason" : "Unknown aggregation type [multi_terms] did you mean [rare_terms]?",
    "line" : 5,
    "col" : 26,
    "caused_by" : {
      "type" : "named_object_not_found_exception",
      "reason" : "[5:26] unknown field [multi_terms]"
    }
  },
  "status" : 400
}
Answer 1
Score: 1
A terms sub-aggregation can solve your issue: bucket on key1 first, then on key2 inside each key1 bucket.
GET /my_index/_search
{
  "size": 0,
  "aggs": {
    "receipts": {
      "terms": {
        "field": "key1",
        "min_doc_count": 2,
        "size": 1000000
      },
      "aggs": {
        "NAME": {
          "terms": {
            "field": "key2",
            "min_doc_count": 2,
            "size": 1000000
          }
        }
      }
    }
  }
}
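The nested buckets returned by this query can be flattened client-side into a list of duplicated (key1, key2) pairs. The following is a minimal Python sketch, assuming the response has already been parsed into a dict and that the sub-aggregation keeps the placeholder name NAME from the query above. The function name duplicate_pairs and the sample response are hypothetical, written by hand for illustration.

```python
def duplicate_pairs(response, outer="receipts", inner="NAME"):
    """Flatten nested terms buckets into (key1, key2, doc_count) tuples."""
    pairs = []
    for outer_bucket in response["aggregations"][outer]["buckets"]:
        # A key1 bucket may hold no inner buckets when no single key2 value repeats in it.
        for inner_bucket in outer_bucket[inner]["buckets"]:
            pairs.append(
                (outer_bucket["key"], inner_bucket["key"], inner_bucket["doc_count"])
            )
    return pairs


# Hand-written fragment in the shape a nested terms aggregation returns (illustrative only).
sample_response = {
    "aggregations": {
        "receipts": {
            "buckets": [
                {"key": 1, "doc_count": 3, "NAME": {"buckets": [{"key": 2, "doc_count": 2}]}},
                {"key": 5, "doc_count": 2, "NAME": {"buckets": []}},
            ]
        }
    }
}

print(duplicate_pairs(sample_response))  # [(1, 2, 2)]
```

Only the inner buckets identify true duplicates: an outer key1 bucket with two documents that differ in key2 comes back with an empty inner bucket list and is skipped.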
Answer 2
Score: 1
You can also use a script to do this:
GET /docs/_search
{
  "size": 0,
  "aggs": {
    "receipts": {
      "terms": {
        "script": "doc['key1'].value + '_' + doc['key2'].value",
        "min_doc_count": 2,
        "size": 1000000
      }
    }
  }
}
Be aware, though, that a script-based aggregation can be slower than a plain terms aggregation on a field.
Here are some sample documents:
POST docs/_doc
{
  "key1": 1,
  "key2": 2
}
POST docs/_doc
{
  "key1": 1,
  "key2": 2
}
POST docs/_doc
{
  "key1": 2,
  "key2": 1
}
And the result of the query above:
{
  "took": 6,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 3,
      "relation": "eq"
    },
    "max_score": null,
    "hits": []
  },
  "aggregations": {
    "receipts": {
      "doc_count_error_upper_bound": 0,
      "sum_other_doc_count": 0,
      "buckets": [
        {
          "key": "1_2",
          "doc_count": 2
        }
      ]
    }
  }
}
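What the scripted aggregation computes can be reproduced client-side as a sanity check. The following is a minimal Python sketch over the three sample documents above. The helper name scripted_duplicates is hypothetical, and the logic only mimics the bucket keys and the min_doc_count filter, not Elasticsearch itself.

```python
from collections import Counter


def scripted_duplicates(docs, min_doc_count=2):
    """Build key1 + '_' + key2 per document and keep keys seen at least min_doc_count times."""
    counts = Counter(f"{doc['key1']}_{doc['key2']}" for doc in docs)
    return {key: count for key, count in counts.items() if count >= min_doc_count}


# The three sample documents indexed above.
docs = [
    {"key1": 1, "key2": 2},
    {"key1": 1, "key2": 2},
    {"key1": 2, "key2": 1},
]

print(scripted_duplicates(docs))  # {'1_2': 2}
```

One caveat with concatenated keys: if the values are strings that can themselves contain an underscore, distinct pairs such as ("a_b", "c") and ("a", "b_c") collapse into the same bucket key, so a separator that cannot occur in the data is the safer choice.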