在Elasticsearch中存储和查询嵌套数据的理想结构是什么?

huangapple go评论64阅读模式
英文:

What is the ideal structure for storing and querying nested data in elasticsearch?

问题

我在Elasticsearch上运行一个使用function_score的查询,其中一个嵌套字段的值用于计算(在这种情况下是价格)。以下哪种方式更好地索引数据?

{
  "name": "John",
  "age": 50,
  "country": "US",
  "subscription": {
    "Plan1": {
      "price": 100,
      "date": "June 5th"
    },
    "Plan2": {
      "price": 50,
      "date": "June 6th"
    }
  }
}

或者

{
  "name": "John",
  "age": 50,
  "country": "US",
  "subscription": [
    {
      "name": "Plan1",
      "price": 100,
      "date": "June 5th"
    },
    {
      "name": "Plan2",
      "price": 50,
      "date": "June 6th"
    }
  ]
}

查询将在"计划名称"和"价格"上进行过滤,"价格"将用于评分计算。计划的数量可能会超过20个。

编辑1:方法2的示例查询

{
  "query": {
    "function_score": {
      "query": {
        "bool": {
          "filter": [
            {
              "range": {
                "createdatutc": {
                  "gte": "2022-11-01T00:00:00.000Z",
                  "lt": "2023-05-06T00:00:00.000Z",
                  "format": "strict_date_optional_time"
                }
              }
            },
            {
              "terms": {
                "country": [
                  "US"
                ]
              }
            },
            {
              "term": {
                "subscription.name": {
                  "value": "Plan1"
                }
              }
            }
          ]
        }
      },
      "functions": [
        {
          "filter": {
            "query_string": {
              "default_field": "name",
              "query": "\"john\""
            }
          },
          "script_score": {
            "script": {
              "lang": "painless",
              "source": "for (item in params._source.subscription) {if (item.name == 'Plan1') {return item.price}}"
            }
          }
        }
      ],
      "score_mode": "sum",
      "boost_mode": "replace"
    }
  }
}
英文:

I'm running a query on elasticsearch using function_score where the value of a nested field is used for the calculation (price in this case). Which of the following is a better way to index the data?

{
      "name": "John",
      "age": 50,
      "country": "US",
      "subscription": {
        "Plan1": {
          "price": 100,
          "date": "June 5th"
        },
        "Plan2": {
          "price": 50,
          "date": "June 6th"
        }
      }
    }

OR

{
  "name": "John",
  "age": 50,
  "country": "US",
  "subscription": [
    {
      "name": "Plan1",
      "price": 100,
      "date": "June 5th"
    },
    {
      "name": "Plan2"
      "price": 50,
      "date": "June 6th"
    }
  ]
}

The query would be filtering on the "plan name" and "price", and "price" will be used for score calculation. The number of plans maybe upwards of 20.

Edit 1: Sample query for approach 2

{
  "query": {
    "function_score": {
      "query": {
        "bool": {
          "filter": [
            {
              "range": {
                "createdatutc": {
                  "gte": "2022-11-01T00:00:00.000Z",
                  "lt": "2023-05-06T00:00:00.000Z",
                  "format": "strict_date_optional_time"
                }
              }
            },
            {
              "terms": {
                "country": [
                  "US"
                ]
              }
            },
            {
              "term": {
                "subscription.name": {
                  "value": "Plan1"
                } 
              }
            }
          ]
        }
      },
      "functions": [
        {
          "filter": {
            "query_string": {
              "default_field": "name",
              "query": "\"john\""
            }
          },
          "script_score": {
            "script": {
              "lang": "painless",
              "source": "for (item in params._source.subscription) {if (item.name == 'Plan1') {return item.price}}"
            }
          }
        }
      ],
      "score_mode": "sum",
      "boost_mode": "replace"
    }
  }
}

答案1

得分: 1

根据你有多少个计划而定。如果只有两个或者少数几个,那么第一选项更好;否则,你需要将 subscription 设计为嵌套对象,但嵌套对象在查询性能方面不够优化。

使用第一选项,可以通过单一条件 subscription.Plan1.price: 100 进行计划名称和价格的筛选,而使用第二选项则需要两个条件(因此 subscription 需要是 nested),一个是 subscription.name: Plan1,另一个是 subscription.price: 100

更新 1:使用选项 1

{
  "query": {
    "function_score": {
      "query": {
        "bool": {
          "filter": [
            {
              "range": {
                "createdatutc": {
                  "gte": "2022-11-01T00:00:00.000Z",
                  "lt": "2023-05-06T00:00:00.000Z",
                  "format": "strict_date_optional_time"
                }
              }
            },
            {
              "terms": {
                "country": [
                  "US"
                ]
              }
            },
            {
              "exists": {
                "field": "subscription.Plan1.price"
              }
            }
          ]
        }
      },
      "functions": [
        {
          "filter": {
            "query_string": {
              "default_field": "name",
              "query": "\"john\""
            }
          },
          "field_value_factor": {
            "field": "subscription.Plan1.price",
            "factor": 1.2
          }
        }
      ],
      "score_mode": "sum",
      "boost_mode": "replace"
    }
  }
}
英文:

It depends on how many plans you have. If it's just two or a handful then the first option is better, otherwise you need to make subscription a nested object and nested object are less optimal in terms of query performance.

With the first option, filtering on the plan name and price can be done with a single condition on subscription.Plan1.price: 100, while with the second option, you need two conditions (hence why subscription needs to be nested), one on subscription.name: Plan1 and another on subscription.price: 100

UPDATE 1: using option 1

{
  "query": {
    "function_score": {
      "query": {
        "bool": {
          "filter": [
            {
              "range": {
                "createdatutc": {
                  "gte": "2022-11-01T00:00:00.000Z",
                  "lt": "2023-05-06T00:00:00.000Z",
                  "format": "strict_date_optional_time"
                }
              }
            },
            {
              "terms": {
                "country": [
                  "US"
                ]
              }
            },
            {
              "exists": {
                "field": "subscription.Plan1.price"
              }
            }
          ]
        }
      },
      "functions": [
        {
          "filter": {
            "query_string": {
              "default_field": "name",
              "query": "\"john\""
            }
          },
          "field_value_factor": {
            "field": "subscription.Plan1.price",
            "factor": 1.2
          }
        }
      ],
      "score_mode": "sum",
      "boost_mode": "replace"
    }
  }
}

huangapple
  • 本文由 发表于 2023年6月26日 15:31:18
  • 转载请务必保留本文链接:https://go.coder-hub.com/76554446.html
匿名

发表评论

匿名网友

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen:

确定