部分文本 – 格式化日期匹配与Elasticsearch正则表达式

huangapple go评论60阅读模式
英文:

Partial text-formatted date match with Elasticsearch regexp

问题

I formatted a date field dob as text dob.strftime("%m/%d/%Y") and stored these dates on Elasticsearch 8.7.1 ("lucene_version": "9.5.0") so I could utilize regexp to do partial date matching.

Suppose this date is stored on Elasticsearch: 06/01/2023, I noticed that I was only able to get this result back when using these regexp queries:

  • 06.*
  • .*01.*
  • .*2023

However, using / or \/ or \\/ in the query couldn't get back any result. Double checked on Elasticsearch's doc, / is NOT a reserved character.

I have a few questions, help would be much appreciated:

  1. Why is / not working as a part of the regexp query?
  2. How to correctly formulate the regexp query? I wish I could find matches after typing a term that's matching any of the following format:
- M/d
- M/d/YY
- M/d/YYYY
- M/dd
- M/dd/YY
- M/dd/YYYY
- M/YY
- M/YYYY
- MM/d
- MM/d/YY
- MM/d/YYYY
- MM/dd
- MM/dd/YY
- MM/dd/YYYY
- MM/YY
- MM/YYYY
  1. Does any other types of search work better than regexp?
英文:

I formatted a date field dob as text dob.strftime("%m/%d/%Y") and stored these dates on Elasticsearch 8.7.1 ("lucene_version": "9.5.0") so I could utilize regexp to do partial date matching.

Suppose this date is stored on Elasticsearch: 06/01/2023, I noticed that I was only able to get this result back when using these regexp queries:

  • 06.*
  • .*01.*
  • .*2023

However, using / or \/ or \\/ in the query couldn't get back any result. Double checked on Elasticsearch's doc, / is NOT a reserved character.

I have a few questions, help would be much appreciated:

  1. Why is / not working as a part of the regexp query?
  2. How to correctly formulate the regexp query? I wish I could find matches after typing a term that's matching any of the following format:
- M/d
- M/d/YY
- M/d/YYYY
- M/dd
- M/dd/YY
- M/dd/YYYY
- M/YY
- M/YYYY
- MM/d
- MM/d/YY
- MM/d/YYYY
- MM/dd
- MM/dd/YY
- MM/dd/YYYY
- MM/YY
- MM/YYYY
  1. Does any other types of search work better than regexp?

答案1

得分: 0

根据这份官方文档/确实是一个保留字符。当在请求体中使用JSON时,需要两个前置反斜杠(\),因为反斜杠在JSON字符串中是一个保留的转义字符。

regexp查询在文本字段和关键字字段上的工作方式不同。Elasticsearch在应用正则表达式之前会分析字段。文本字段会被分词为单独的单词,因此使用/无法找到任何匹配项。

相反,整个关键字字段字符串被视为单个且非分析的字符串(参见关键字分析器)。在我改用关键字字段后,可以使用/和正则表达式进行搜索:

  • dob = fields.TextField(fields={"raw": fields.KeywordField()})
"mappings": {
  "properties": {
    "dob": {
      "type": "text",
      "fields": {
        "raw": {
          "type": "keyword"
        }
      }
    }
  }
}
{
  "query": {
    "regexp": {
      "dob.raw": ".*6\\/.*2023.*"
    }
  }
}
英文:

According to this official doc, / is indeed a reserved character. When using JSON for the request body, two preceding backslashes (\) are required since the backslash is a reserved escaping character in JSON strings.

regexp query works differently for text and keyword fields. Elasticsearch analyzes fields before applying regex. Text fields are tokenized into individual words so using / couldn't find any match.

Instead, the entire keyword field string is treated as a single and non-analyzed string (see Keyword analyzer). Searching with / and regexp worked after I used a keyword field instead:

  • dob = fields.TextField(fields={"raw": fields.KeywordField()})
"mappings": {
  "properties": {
    "dob": {
      "type": "text",
      "fields": {
        "raw": {
          "type": "keyword"
        }
      }
    }
 }
{
  "query": {
    "regexp": {
      "dob.raw": ".*6\\/.*2023.*"
    }
  }
}

huangapple
  • 本文由 发表于 2023年6月2日 05:14:01
  • 转载请务必保留本文链接:https://go.coder-hub.com/76385754.html
匿名

发表评论

匿名网友

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen:

确定