部分文本 – 格式化日期匹配与Elasticsearch正则表达式

huangapple go评论88阅读模式
英文:

Partial text-formatted date match with Elasticsearch regexp

问题

I formatted a date field dob as text dob.strftime("%m/%d/%Y") and stored these dates on Elasticsearch 8.7.1 ("lucene_version": "9.5.0") so I could utilize regexp to do partial date matching.

Suppose this date is stored on Elasticsearch: 06/01/2023, I noticed that I was only able to get this result back when using these regexp queries:

  • 06.*
  • .*01.*
  • .*2023

However, using / or \/ or \\/ in the query couldn't get back any result. Double checked on Elasticsearch's doc, / is NOT a reserved character.

I have a few questions, help would be much appreciated:

  1. Why is / not working as a part of the regexp query?
  2. How to correctly formulate the regexp query? I wish I could find matches after typing a term that's matching any of the following format:
  1. - M/d
  2. - M/d/YY
  3. - M/d/YYYY
  4. - M/dd
  5. - M/dd/YY
  6. - M/dd/YYYY
  7. - M/YY
  8. - M/YYYY
  9. - MM/d
  10. - MM/d/YY
  11. - MM/d/YYYY
  12. - MM/dd
  13. - MM/dd/YY
  14. - MM/dd/YYYY
  15. - MM/YY
  16. - MM/YYYY
  1. Does any other types of search work better than regexp?
英文:

I formatted a date field dob as text dob.strftime("%m/%d/%Y") and stored these dates on Elasticsearch 8.7.1 ("lucene_version": "9.5.0") so I could utilize regexp to do partial date matching.

Suppose this date is stored on Elasticsearch: 06/01/2023, I noticed that I was only able to get this result back when using these regexp queries:

  • 06.*
  • .*01.*
  • .*2023

However, using / or \/ or \\/ in the query couldn't get back any result. Double checked on Elasticsearch's doc, / is NOT a reserved character.

I have a few questions, help would be much appreciated:

  1. Why is / not working as a part of the regexp query?
  2. How to correctly formulate the regexp query? I wish I could find matches after typing a term that's matching any of the following format:
  1. - M/d
  2. - M/d/YY
  3. - M/d/YYYY
  4. - M/dd
  5. - M/dd/YY
  6. - M/dd/YYYY
  7. - M/YY
  8. - M/YYYY
  9. - MM/d
  10. - MM/d/YY
  11. - MM/d/YYYY
  12. - MM/dd
  13. - MM/dd/YY
  14. - MM/dd/YYYY
  15. - MM/YY
  16. - MM/YYYY
  1. Does any other types of search work better than regexp?

答案1

得分: 0

根据这份官方文档/确实是一个保留字符。当在请求体中使用JSON时,需要两个前置反斜杠(\),因为反斜杠在JSON字符串中是一个保留的转义字符。

regexp查询在文本字段和关键字字段上的工作方式不同。Elasticsearch在应用正则表达式之前会分析字段。文本字段会被分词为单独的单词,因此使用/无法找到任何匹配项。

相反,整个关键字字段字符串被视为单个且非分析的字符串(参见关键字分析器)。在我改用关键字字段后,可以使用/和正则表达式进行搜索:

  • dob = fields.TextField(fields={"raw": fields.KeywordField()})
  1. "mappings": {
  2. "properties": {
  3. "dob": {
  4. "type": "text",
  5. "fields": {
  6. "raw": {
  7. "type": "keyword"
  8. }
  9. }
  10. }
  11. }
  12. }
  1. {
  2. "query": {
  3. "regexp": {
  4. "dob.raw": ".*6\\/.*2023.*"
  5. }
  6. }
  7. }
英文:

According to this official doc, / is indeed a reserved character. When using JSON for the request body, two preceding backslashes (\) are required since the backslash is a reserved escaping character in JSON strings.

regexp query works differently for text and keyword fields. Elasticsearch analyzes fields before applying regex. Text fields are tokenized into individual words so using / couldn't find any match.

Instead, the entire keyword field string is treated as a single and non-analyzed string (see Keyword analyzer). Searching with / and regexp worked after I used a keyword field instead:

  • dob = fields.TextField(fields={"raw": fields.KeywordField()})
  1. "mappings": {
  2. "properties": {
  3. "dob": {
  4. "type": "text",
  5. "fields": {
  6. "raw": {
  7. "type": "keyword"
  8. }
  9. }
  10. }
  11. }
  1. {
  2. "query": {
  3. "regexp": {
  4. "dob.raw": ".*6\\/.*2023.*"
  5. }
  6. }
  7. }

huangapple
  • 本文由 发表于 2023年6月2日 05:14:01
  • 转载请务必保留本文链接:https://go.coder-hub.com/76385754.html
  • elasticsearch
  • elasticsearch-dsl
  • elasticsearch-dsl-py
  • elasticsearch-py
匿名

发表评论

匿名网友

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen:

确定