SpanNot Lucene Query being either too strict or too permissive

# Question

Given two documents with two fields each:

  1. title: United Kingdom requested meeting of United Nations
     content: The United Nations will hear statements from the United Kingdom (...)

  2. title: Airlines face scrutiny across nation
     content: United States airline United Airlines has faced increasing (...)

I'm after a Lucene query which will:

A) Match instances of the word "united", but NOT when followed by either "States" or "Kingdom", in either the title OR the content field.

B) Importantly, match both documents even though each contains both a desired and an undesired phrase.

My first port of call has been `spanNot()`, which is meant to take two `spanTerm` queries in include, exclude order, followed by a `dist` integer and a boolean indicating whether the terms should be in order. E.g.:

```plaintext
spanNot(title:united, title:states, 1, true)
```

Given this, I've chained the necessary queries using a `BooleanQuery` so that the query is this:

```plaintext
(+spanNot(title:united, title:states, 1, true) +spanNot(title:united, title:kingdom, 1, true))
(+spanNot(content:united, content:states, 1, true) +spanNot(content:united, content:kingdom, 1, true))
```

As you can see, there are two groupings of queries above, which should read logically like this: "(Title must contain united BUT NOT united states, AND title must contain united BUT NOT united kingdom) OR (Content must contain united BUT NOT united states, AND content must contain united BUT NOT united kingdom)".

Conceptually this makes perfect sense to me. However, I'm finding that the results of my query, whether the initial `spanNot` or the longer chained `BooleanQuery` version, are incorrect. Either the entire document is not matched, or every mention of the word "united" is matched, and I'm having immense trouble working out why.

For some additional detail:

I'm implementing the query builder using the Lucene Java library in Clojure, but testing the queries using Kibana's Lucene querying feature, over documents that absolutely should match.

I'm using Lucene v7.7. An upgrade is probably on the cards, but I do not believe it would solve my problem.

Any insight would be tremendously appreciated.
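To make the span-level behaviour in the question concrete, here is a minimal plain-Java sketch of how a `spanNot` might keep or drop each occurrence of the include term. It does not call Lucene. `SpanNotModel`, `positions`, and `spanNot` are hypothetical names, whitespace splitting stands in for the analyzer, the elided `(...)` is left out of the sample text, and the overlap rule is only my reading of how Lucene's `SpanNotQuery` applies its `pre` and `post` widths, so it should be checked against the Lucene version in use.

```java
import java.util.ArrayList;
import java.util.List;

// Plain-Java model of span-level exclusion; it does not call Lucene.
final class SpanNotModel {

    // Token positions at which `term` occurs (lower-cased, whitespace tokenised).
    static List<Integer> positions(String text, String term) {
        String[] tokens = text.toLowerCase().split("\\s+");
        List<Integer> out = new ArrayList<>();
        for (int i = 0; i < tokens.length; i++) {
            if (tokens[i].equals(term)) out.add(i);
        }
        return out;
    }

    // Keep an include position p unless some exclude position e lies in the
    // window [p - pre, p + post]. A single term at position p is modelled as
    // the half-open span [p, p + 1).
    static List<Integer> spanNot(List<Integer> include, List<Integer> exclude, int pre, int post) {
        List<Integer> kept = new ArrayList<>();
        for (int p : include) {
            boolean blocked = false;
            for (int e : exclude) {
                // Assumed overlap test: excludeEnd > p - pre and excludeStart - post < p + 1.
                if (e + 1 > p - pre && e - post < p + 1) {
                    blocked = true;
                    break;
                }
            }
            if (!blocked) kept.add(p);
        }
        return kept;
    }

    public static void main(String[] args) {
        String content = "United States airline United Airlines has faced increasing";
        List<Integer> united = positions(content, "united"); // [0, 3]
        List<Integer> states = positions(content, "states"); // [1]

        // No widening: the two single-term spans never overlap, so nothing is dropped.
        System.out.println(spanNot(united, states, 0, 0)); // [0, 3]
        // Widening by one position after the include term drops "united" in "united states".
        System.out.println(spanNot(united, states, 0, 1)); // [3]
    }
}
```

Under this model, an exclude term that sits directly after the include term is only caught when `post` is at least 1. With both widths at 0 the two one-token spans never overlap, which would be consistent with the "every united is matched" result. The model also shows that a document is returned as soon as a single occurrence survives, so the filtering is per occurrence rather than per document.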


# Answer 1
**Score**: 1

This was fixed after much trawling through the Lucene documentation and debugging of the source code. Here is the right way to write this query in Lucene:

    spanNot(title:united, spanOr([spanNear([title:united, title:states], 0, true), spanNear([title:united, title:kingdom], 0, true)]), 0, 0)
    spanNot(content:united, spanOr([spanNear([content:united, content:states], 0, true), spanNear([content:united, content:kingdom], 0, true)]), 0, 0)
    spanNot(summary:united, spanOr([spanNear([summary:united, summary:states], 0, true), spanNear([summary:united, summary:kingdom], 0, true)]), 0, 0)

In case that's difficult to read, it's three separate queries (one for each field). Each is a `spanNot` with a term query as the include and a `spanOr` as the exclude, and the `spanOr` is itself composed of two `spanNear` queries, one for each exclusion term.

The issue before was that there were too many combinations of exclusion terms and fields for any distribution of SHOULD and MUST to work. The right way to execute this search was one thorough query per field.
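The per-field structure described in this answer can be sketched in plain Java to check it against the two sample documents from the question. This is only a model under stated assumptions, not Lucene code: `PerFieldSpanNot`, `surviving`, and `matches` are hypothetical names, whitespace splitting stands in for the analyzer, and the elided `(...)` is left out of the sample text. Each excluded phrase is treated as a two-token span that starts at the same position as the include term, so it overlaps the include span without any widening.

```java
import java.util.ArrayList;
import java.util.List;

// Plain-Java model of "one spanNot per field with a spanOr of phrases as the exclude".
final class PerFieldSpanNot {

    // Positions of `include` that are not the first word of an excluded phrase.
    // Models spanNot(include, spanOr(spanNear([include, x], 0, true), ...), 0, 0):
    // the phrase span [p, p + 2) overlaps the include span [p, p + 1).
    static List<Integer> surviving(String text, String include, List<String> excludedFollowers) {
        String[] tokens = text.toLowerCase().split("\\s+");
        List<Integer> kept = new ArrayList<>();
        for (int p = 0; p < tokens.length; p++) {
            if (!tokens[p].equals(include)) continue;
            boolean startsExcludedPhrase =
                    p + 1 < tokens.length && excludedFollowers.contains(tokens[p + 1]);
            if (!startsExcludedPhrase) kept.add(p);
        }
        return kept;
    }

    // The per-field queries are combined as optional clauses:
    // a document matches when any field has at least one surviving position.
    static boolean matches(List<String> fields, String include, List<String> excludedFollowers) {
        for (String field : fields) {
            if (!surviving(field, include, excludedFollowers).isEmpty()) return true;
        }
        return false;
    }

    public static void main(String[] args) {
        List<String> excluded = List.of("states", "kingdom");
        List<String> doc1 = List.of(
                "United Kingdom requested meeting of United Nations",
                "The United Nations will hear statements from the United Kingdom");
        List<String> doc2 = List.of(
                "Airlines face scrutiny across nation",
                "United States airline United Airlines has faced increasing");

        System.out.println(surviving(doc1.get(0), "united", excluded)); // [5]
        System.out.println(surviving(doc2.get(1), "united", excluded)); // [3]
        System.out.println(matches(doc1, "united", excluded));          // true
        System.out.println(matches(doc2, "united", excluded));          // true
    }
}
```

Because both exclusion terms live inside the same exclude, a single occurrence of "united" has to clear both of them at once. That is the property the two separate MUST clauses in the question could not guarantee, since each clause could be satisfied by a different occurrence.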



# Answer 2
**Score**: 0

I think you are supposed to use a Boolean query combined with phrase queries. I don't know the Lucene syntax from the library (I think you need `PhraseQuery`), but with a regular request you can use the following query:

```json
{
  "query": {
    "bool": {
      "should": [
        {
          "bool": {
            "filter": [
              {
                "match": {
                  "title": "united"
                }
              }
            ],
            "must_not": [
              {
                "match_phrase": {
                  "title": "United Kingdom"
                }
              },
              {
                "match_phrase": {
                  "title": "United States"
                }
              }
            ]
          }
        },
        {
          "bool": {
            "filter": [
              {
                "match": {
                  "content": "united"
                }
              }
            ],
            "must_not": [
              {
                "match_phrase": {
                  "content": "United Kingdom"
                }
              },
              {
                "match_phrase": {
                  "content": "United States"
                }
              }
            ]
          }
        }
      ]
    }
  }
}
```
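It may help to see how this request treats the two sample documents from the question. The plain-Java sketch below models my reading of the `bool` query: each inner clause requires the term in a field (`filter`) and rejects the field if any banned phrase appears anywhere in it (`must_not`), and the outer `should` accepts a document when either clause passes. `DocLevelExclusion` and its methods are hypothetical names, whitespace splitting stands in for the analyzer, and the elided `(...)` is left out of the sample text.

```java
import java.util.List;

// Plain-Java model of document-level phrase exclusion; it does not call Elasticsearch.
final class DocLevelExclusion {

    // True when the tokens of `phrase` occur consecutively in `text`.
    static boolean containsPhrase(String text, String phrase) {
        String[] t = text.toLowerCase().split("\\s+");
        String[] p = phrase.toLowerCase().split("\\s+");
        for (int i = 0; i + p.length <= t.length; i++) {
            boolean all = true;
            for (int j = 0; j < p.length; j++) {
                if (!t[i + j].equals(p[j])) {
                    all = false;
                    break;
                }
            }
            if (all) return true;
        }
        return false;
    }

    // One inner bool clause: the field must contain the term (filter) and
    // must not contain any banned phrase anywhere (must_not).
    static boolean fieldClause(String fieldText, String term, List<String> bannedPhrases) {
        if (!containsPhrase(fieldText, term)) return false;
        for (String banned : bannedPhrases) {
            if (containsPhrase(fieldText, banned)) return false;
        }
        return true;
    }

    // Outer bool: "should" over the title clause and the content clause.
    static boolean matches(String title, String content) {
        List<String> banned = List.of("United Kingdom", "United States");
        return fieldClause(title, "united", banned) || fieldClause(content, "united", banned);
    }

    public static void main(String[] args) {
        System.out.println(matches(
                "United Kingdom requested meeting of United Nations",
                "The United Nations will hear statements from the United Kingdom")); // false
        System.out.println(matches(
                "Airlines face scrutiny across nation",
                "United States airline United Airlines has faced increasing"));      // false
        System.out.println(matches("United Nations meeting", "No relevant terms here")); // true
    }
}
```

If this reading is right, the request rejects both sample documents, because every field that contains "united" also contains one of the banned phrases. That conflicts with requirement B in the question, which asks for a document to match when a desired and an undesired phrase appear together.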

huangapple
  • Posted on 2023-05-11 00:31:10
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76220733.html