Solr text search not working with long queries

Question

I'm not an expert in Solr and I'm trying it out to check its capabilities.

I'm seeing odd behaviour: I get good results if my text search query has at most three words, and zero results if the query is longer.

What I did:

  1. Created a Docker container running Solr, with a core named my_core:

    docker run -d -p 8983:8983 --name my_solr solr solr-precreate my_core

  2. In the dashboard, created a new field named campo_teste with the type text_pt, because I need to index a dataset of Portuguese texts.

  3. Added and indexed my corpus with pysolr.

  4. Now at query time, when I search for "subsídio de parentalidade" I get results that make sense.

  5. But if I use a longer sentence I get zero results, for example with the same terms embedded in the longer sentence "quando posso pedir o subsídio de parentalidade?".

Any ideas of what might be causing this issue?

Answer 1

Score: 2

You're not searching in the same field for all your values; in the first example you're searching for subsídio in the campo_text field and de parentalidade in the default search field (since you didn't prefix those values with a field name).

In your second example you're searching for quando in the campo_text field and posso pedir o subsídio de parentalidade in the default search field (since you're not prefixing those values).

In effect, subsídio is present in campo_text, while quando is not - the default search field (by default _text_) probably has no content, so no hits are produced.
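
The field assignment described above can be made concrete with a toy model. This is only a sketch and not Solr's actual parser: it ignores phrases, grouping and operators, and the function name assign_fields is made up for illustration. It also assumes the queries had the shape campo_text:..., since the original screenshots are not preserved.

```python
def assign_fields(query, default_field="_text_"):
    """Toy model: a 'field:' prefix applies only to the term it is attached to.

    Every bare term falls back to the default search field.
    """
    pairs = []
    for token in query.split():
        if ":" in token:
            field, term = token.split(":", 1)
        else:
            field, term = default_field, token
        pairs.append((field, term))
    return pairs


short_query = assign_fields("campo_text:subsídio de parentalidade")
long_query = assign_fields("campo_text:quando posso pedir o subsídio de parentalidade")

# Only the first term of each query is looked up in campo_text.
print(short_query[0])
print(long_query[0])
```

In this model the only term that reaches campo_text is the first one, which would explain why a query starting with subsídio matches while one starting with quando does not.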

If you want to support general user queries, it's usually a better idea to use the edismax query parser with the qf (query fields) setting:

q=quando posso pedir o subsídio de parentalidade&defType=edismax&qf=campo_text

This will search campo_text using all the words. You can then use q.op=AND or q.op=OR to control whether all words need to be present, or use mm (minimum match) for finer-grained control over how many of them must match.
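
As a rough sketch of how those parameters might be assembled from Python, the snippet below builds the request parameters and URL-encodes them with the standard library only. The helper name edismax_params and the default operator are illustrative assumptions, not part of the answer.

```python
from urllib.parse import urlencode


def edismax_params(user_query, field="campo_text", operator="OR"):
    """Build Solr request parameters that send the whole user query to one field."""
    return {
        "q": user_query,        # raw user input, no field prefix needed
        "defType": "edismax",   # select the edismax query parser
        "qf": field,            # field(s) that every term is searched in
        "q.op": operator,       # "AND" requires all terms, "OR" any of them
    }


params = edismax_params("quando posso pedir o subsídio de parentalidade")
query_string = urlencode(params)
print(query_string)
```

With pysolr, the same dictionary (minus q) can usually be passed as keyword arguments to search, for example solr.search(params["q"], **{k: v for k, v in params.items() if k != "q"}); that call is untested here and assumes a reachable core.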

Answer 2

Score: 1

The issue is due to the way Solr parses the query. The answer from MatsLindh explains how Solr searches for words in a field. Suppose you want to search a field, for example campo_text, for the text:

I want a burguer.

Then the parsed query inside Solr should be

parsedquery: 'campo_text:I campo_text:want campo_text:a campo_text:burguer'

(this type of query can be accessed when using the debug=all parameter)
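
A small sketch of reading that value back, assuming the JSON response has already been decoded into a Python dict. The helper name parsed_query is hypothetical, and the sample dict is a trimmed stand-in for a real response that reuses the debug values quoted in this answer.

```python
def parsed_query(response):
    """Return the parsed query from a decoded Solr response, or None if absent.

    The 'debug' section is only present when the request asked for it,
    for example with debug=all.
    """
    return response.get("debug", {}).get("parsedquery")


# Trimmed stand-in for a decoded response (not a full Solr response).
sample_response = {
    "debug": {
        "rawquerystring": "text:I\\ want\\ a\\ burguer",
        "parsedquery": "text:i text:want text:a text:burguer",
    }
}

print(parsed_query(sample_response))
print(parsed_query({"response": {"numFound": 0}}))
```

Comparing this string for a short and a long query is a quick way to check which field each term was actually sent to.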

On my end, I tried the solution provided by MatsLindh but noticed that using defType=edismax turns the query into the following:

{'rawquerystring': 'I want a burguer',
  'querystring': 'I want a burguer',
  'parsedquery': '+(DisjunctionMaxQuery((text:i)) DisjunctionMaxQuery((text:want)) DisjunctionMaxQuery((text:a)) DisjunctionMaxQuery((text:burguer)))',
  'parsedquery_toString': '+((text:i) (text:want) (text:a) (text:burguer))'}

My implementation is in Python, and luckily there is a package named solrq whose Q class lets you bind your text to the field you want to search. In my example I used Q(text='I want a burguer'). Debugging the same query, I now get:

{'rawquerystring': 'text:I\\ want\\ a\\ burguer',
  'querystring': 'text:I\\ want\\ a\\ burguer',
  'parsedquery': 'text:i text:want text:a text:burguer',
  'parsedquery_toString': 'text:i text:want text:a text:burguer'}
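
The escaped spaces in rawquerystring are what keep the whole sentence attached to the text field, so the field's analyzer sees every word. The sketch below imitates that effect with the standard library only. It is not solrq's implementation: the function names and the exact set of escaped characters are assumptions made for illustration.

```python
import re

# Whitespace plus characters that carry meaning in Solr's query syntax.
# Single & and | are escaped too, which is harmless for this purpose.
_SPECIAL = re.compile(r'([+\-!(){}\[\]^"~*?:\\/&|\s])')


def escape_term(text):
    """Backslash-escape whitespace and query-syntax characters in user text."""
    return _SPECIAL.sub(r"\\\1", text)


def field_query(field, text):
    """Bind the whole (escaped) text to a single field."""
    return f"{field}:{escape_term(text)}"


print(field_query("text", "I want a burguer"))
print(field_query("campo_text", "quando posso pedir o subsídio de parentalidade?"))
```

Escaping the trailing question mark also stops it from being read as a single-character wildcard.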

I tested both ways of building the search query (defType='edismax' and the Q parser) in an experiment I was working on, where I measure how often the correct document appears in the top k retrieved documents. In my case the Q parser gave better results:

                       top_1   top_3   top_5   top_10
Q_parser_bm25          0.3054  0.4469  0.4988  0.5649
defType_edismax_bm25   0.2736  0.4009  0.4493  0.4988
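
For reference, a top-k accuracy like the one in the table can be computed along these lines. This is a minimal sketch that assumes exactly one correct document per query; the document ids below are invented toy data and have nothing to do with the numbers reported above.

```python
def top_k_accuracy(rankings, correct_ids, k):
    """Fraction of queries whose correct document is among the first k results.

    rankings:    one ranked list of document ids per query
    correct_ids: the single correct document id for each query
    """
    hits = sum(
        1
        for ranked, correct in zip(rankings, correct_ids)
        if correct in ranked[:k]
    )
    return hits / len(correct_ids)


# Toy data: two queries with three retrieved documents each.
rankings = [["d1", "d2", "d3"], ["d9", "d4", "d5"]]
correct_ids = ["d2", "d5"]

for k in (1, 2, 3):
    print(k, top_k_accuracy(rankings, correct_ids, k))
```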

huangapple
  • Posted on May 30, 2023 at 04:52:51
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76360271.html