Solr text search not working with long queries
Question
I'm not an expert in Solr and I'm trying it out to check its capabilities.
I'm seeing odd behaviour: I get good results if my text search query has at most 3 words, and zero results if the query is longer.
What I did:
- Created a Docker container with Solr and a core named `my_core`: `docker run -d -p 8983:8983 --name my_solr solr solr-precreate my_core`
- In the dashboard, created a new field named `campo_teste` with the type `text_pt`, because I need to index a dataset of Portuguese texts.
- Added and indexed my corpus with `pysolr`.
- Now at query time, when I search for "subsídio de parentalidade" I get results that make sense:
- But if I use a longer sentence I get zero results. This is an example with the same query as before, but inside the longer sentence "quando posso pedir o subsídio de parentalidade?":
Any ideas of what might be causing this issue?
Answer 1
Score: 2
You're not searching the same field for all your terms. In the first example you're searching for subsídio in the `campo_text` field and for de parentalidade in the default search field (since you didn't prefix those terms with a field name).
In your second example you're searching for quando in the `campo_text` field and for posso pedir o subsídio de parentalidade in the default search field (again, because those terms have no field prefix).
In effect, subsídio is present in `campo_text`, while quando is not. The default search field (by default `_text_`) probably has no content, so no hits are produced.
If you want to support general user queries, it's usually a better idea to use the `edismax` query parser with the `qf` (query fields) setting:
q=quando posso pedir o subsídio de parentalidade&defType=edismax&qf=campo_text
This will search `campo_text` using all the words. You can then use `q.op=AND` or `q.op=OR` to control whether all words need to be present, or use `mm` (minimum should match) to tune the matching behaviour in more detail.
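As a rough sketch of how those parameters could be assembled from Python, the snippet below builds the request parameters with the standard library only. `edismax_params` is a hypothetical helper name; the parameter names (`defType`, `qf`, `q.op`, `mm`) are the ones discussed above, and the core URL in the comment is assumed from the `docker run` command in the question.

```python
from urllib.parse import urlencode


def edismax_params(user_text, field, operator="OR", mm=None):
    """Collect edismax request parameters for a free-text user query (hypothetical helper)."""
    params = {
        "q": user_text,        # raw user text, no field prefixes needed
        "defType": "edismax",  # switch to the edismax query parser
        "qf": field,           # field(s) every term is searched in
        "q.op": operator,      # "AND" requires all terms, "OR" any term
    }
    if mm is not None:
        params["mm"] = mm      # e.g. "75%"; finer control than q.op
    return params


params = edismax_params("quando posso pedir o subsídio de parentalidade", "campo_text")
query_string = urlencode(params)
print(query_string)

# With pysolr (assumed installed, and a core reachable at the URL below),
# the same parameters could be passed as keyword arguments, roughly:
#   solr = pysolr.Solr("http://localhost:8983/solr/my_core")
#   extra = {k: v for k, v in params.items() if k != "q"}
#   results = solr.search(params["q"], **extra)
```

Keeping the parameters in a plain dictionary makes it easy to compare `q.op=AND`, `q.op=OR`, and different `mm` values against the same query.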
Answer 2
Score: 1
The issue is due to the way Solr parses the query. The answer from MatsLindh explains how Solr searches for words in a field. If you want to search a field, for example `campo_text`, for the text:
I want a burguer.
then the parsed query inside Solr should be
parsedquery: 'campo_text:I campo_text:want campo_text:a campo_text:burguer'
(the parsed query is visible when using the debug=all parameter)
On my end, I tried the solution provided by MatsLindh but noticed that using defType=edismax turns the query into the following:
{'rawquerystring': 'I want a burguer',
'querystring': 'I want a burguer',
'parsedquery': '+(DisjunctionMaxQuery((text:i)) DisjunctionMaxQuery((text:want)) DisjunctionMaxQuery((text:a)) DisjunctionMaxQuery((text:burger)))',
'parsedquery_toString': '+((text:i) (text:want) (text:a) (text:burguer))'}
My implementation is in Python, and luckily there is a package named `solrq` which lets you build a query against the fields you want to search using the `Q` class. In my example I used `Q(text='I want a burguer')`. Debugging the same query I now get:
{'rawquerystring': 'text:I\\ want\\ a\\ burguer',
'querystring': 'text:I\\ want\\ a\\ burguer',
'parsedquery': 'text:i text:want text:a text:burguer',
'parsedquery_toString': 'text:i text:want text:a text:burguer'}
I have tested both ways of building the query (`defType='edismax'` and the `Q` class) in an experiment I was working on, where I measure how often the correct document appears in the top k retrieved documents, and I obtained better results using `Q` on my example:
| | top_1 | top_3 | top_5 | top_10 |
|---|---|---|---|---|
| Q_parser_bm25 | 0.3054 | 0.4469 | 0.4988 | 0.5649 |
| defType_edismax_bm25 | 0.2736 | 0.4009 | 0.4493 | 0.4988 |
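For readers who want to run a similar comparison, here is one way the top-k accuracy could be computed. This is a sketch under assumptions: the answer does not show its evaluation code, so `top_k_accuracy` and the toy document IDs are hypothetical, and each query is assumed to have exactly one correct document.

```python
def top_k_accuracy(ranked_ids_per_query, correct_ids, k):
    """Fraction of queries whose correct document appears in the first k results."""
    if not correct_ids or len(ranked_ids_per_query) != len(correct_ids):
        raise ValueError("need one ranked list and one correct id per query")
    hits = sum(
        1
        for ranked, correct in zip(ranked_ids_per_query, correct_ids)
        if correct in ranked[:k]
    )
    return hits / len(correct_ids)


# Toy data (hypothetical): ranked result IDs for three queries
# and the single correct document for each query.
ranked = [["d3", "d1", "d7"], ["d2", "d9", "d4"], ["d5", "d6", "d8"]]
correct = ["d1", "d4", "d0"]

for k in (1, 2, 3):
    print(f"top_{k}: {top_k_accuracy(ranked, correct, k):.4f}")
```

Running the same function over the results of each query-building approach would give one table row per approach.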