Lucene | How to find prefix matches at beginning of field?

Question

I want to match prefixes near the start of a field. I have the code below, but it does not match prefixes; it only matches when the search term is a complete word in the field. There seems to be no way to combine SpanTermQuery and PrefixQuery.

        var nameTerm = new Term("name", searchTerm);

        var prefixName = new PrefixQuery(nameTerm);

        var prefixAtStart = new BooleanQuery
        {
            { prefixName, Occur.MUST },
            {  new SpanFirstQuery(new SpanTermQuery(nameTerm), 0), Occur.MUST }
        };

For example:

  • Search term: "Comp"
  • Want to find: "Computer science class" and "Comp Sci"
  • Only finding: "Comp Sci"
  • Don't want to find: "Apple's latest computer"

Can the RegexpQuery be made to understand positions?

Answer 1

Score: 1

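If you only want to match prefixes of the whole field value, you can use the field type below for your field. KeywordTokenizerFactory keeps the entire value as a single token, and LowerCaseFilterFactory lowercases it.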

<analyzer>
  <tokenizer class="solr.KeywordTokenizerFactory"/>
  <filter class="solr.LowerCaseFilterFactory"/>
</analyzer>

In that case the query would look like this:

field:comp*
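
For readers using Lucene.Net directly rather than Solr, the same idea can be sketched roughly as follows. This is a minimal sketch, assuming Lucene.Net 4.8 with the Lucene.Net.Analysis.Common package; the in-memory index, the `name` field, and the three sample values (taken from the question) are only there for illustration.

```csharp
using System;
using Lucene.Net.Analysis;
using Lucene.Net.Analysis.Core;
using Lucene.Net.Documents;
using Lucene.Net.Index;
using Lucene.Net.Search;
using Lucene.Net.Store;
using Lucene.Net.Util;

const LuceneVersion MatchVersion = LuceneVersion.LUCENE_48;

// The whole field value becomes one lowercased term, mirroring the Solr analyzer above.
Analyzer analyzer = Analyzer.NewAnonymous(createComponents: (fieldName, textReader) =>
{
    Tokenizer tokenizer = new KeywordTokenizer(textReader);
    TokenStream filter = new LowerCaseFilter(MatchVersion, tokenizer);
    return new TokenStreamComponents(tokenizer, filter);
});

using var directory = new RAMDirectory();
using (var writer = new IndexWriter(directory, new IndexWriterConfig(MatchVersion, analyzer)))
{
    // Sample values from the question (illustrative data only).
    foreach (var name in new[] { "Computer science class", "Comp Sci", "Apple's latest computer" })
    {
        writer.AddDocument(new Document { new TextField("name", name, Field.Store.YES) });
    }
}

using var indexReader = DirectoryReader.Open(directory);
var searcher = new IndexSearcher(indexReader);

// A PrefixQuery built by hand is not analyzed, so the input is lowercased here.
var query = new PrefixQuery(new Term("name", "Comp".ToLowerInvariant()));
foreach (var hit in searcher.Search(query, 10).ScoreDocs)
{
    Console.WriteLine(searcher.Doc(hit.Doc).Get("name"));
}
```

Because each value is indexed as a single term, the prefix can only match at the very start of the field, so this should list "Computer science class" and "Comp Sci" but not "Apple's latest computer". The trade-off is that a prefix of a later word (for example "sci") no longer matches on this field.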

Now, for the second case, where you need an n-gram filter, you can use the field type below for your field.

<field name="text_prefix" type="text_prefix" indexed="true" stored="false"/>

<fieldType name="text_prefix" class="solr.TextField" positionIncrementGap="100">
    <analyzer type="index">
        <tokenizer class="solr.LowerCaseTokenizerFactory"/>
        <filter class="solr.EdgeNGramFilterFactory" minGramSize="3" maxGramSize="15" side="front"/>
    </analyzer>
    <analyzer type="query">
        <tokenizer class="solr.LowerCaseTokenizerFactory"/>
    </analyzer>
</fieldType>
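
To make it easier to see which terms such a field ends up containing, here is a library-free sketch of the index-time analysis. It is a simplified stand-in, not Solr's or Lucene's actual implementation: `Tokenize` and `EdgeNGrams` are hypothetical helpers that approximate a lowercase letter tokenizer followed by front edge n-grams with minGramSize=3 and maxGramSize=15.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical helper: split on non-letters and lowercase, roughly like LowerCaseTokenizer.
static IEnumerable<string> Tokenize(string text) =>
    Regex.Matches(text.ToLowerInvariant(), @"\p{L}+").Cast<Match>().Select(m => m.Value);

// Hypothetical helper: leading substrings of a token, from minGram up to maxGram characters.
// Tokens shorter than minGram produce no grams in this sketch.
static IEnumerable<string> EdgeNGrams(string token, int minGram, int maxGram)
{
    for (int length = minGram; length <= Math.Min(maxGram, token.Length); length++)
    {
        yield return token.Substring(0, length);
    }
}

foreach (var name in new[] { "Computer science class", "Apple's latest computer" })
{
    var grams = Tokenize(name).SelectMany(token => EdgeNGrams(token, 3, 15)).ToList();
    Console.WriteLine($"{name} -> {string.Join(" ", grams)}");
    Console.WriteLine($"  contains \"comp\": {grams.Contains("comp")}");
}
```

Both sample values produce the gram "comp", because n-grams are generated for every word and not only for the first one. A plain term query for "comp" against this field would therefore also match "Apple's latest computer", so this field type covers word-prefix matching in general but does not by itself restrict matches to the start of the field.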

Answer 2

Score: 0

Translating Abhijit's answer, here is the Lucene.Net way to set up the EdgeNGramFilter:

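using System.IO;
using Lucene.Net.Analysis;
using Lucene.Net.Analysis.NGram;
using Lucene.Net.Analysis.Standard;
using Lucene.Net.Util;

// Emits the leading 3 to 10 characters of every token.
public class CustomAnalyzer : Analyzer
{
    protected override TokenStreamComponents CreateComponents(string fieldName, TextReader reader)
    {
        // Split the field value into word tokens.
        Tokenizer tokenizer = new StandardTokenizer(LuceneVersion.LUCENE_48, reader);

        // Expand each token into its edge n-grams (minGram = 3, maxGram = 10).
        TokenFilter filter = new EdgeNGramTokenFilter(LuceneVersion.LUCENE_48, tokenizer, 3, 10);

        return new TokenStreamComponents(tokenizer, filter);
    }
}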
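
An edge n-gram analyzer like this matches a prefix of any word, so it does not by itself enforce the "start of the field" condition from the question. One possible way to add that condition, offered as a sketch and not as part of either answer, is to wrap a PrefixQuery in SpanMultiTermQueryWrapper so that it can be passed to SpanFirstQuery. The assumptions: Lucene.Net 4.8, a field tokenized per word with positions, and terms lowercased at index time. `PrefixAtStart` and `Build` are hypothetical names.

```csharp
using Lucene.Net.Index;
using Lucene.Net.Search;
using Lucene.Net.Search.Spans;

// Hypothetical helper, not part of Lucene.Net.
public static class PrefixAtStart
{
    // maxEndPosition = 1 restricts the match to the first token of the field;
    // a larger value allows the prefix to match within the first few tokens.
    public static Query Build(string field, string searchTerm, int maxEndPosition = 1)
    {
        // Assumption: the index lowercases terms, and PrefixQuery input is not analyzed.
        var prefix = new PrefixQuery(new Term(field, searchTerm.ToLowerInvariant()));

        // Turns the multi-term prefix query into a span query that carries positions.
        var spanPrefix = new SpanMultiTermQueryWrapper<PrefixQuery>(prefix);

        // Keeps only spans that end at or before maxEndPosition.
        return new SpanFirstQuery(spanPrefix, maxEndPosition);
    }
}
```

A call such as `PrefixAtStart.Build("name", "Comp")` would replace the BooleanQuery from the question. Since RegexpQuery is also a multi-term query, it can be wrapped the same way, which is probably the closest thing to making it position-aware.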

huangapple
  • Posted on June 8, 2023 at 07:35:45
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76427704.html