Apache Lucene returns NaN as score when sorting by relevance

Question

I want to order the results of my Apache Lucene search by relevance. But when I use SortField.FIELD_SCORE for sorting, the score of the resulting documents is always NaN. When I omit the sort parameter, the search works perfectly fine, and the result documents contain a valid score.

I use lucene-core 9.6.0 and lucene-analyzers-common 8.11.2 which are the most up to date versions in the Maven repository right now.

At first I thought I had messed up my index or query. But I'm able to reproduce the issue with the simplest implementation I can imagine:

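import java.io.IOException;

import org.apache.lucene.analysis.core.SimpleAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.Sort;
import org.apache.lucene.search.SortField;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.ByteBuffersDirectory;
import org.apache.lucene.store.Directory;

public class LuceneSearch {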
    public static void main(String[] args) {
        try {
            Directory directory = new ByteBuffersDirectory();

            try (IndexWriter indexWriter = new IndexWriter(directory, new IndexWriterConfig(new SimpleAnalyzer()))) {
                indexWriter.addDocument(createDocument("a very simple example"));
                indexWriter.addDocument(createDocument("another example"));
                indexWriter.addDocument(createDocument("hello world"));
            }

            IndexReader indexReader = DirectoryReader.open(directory);
            IndexSearcher indexSearcher = new IndexSearcher(indexReader);

            Query query = new TermQuery(new Term("value", "hello"));
            Sort sort = new Sort(SortField.FIELD_SCORE); // <<< this causes the problem
            TopDocs topDocs = indexSearcher.search(query, 10, sort);
            for (ScoreDoc scoreDoc : topDocs.scoreDocs) {
                System.out.println(scoreDoc.doc + " : " + scoreDoc.score);
            }

            indexReader.close();
            directory.close();
        } catch (IOException e) {
            e.printStackTrace();
        }
    }

    private static Document createDocument(String value) {
        Document document = new Document();
        document.add(new TextField("value", value, Field.Store.NO));
        return document;
    }
}

When I run this simple code, I get 2 : NaN. Without the sort parameter, I get 2 : 0.49662238. I have no idea what I'm missing here. Or could it be a bug in the library?

Edit: As @andrewJames stated in the comments, the ScoreDoc (actually FieldDoc) object contains a property fields which contains the score when using the sort parameter. After some testing, I found out that the actual score is identical in both cases (with/without sort parameter). So the sorting works correctly.

Answer 1

Score: 1

Short Answer

Sorting will work the way you expect, using your provided Sort criterion. It is equivalent to the default "relevance" sort order used by Lucene.

You can still access the relevance score, if you want to, by casting ScoreDoc to FieldDoc.


Longer Answer

The sort order defined by:

Sort sort = new Sort(SortField.FIELD_SCORE);

is the same as the default sort order - which sorts by score (relevance) from highest to lowest. So, documents will be ordered in the same way in both cases.

But when you use an explicit sort, you can no longer access the score using scoreDoc.score, as noted in the question. Instead you only get NaN (not a number).

2 : NaN

However, you can still access the score (if you want to) by casting each ScoreDoc instance to a FieldDoc. We get FieldDocs because we have added a sort field to our search.

FieldDoc extends ScoreDoc. It contains "information about how to sort the referenced document".

In our case, there is only one sort field and it is the FIELD_SCORE value.

So, to print the score, we can change this code:

for (ScoreDoc scoreDoc : topDocs.scoreDocs) {
    System.out.println(scoreDoc.doc + " : " + scoreDoc.score);
}

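to this (here topFieldDocs is the result of the same search call, assigned to a TopFieldDocs variable as described in the aside at the end of this answer):

for (ScoreDoc scoreDoc : topFieldDocs.scoreDocs) {
    FieldDoc fieldDoc = (FieldDoc) scoreDoc;
    System.out.println(scoreDoc.doc + " : " + fieldDoc.fields[0]);
}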

Now we will get the score printed, instead of NaN:

2 : 0.49662238
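
If the same printing code has to handle both sorted and unsorted searches, the cast can be wrapped in a small helper. The following is only a sketch under stated assumptions: RelevanceScores and scoreOf are hypothetical names, not part of Lucene, and the ScoreDoc and FieldDoc classes below are minimal stand-ins that mirror the public fields of the real Lucene classes so the example compiles without Lucene on the classpath. With Lucene available, the stand-ins would be deleted and org.apache.lucene.search.ScoreDoc and org.apache.lucene.search.FieldDoc imported instead.

```java
// Stand-in for org.apache.lucene.search.ScoreDoc (assumption: only the
// fields used here are mirrored; the real class has more members).
class ScoreDoc {
    final int doc;
    final float score;

    ScoreDoc(int doc, float score) {
        this.doc = doc;
        this.score = score;
    }
}

// Stand-in for org.apache.lucene.search.FieldDoc.
class FieldDoc extends ScoreDoc {
    final Object[] fields; // one value per SortField, in the order given to Sort

    FieldDoc(int doc, float score, Object[] fields) {
        super(doc, score);
        this.fields = fields;
    }
}

// Hypothetical helper, not a Lucene class.
final class RelevanceScores {

    // Returns the relevance score of a hit. Prefers hit.score when it is a
    // real number; otherwise reads the sort value at scoreFieldIndex, which
    // should be the position of SortField.FIELD_SCORE in the Sort.
    // Returns NaN when no score can be found.
    static float scoreOf(ScoreDoc hit, int scoreFieldIndex) {
        if (!Float.isNaN(hit.score)) {
            return hit.score;
        }
        if (hit instanceof FieldDoc) {
            Object[] fields = ((FieldDoc) hit).fields;
            boolean inRange = fields != null
                    && scoreFieldIndex >= 0
                    && scoreFieldIndex < fields.length;
            if (inRange && fields[scoreFieldIndex] instanceof Float) {
                return (Float) fields[scoreFieldIndex];
            }
        }
        return Float.NaN;
    }

    public static void main(String[] args) {
        // Shape of a hit from a sorted search: score is NaN, value is in fields[0].
        ScoreDoc sorted = new FieldDoc(2, Float.NaN, new Object[] {0.49662238f});
        // Shape of a hit from an unsorted search: score is set directly.
        ScoreDoc unsorted = new ScoreDoc(2, 0.49662238f);

        System.out.println(sorted.doc + " : " + scoreOf(sorted, 0));
        System.out.println(unsorted.doc + " : " + scoreOf(unsorted, 0));
    }
}
```

Both lines print the same score, because the helper falls back to fields[0] only when scoreDoc.score is NaN. As far as I know, Lucene 9.x also offers an IndexSearcher.search(Query, int, Sort, boolean doDocScores) overload that fills in scoreDoc.score when doDocScores is true, which would make the cast unnecessary; it is worth checking the Javadoc for your version before relying on it.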

Speculation: I may be wrong, but I assume the original scoreDoc.score field is NaN because it doesn't make sense to calculate it and store it here, given there is no guarantee that the applied search will use SortField.FIELD_SCORE.

I expect users will mostly want to sort by something other than score - and maybe optionally use score as a tie-breaker.

But if FIELD_SCORE is used, then the score will be available in that field, instead.


As an aside, instead of this:

TopDocs topDocs = indexSearcher.search(query, 10, sort);

You can use this:

TopFieldDocs topFieldDocs = indexSearcher.search(query, 10, sort);

This allows us to access SortField[] - the fields which were used for sorting results. This includes information about field types.

huangapple
  • Posted on May 22, 2023, 21:47:02
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76306857.html