Apache Lucene returns NaN as score when sorting by relevance
Question
I want to order the results of my Apache Lucene search by relevance. But when I use SortField.FIELD_SCORE for sorting, the score of the resulting documents is always NaN. When I omit the sort parameter, the search works perfectly fine, and the result documents contain a valid score.
I use lucene-core 9.6.0 and lucene-analyzers-common 8.11.2, which are the most up-to-date versions in the Maven repository right now.
At first I thought I had messed up my index or query. But I'm able to reproduce the issue with the simplest implementation I can imagine:
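    import java.io.IOException;

    import org.apache.lucene.analysis.core.SimpleAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.document.TextField;
    import org.apache.lucene.index.DirectoryReader;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.ScoreDoc;
    import org.apache.lucene.search.Sort;
    import org.apache.lucene.search.SortField;
    import org.apache.lucene.search.TermQuery;
    import org.apache.lucene.search.TopDocs;
    import org.apache.lucene.store.ByteBuffersDirectory;
    import org.apache.lucene.store.Directory;

    public class LuceneSearch {
        public static void main(String[] args) {
            try {
                // Build a small in-memory index with three documents.
                Directory directory = new ByteBuffersDirectory();
                try (IndexWriter indexWriter = new IndexWriter(directory, new IndexWriterConfig(new SimpleAnalyzer()))) {
                    indexWriter.addDocument(createDocument("a very simple example"));
                    indexWriter.addDocument(createDocument("another example"));
                    indexWriter.addDocument(createDocument("hello world"));
                }
                IndexReader indexReader = DirectoryReader.open(directory);
                IndexSearcher indexSearcher = new IndexSearcher(indexReader);
                Query query = new TermQuery(new Term("value", "hello"));
                Sort sort = new Sort(SortField.FIELD_SCORE); // <<< this causes the problem
                TopDocs topDocs = indexSearcher.search(query, 10, sort);
                for (ScoreDoc scoreDoc : topDocs.scoreDocs) {
                    System.out.println(scoreDoc.doc + " : " + scoreDoc.score);
                }
                indexReader.close();
                directory.close();
            } catch (IOException e) {
                e.printStackTrace();
            }
        }

        private static Document createDocument(String value) {
            Document document = new Document();
            document.add(new TextField("value", value, Field.Store.NO));
            return document;
        }
    }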
When I run this simple code, I get 2 : NaN. Without the sort parameter, I get 2 : 0.49662238. I have no idea what I'm missing here. Or could it be a bug in the library?
Edit: As @andrewJames stated in the comments, the ScoreDoc (actually FieldDoc) object contains a property fields which contains the score when using the sort parameter. After some testing, I found out that the actual score is identical in both cases (with/without sort parameter). So the sorting works correctly.
Answer 1
Score: 1
Short Answer
Sorting will work the way you expect, using your provided Sort criterion. It is equivalent to the default "relevance" sort order used by Lucene.
You can still access the relevance score, if you want to, by casting ScoreDoc to FieldDoc.
Longer Answer
The sort order defined by:
    Sort sort = new Sort(SortField.FIELD_SCORE);
is the same as the default sort order, which sorts by score (relevance) from highest to lowest. So, documents will be ordered in the same way in both cases.
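To make "highest score first" concrete, here is a minimal sketch of that ordering using only the JDK. The Hit class and the scores are hypothetical stand-ins for illustration, not Lucene classes or real results, and the tie-break on ascending document ID is my assumption about how equal scores are resolved.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

public class RelevanceOrderSketch {

    // Hypothetical stand-in for a search hit; this is NOT a Lucene class.
    static final class Hit {
        final int doc;
        final float score;

        Hit(int doc, float score) {
            this.doc = doc;
            this.score = score;
        }
    }

    // Highest score first; equal scores fall back to ascending document ID
    // (the tie-break is an assumption made for this sketch).
    static final Comparator<Hit> BY_RELEVANCE =
            Comparator.comparingDouble((Hit h) -> h.score).reversed()
                    .thenComparingInt(h -> h.doc);

    // Returns a sorted copy and leaves the input list untouched.
    static List<Hit> sortByRelevance(List<Hit> hits) {
        List<Hit> sorted = new ArrayList<>(hits);
        sorted.sort(BY_RELEVANCE);
        return sorted;
    }

    public static void main(String[] args) {
        // Made-up scores, chosen only to show the ordering and the tie-break.
        List<Hit> hits = Arrays.asList(new Hit(0, 0.2f), new Hit(1, 0.5f), new Hit(2, 0.5f));
        for (Hit hit : sortByRelevance(hits)) {
            System.out.println(hit.doc + " : " + hit.score);
        }
    }
}
```

With these made-up hits, documents 1 and 2 come before document 0, and document 1 precedes document 2 because of the tie-break.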
But when you use an explicit sort, you can no longer access the score using scoreDoc.score, as noted in the question. Instead you only get NaN (not a number):
    2 : NaN
However, you can still access the score (if you want to) by casting each ScoreDoc instance to a FieldDoc. We get FieldDocs because we have added a sort field to our search.
FieldDoc extends ScoreDoc. It contains "information about how to sort the referenced document".
In our case, there is only one sort field and it is the FIELD_SCORE value.
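So, to print the score, we can change this code:
    for (ScoreDoc scoreDoc : topDocs.scoreDocs) {
        System.out.println(scoreDoc.doc + " : " + scoreDoc.score);
    }
to this (where topFieldDocs is the result returned by the sorted search):
    for (ScoreDoc scoreDoc : topFieldDocs.scoreDocs) {
        // A sorted search returns FieldDoc instances, so the cast is safe here.
        FieldDoc fieldDoc = (FieldDoc) scoreDoc;
        // fields[0] holds the value of the first (and only) sort field: the score.
        System.out.println(scoreDoc.doc + " : " + fieldDoc.fields[0]);
    }
Now we will get the score printed, instead of NaN:
    2 : 0.49662238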
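If the same printing code has to handle both sorted and unsorted results, the cast can be wrapped in a small helper. The following is only a sketch: so that it runs without Lucene on the classpath, it uses minimal stand-in classes that mirror the public fields of ScoreDoc and FieldDoc (doc, score, fields). The helper name relevanceScore and its scoreFieldIndex parameter are hypothetical, and it assumes the score sort value is stored as a Float.

```java
public class ScoreAccessSketch {

    // Stand-in mirroring org.apache.lucene.search.ScoreDoc; NOT the real class.
    static class ScoreDoc {
        final int doc;
        final float score;

        ScoreDoc(int doc, float score) {
            this.doc = doc;
            this.score = score;
        }
    }

    // Stand-in mirroring org.apache.lucene.search.FieldDoc; NOT the real class.
    static class FieldDoc extends ScoreDoc {
        final Object[] fields;

        FieldDoc(int doc, float score, Object[] fields) {
            super(doc, score);
            this.fields = fields;
        }
    }

    // Hypothetical helper: returns scoreDoc.score when it is populated,
    // otherwise the sort value at scoreFieldIndex if that value is a Float.
    // Returns NaN when no score can be found.
    static float relevanceScore(ScoreDoc scoreDoc, int scoreFieldIndex) {
        if (!Float.isNaN(scoreDoc.score)) {
            return scoreDoc.score;
        }
        if (scoreDoc instanceof FieldDoc) {
            Object[] fields = ((FieldDoc) scoreDoc).fields;
            if (fields != null
                    && scoreFieldIndex >= 0
                    && scoreFieldIndex < fields.length
                    && fields[scoreFieldIndex] instanceof Float) {
                return (Float) fields[scoreFieldIndex];
            }
        }
        return Float.NaN;
    }

    public static void main(String[] args) {
        // Mimics the two cases from the question: with and without the sort parameter.
        ScoreDoc sorted = new FieldDoc(2, Float.NaN, new Object[] {0.49662238f});
        ScoreDoc unsorted = new ScoreDoc(2, 0.49662238f);
        System.out.println(sorted.doc + " : " + relevanceScore(sorted, 0));
        System.out.println(unsorted.doc + " : " + relevanceScore(unsorted, 0));
    }
}
```

In real code you would drop the two stand-in classes, import the Lucene ones, and pass the position of FIELD_SCORE in your Sort as scoreFieldIndex (0 in the question's example).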
Speculation: I may be wrong, but I assume the original scoreDoc.score field is NaN because it doesn't make sense to calculate it and store it here, given there is no guarantee that the applied search will use SortField.FIELD_SCORE.
I expect users will mostly want to sort by something other than score, and maybe optionally use score as a tie-breaker.
But if FIELD_SCORE is used, then the score will be available in that field, instead.
As an aside, instead of this:
    TopDocs topDocs = indexSearcher.search(query, 10, sort);
You can use this:
    TopFieldDocs topFieldDocs = indexSearcher.search(query, 10, sort);
This allows us to access SortField[], the fields which were used for sorting results. This includes information about field types.