Hibernate search sorting with collation
Question
I upgraded Hibernate Search from 4.3.0.Final to the latest stable version, 5.11.5.Final (as declared in the POM below). Everything works except sorting Norwegian words. In the old version, SortField had a constructor that took a locale:
/** Creates a sort, possibly in reverse, by terms in the given field sorted
 * according to the given locale.
 * @param field Name of field to sort by, cannot be <code>null</code>.
 * @param locale Locale of values in the field.
 */
public SortField (String field, Locale locale, boolean reverse) {
    initFieldType(field, STRING);
    this.locale = locale;
    this.reverse = reverse;
}
But in the new version of Hibernate Search, SortField no longer takes a locale. According to the Hibernate Search reference documentation (https://docs.jboss.org/hibernate/stable/search/reference/en-US/html_single/#_analysis), sorting words in foreign languages requires a CollationKeyFilterFactory inside a normalizer. But there is no such class in this version of Hibernate Search. Maven POM:
<dependency>
    <groupId>org.hibernate</groupId>
    <artifactId>hibernate-search-orm</artifactId>
    <version>5.11.5.Final</version>
</dependency>
The question: what should I use or create in Hibernate Search to sort Norwegian words?
Currently I get this sort order:
> atest, btest, ctest, ztest, åtest, ætest, øtest
The correct order:
> atest, btest, ctest, ztest, ætest, øtest, åtest
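For reference, the two orderings above can be reproduced with the JDK alone. This is a minimal sketch, not Hibernate Search code: the class name NorwegianOrderDemo and its methods are made up for illustration, and the collated result depends on the locale data shipped with the JRE. A plain string sort compares UTF-16 code units, which is why å (U+00E5) lands before æ (U+00E6) and ø (U+00F8).

```java
import java.text.Collator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

final class NorwegianOrderDemo {

    // \u00e5 = å, \u00e6 = æ, \u00f8 = ø (escaped so the file compiles under any source encoding)
    static final List<String> WORDS = List.of(
            "ztest", "\u00f8test", "atest", "\u00e5test", "ctest", "\u00e6test", "btest");

    /** Sorts by UTF-16 code unit, which is what a plain string sort does. */
    static List<String> sortByCodeUnit(List<String> words) {
        List<String> copy = new ArrayList<>(words);
        copy.sort(null); // a null comparator means natural String ordering
        return copy;
    }

    /** Sorts with the JDK collator for the given locale. */
    static List<String> sortWithCollator(List<String> words, Locale locale) {
        List<String> copy = new ArrayList<>(words);
        copy.sort(Collator.getInstance(locale));
        return copy;
    }

    public static void main(String[] args) {
        // Code-unit order: the order the question currently gets.
        System.out.println(sortByCodeUnit(WORDS));
        // Locale-aware order: result depends on the JRE's Norwegian collation rules.
        System.out.println(sortWithCollator(WORDS, new Locale("no", "NO")));
    }
}
```

The index-side problem is the same: unless the sort field stores something derived from a collator, terms are compared by their raw encoding.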
There is a CollationKeyAnalyzer class, but I do not know how to use it for sorting:
public final class CollationKeyAnalyzer extends Analyzer {
    private final CollationAttributeFactory factory;

    /**
     * Create a new CollationKeyAnalyzer, using the specified collator.
     *
     * @param collator CollationKey generator
     */
    public CollationKeyAnalyzer(Collator collator) {
        this.factory = new CollationAttributeFactory(collator);
    }

    @Override
    protected TokenStreamComponents createComponents(String fieldName) {
        KeywordTokenizer tokenizer = new KeywordTokenizer(factory, KeywordTokenizer.DEFAULT_BUFFER_SIZE);
        return new TokenStreamComponents(tokenizer, tokenizer);
    }
}
Very similar question without answer: https://stackoverflow.com/questions/39264308/how-to-do-case-insensitive-sorting-of-norwegian-characters-%c3%86-%c3%98-and-%c3%85-using-h
Answer 1
Score: 1
I'm not sure how much it helps you, but CollationKeyFilterFactory was deprecated and indeed removed.
In the class's Javadoc it says:
> Deprecated.
> Use CollationKeyAnalyzer instead.
You can find the Javadoc here.
Answer 2
Score: 1
> But there is no such class in this version of hibernate search.
This part of the documentation looks obsolete, I'll look into updating it.
I found CollationKeyAnalyzer, but the Javadoc states that it's obsolete and that ICUCollationKeyAnalyzer should be used instead.
Try adding this dependency to your POM:
<dependency>
    <groupId>org.apache.lucene</groupId>
    <artifactId>lucene-analyzers-icu</artifactId>
    <version>5.5.5</version>
</dependency>
Then create your own analyzer class that re-implements ICUCollationKeyAnalyzer with a hard-coded locale:
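import com.ibm.icu.text.Collator;
import java.util.Locale;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.core.KeywordTokenizer;
import org.apache.lucene.collation.ICUCollationAttributeFactory;
import org.apache.lucene.util.Version;

public class MyCollationKeyAnalyzer extends Analyzer {
    private final ICUCollationAttributeFactory factory;

    public MyCollationKeyAnalyzer(Version luceneVersion) {
        // ICU collator for Norwegian Bokmål; Locale has no getInstance(String), so build the locale explicitly.
        this.factory = new ICUCollationAttributeFactory(Collator.getInstance(new Locale("nb", "NO")));
    }

    @Override
    protected TokenStreamComponents createComponents(String fieldName) {
        // Emit the whole field value as a single token, encoded as a collation key.
        KeywordTokenizer tokenizer = new KeywordTokenizer(factory, KeywordTokenizer.DEFAULT_BUFFER_SIZE);
        return new TokenStreamComponents(tokenizer, tokenizer);
    }
}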
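The idea behind a collation-key analyzer can be illustrated without Lucene. The sketch below is an assumption-laden stand-in: it uses the JDK's java.text.Collator instead of the ICU Collator used above, and the class name CollationKeyOrderDemo is hypothetical. It shows that comparing the raw bytes of collation keys gives the same order as asking the collator to compare the strings, which is what lets an index sort on stored keys without knowing the locale at query time.

```java
import java.text.Collator;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.Locale;

final class CollationKeyOrderDemo {

    /** Sorts by comparing raw collation-key bytes, the way an index compares stored terms. */
    static List<String> sortByKeyBytes(List<String> words, Collator collator) {
        Comparator<String> byKeyBytes = (a, b) -> Arrays.compareUnsigned(
                collator.getCollationKey(a).toByteArray(),
                collator.getCollationKey(b).toByteArray());
        List<String> copy = new ArrayList<>(words);
        copy.sort(byKeyBytes);
        return copy;
    }

    /** Sorts by asking the collator to compare the original strings directly. */
    static List<String> sortByCollator(List<String> words, Collator collator) {
        List<String> copy = new ArrayList<>(words);
        copy.sort(collator);
        return copy;
    }

    public static void main(String[] args) {
        // JDK collator used as a stand-in for the ICU collator in the analyzer above.
        Collator collator = Collator.getInstance(new Locale("no", "NO"));
        // \u00e5 = å, \u00e6 = æ, \u00f8 = ø
        List<String> words = List.of(
                "ztest", "\u00f8test", "atest", "\u00e5test", "ctest", "\u00e6test", "btest");
        System.out.println(sortByKeyBytes(words, collator));
        System.out.println(sortByCollator(words, collator));
    }
}
```

Both lines should print the same list; the byte comparison never consults the collator's rules, only the keys it produced at "index time".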
Then create your field:
@Entity
@Indexed
public class MyEntity {
    // ...
    @Field(name = "title_sort", index = Index.NO, normalizer = @Normalizer(impl = MyCollationKeyAnalyzer.class))
    @SortableField(forField = "title_sort")
    private String title;
    // ...
}
Then sort on that field like this:
FullTextEntityManager ftEm = Search.getFullTextEntityManager( entityManager );
QueryBuilder qb = ...; // The usual
Query luceneQuery = ...; // The usual
FullTextQuery ftQuery = ftEm.createFullTextQuery( luceneQuery, MyEntity.class );
ftQuery.setSort( qb.sort().byField( "title_sort" ).createSort() );
ftQuery.setMaxResults( 20 );
List<MyEntity> hits = ftQuery.getResultList();
I didn't try this though, so let us know if it worked for you.
Answer 3
Score: 1
To fix the sorting I created my own NorwegianCollationFactory. It is not a perfect solution, as I copied code from the old version of Hibernate Search (IndexableBinaryStringTools.class), but it works fine.
NorwegianCollationFactory class:
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.util.TokenFilterFactory;
import java.text.Collator;
import java.util.Locale;
import java.util.Map;

public final class NorwegianCollationFactory extends TokenFilterFactory {

    public NorwegianCollationFactory(Map<String, String> args) {
        super(args);
    }

    @Override
    public TokenStream create(TokenStream input) {
        Collator norwegianCollator = Collator.getInstance(new Locale("no", "NO"));
        return new CollationKeyFilter(input, norwegianCollator);
    }
}
CollationKeyFilter class:
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import java.io.IOException;
import java.text.Collator;
import java.util.Objects;

public final class CollationKeyFilter extends TokenFilter {

    // This code is copied from IndexableBinaryStringTools.class from the old version of hibernate search 4.3.0.Final
    private static final CollationKeyFilter.CodingCase[] CODING_CASES = {
        new CollationKeyFilter.CodingCase(7, 1),
        new CollationKeyFilter.CodingCase(14, 6, 2),
        new CollationKeyFilter.CodingCase(13, 5, 3),
        new CollationKeyFilter.CodingCase(12, 4, 4),
        new CollationKeyFilter.CodingCase(11, 3, 5),
        new CollationKeyFilter.CodingCase(10, 2, 6),
        new CollationKeyFilter.CodingCase(9, 1, 7),
        new CollationKeyFilter.CodingCase(8, 0)
    };

    private final Collator collator;
    private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);

    public CollationKeyFilter(TokenStream input, Collator collator) {
        super(input);
        this.collator = (Collator) collator.clone();
    }

    @Override
    public boolean incrementToken() throws IOException {
        if (input.incrementToken()) {
            byte[] collationKey = collator.getCollationKey(termAtt.toString()).toByteArray();
            int encodedLength = getBinaryStringEncodedLength(collationKey.length);
            termAtt.resizeBuffer(encodedLength);
            termAtt.setLength(encodedLength);
            encodeToBinaryString(collationKey, collationKey.length, termAtt.buffer());
            return true;
        } else {
            return false;
        }
    }

    // This code is copied from IndexableBinaryStringTools class from the old version of hibernate search 4.3.0.Final
    private void encodeToBinaryString(byte[] inputArray, int inputLength, char[] outputArray) {
        if (inputLength > 0) {
            int inputByteNum = 0;
            int caseNum = 0;
            int outputCharNum = 0;
            CollationKeyFilter.CodingCase codingCase;
            for (; inputByteNum + CODING_CASES[caseNum].numBytes <= inputLength; ++outputCharNum) {
                codingCase = CODING_CASES[caseNum];
                if (codingCase.numBytes == 2) {
                    outputArray[outputCharNum] = (char) (((inputArray[inputByteNum] & 0xFF) << codingCase.initialShift)
                            + (((inputArray[inputByteNum + 1] & 0xFF) >>> codingCase.finalShift) & codingCase.finalMask) & (short) 0x7FFF);
                } else {
                    outputArray[outputCharNum] = (char) (((inputArray[inputByteNum] & 0xFF) << codingCase.initialShift)
                            + ((inputArray[inputByteNum + 1] & 0xFF) << codingCase.middleShift)
                            + (((inputArray[inputByteNum + 2] & 0xFF) >>> codingCase.finalShift) & codingCase.finalMask) & (short) 0x7FFF);
                }
                inputByteNum += codingCase.advanceBytes;
                if (++caseNum == CODING_CASES.length) {
                    caseNum = 0;
                }
            }
            codingCase = CODING_CASES[caseNum];
            if (inputByteNum + 1 < inputLength) {
                outputArray[outputCharNum++] = (char) ((((inputArray[inputByteNum] & 0xFF) << codingCase.initialShift)
                        + ((inputArray[inputByteNum + 1] & 0xFF) << codingCase.middleShift)) & (short) 0x7FFF);
                outputArray[outputCharNum] = (char) 1;
            } else if (inputByteNum < inputLength) {
                outputArray[outputCharNum++] = (char) (((inputArray[inputByteNum] & 0xFF) << codingCase.initialShift) & (short) 0x7FFF);
                outputArray[outputCharNum] = caseNum == 0 ? (char) 1 : (char) 0;
            } else {
                outputArray[outputCharNum] = (char) 1;
            }
        }
    }

    // This code is copied from IndexableBinaryStringTools class from the old version of hibernate search 4.3.0.Final
    private int getBinaryStringEncodedLength(int inputLength) {
        return (int) ((8L * inputLength + 14L) / 15L) + 1;
    }

    // This code is copied from IndexableBinaryStringTools class from the old version of hibernate search 4.3.0.Final
    private static class CodingCase {
        int numBytes;
        int initialShift;
        int middleShift;
        int finalShift;
        int advanceBytes = 2;
        short middleMask;
        short finalMask;

        CodingCase(int initialShift, int middleShift, int finalShift) {
            this.numBytes = 3;
            this.initialShift = initialShift;
            this.middleShift = middleShift;
            this.finalShift = finalShift;
            this.finalMask = (short) ((short) 0xFF >>> finalShift);
            this.middleMask = (short) ((short) 0xFF << middleShift);
        }

        CodingCase(int initialShift, int finalShift) {
            this.numBytes = 2;
            this.initialShift = initialShift;
            this.finalShift = finalShift;
            this.finalMask = (short) ((short) 0xFF >>> finalShift);
            if (finalShift != 0) {
                advanceBytes = 1;
            }
        }
    }

    @Override
    public boolean equals(Object o) {
        if (this == o) {
            return true;
        }
        if (o == null || getClass() != o.getClass()) {
            return false;
        }
        if (!super.equals(o)) {
            return false;
        }
        CollationKeyFilter that = (CollationKeyFilter) o;
        return Objects.equals(collator, that.collator) &&
                Objects.equals(termAtt, that.termAtt);
    }

    @Override
    public int hashCode() {
        return Objects.hash(super.hashCode(), collator, termAtt);
    }
}
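The length formula in getBinaryStringEncodedLength can be checked by hand. Each output char carries 15 bits of payload (the & 0x7FFF masks above), so n input bytes need ceil(8n / 15) chars, and the encoder writes one more trailing char after the payload. A small sketch with the same arithmetic; the class name EncodedLengthDemo is hypothetical and not part of the answer's code:

```java
final class EncodedLengthDemo {

    /** Same arithmetic as getBinaryStringEncodedLength: ceil(8n / 15) payload chars plus one trailing char. */
    static int encodedLength(int inputLength) {
        // Adding 14 before the integer division by 15 rounds the quotient up.
        return (int) ((8L * inputLength + 14L) / 15L) + 1;
    }

    public static void main(String[] args) {
        for (int n : new int[] {0, 1, 2, 15, 30}) {
            System.out.println(n + " bytes -> " + encodedLength(n) + " chars");
        }
    }
}
```

For example, 15 bytes are 120 bits, which fill exactly 8 chars of 15 bits, so the encoded length is 9 with the trailing char.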
Entity mapping example:
@Entity
@NormalizerDef(name = "textSortNormalizer",
    filters = {
        @TokenFilterDef(factory = LowerCaseFilterFactory.class),
        @TokenFilterDef(factory = PatternReplaceFilterFactory.class, params = {
            @Parameter(name = "pattern", value = "('-&\\.,\\(\\))"),
            @Parameter(name = "replacement", value = " "),
            @Parameter(name = "replace", value = "all")
        }),
        @TokenFilterDef(factory = PatternReplaceFilterFactory.class, params = {
            @Parameter(name = "pattern", value = "([^0-9\\p{L} ])"),
            @Parameter(name = "replacement", value = ""),
            @Parameter(name = "replace", value = "all")
        }),
        @TokenFilterDef(factory = NorwegianCollationFactory.class)
    }
)
public class Entity {
    @Field(name = "name_for_sort", normalizer = @Normalizer(definition = "textSortNormalizer"))
    @SortableField(forField = "name_for_sort")
    private String name;
}
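To see what the first three filters of textSortNormalizer do to a value before the collation step, here is a rough JDK-only approximation. It is a sketch under assumptions: String.toLowerCase and String.replaceAll stand in for the Lucene filters, and the class name SortNormalizerSketch is made up. Note that, as written in the mapping, the first pattern is a group of literal characters, so it only matches the exact sequence '-&.,() and not each character individually; whether a character class was intended is not clear from the source.

```java
import java.util.Locale;

final class SortNormalizerSketch {

    /** Approximates lower-casing plus the two pattern-replace steps from the mapping above. */
    static String normalize(String value) {
        String s = value.toLowerCase(Locale.ROOT);
        // First pattern, copied verbatim: matches only the literal sequence '-&.,()
        s = s.replaceAll("('-&\\.,\\(\\))", " ");
        // Second pattern: drop everything that is not a digit, a letter, or a space.
        s = s.replaceAll("([^0-9\\p{L} ])", "");
        return s;
    }

    public static void main(String[] args) {
        // \u00c5 = Å; the hyphen, parentheses and exclamation mark are removed by the second pattern.
        System.out.println(normalize("\u00c5se-Marie (Test)!"));
    }
}
```

The collation filter then sees the cleaned, lower-cased value, which is what makes the resulting sort case-insensitive.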