Usage of static "foldToAscii" method in ASCIIFoldingFilter

Question

I have been using the ASCII folding filter to handle diacritics, not just for documents in Elasticsearch but for various other kinds of strings.

import com.google.common.base.Strings; // assuming Guava's Strings here
import org.apache.lucene.analysis.miscellaneous.ASCIIFoldingFilter;

public static String normalizeText(String text, boolean shouldTrim, boolean shouldLowerCase) {
    if (Strings.isNullOrEmpty(text)) {
        return text;
    }
    if (shouldTrim) {
        text = text.trim();
    }
    if (shouldLowerCase) {
        text = text.toLowerCase();
    }
    char[] charArray = text.toCharArray();

    // Once a character is folded it can expand into more than one character.
    // The official documentation says the output array must be of size >= length * 4.
    char[] out = new char[charArray.length * 4 + 1];
    int outLength = ASCIIFoldingFilter.foldToASCII(charArray, 0, out, 0, charArray.length);
    return String.copyValueOf(out, 0, outLength);
}
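
For illustration, here is roughly what the method produces; the sample input and output values are mine, assuming Lucene's standard folding table:

String folded = normalizeText("  Caffè Latté  ", true, true);
System.out.println(folded); // prints "caffe latte"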

However, as per the official documentation, the method has the note "This API is for internal purposes only and might change in incompatible ways in the next release." The alternative is the non-static foldToASCII(char[] input, int length) method (which internally calls the same static method), but using it requires setting up an ASCII folding filter, a token filter, a token stream, and an analyzer (this requires choosing the kind of analyzer, and I might have to create a custom one). I couldn't find examples where developers have done the latter.

I tried writing some solutions of my own, but the non-static foldToASCII doesn't return the exact output; it attaches a list of unwanted characters at the end. I am wondering how various developers have dealt with this.

EDIT: I also see that some open source projects use the static foldToASCII, so another question would be whether it is really worth it to use the non-static foldToASCII.

Answer 1

Score: 1

Based on the comment by @andrewJames, below is the closest I was able to come up with without using the static method. KeywordTokenizer emits the entire input as a single token, so there is no need to loop through tokens.

import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.core.KeywordTokenizerFactory;
import org.apache.lucene.analysis.custom.CustomAnalyzer;
import org.apache.lucene.analysis.miscellaneous.ASCIIFoldingFilterFactory;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

String text = "Caffè";
String output = "";

try (Analyzer analyzer = CustomAnalyzer.builder()
        .withTokenizer(KeywordTokenizerFactory.class)
        .addTokenFilter(ASCIIFoldingFilterFactory.class)
        .build()) {
    try (TokenStream ts = analyzer.tokenStream(null, new StringReader(text))) {
        CharTermAttribute charTermAtt = ts.addAttribute(CharTermAttribute.class);
        ts.reset();
        // KeywordTokenizer emits the whole input as a single token.
        if (ts.incrementToken()) {
            output = charTermAtt.toString();
        }
        ts.end();
    }
} catch (IOException e) {
    // Don't swallow the exception silently.
    throw new UncheckedIOException(e);
}

System.out.println(output); // prints "Caffe"
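
A possible follow-up: Lucene Analyzer instances are reusable and thread-safe, so if folding is needed in many places, the setup can be built once and wrapped in a small helper. Below is a minimal sketch under that assumption; the class and method names (AsciiFolder, foldToAscii) and the rethrow-as-unchecked choice are mine, not from the original answer:

import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.core.KeywordTokenizerFactory;
import org.apache.lucene.analysis.custom.CustomAnalyzer;
import org.apache.lucene.analysis.miscellaneous.ASCIIFoldingFilterFactory;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public final class AsciiFolder {

    // Built once; Lucene analyzers are thread-safe and reusable.
    private static final Analyzer ANALYZER;

    static {
        try {
            ANALYZER = CustomAnalyzer.builder()
                    .withTokenizer(KeywordTokenizerFactory.class)
                    .addTokenFilter(ASCIIFoldingFilterFactory.class)
                    .build();
        } catch (IOException e) {
            throw new ExceptionInInitializerError(e);
        }
    }

    private AsciiFolder() {}

    /** Folds diacritics to their ASCII equivalents, e.g. "Caffè" -> "Caffe". */
    public static String foldToAscii(String text) {
        try (TokenStream ts = ANALYZER.tokenStream(null, new StringReader(text))) {
            CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
            ts.reset();
            // KeywordTokenizer yields at most one token for the whole input.
            String result = ts.incrementToken() ? term.toString() : "";
            ts.end();
            return result;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}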
