February 16, 2023, 10:56:38

Normalizer not removing accents

Question

I'm using Normalizer followed by a regex to remove accents, but I'm getting back the same string with the accents still there.

import java.text.*

const val INPUT = "áéíóůø"
fun main() {
    println(Normalizer.normalize(INPUT, Normalizer.Form.NFC).replace("[\\p{InCombiningDiacriticalMarks}]+", ""))
    println(Normalizer.normalize(INPUT, Normalizer.Form.NFD).replace("[\\p{InCombiningDiacriticalMarks}]+", ""))
    println(Normalizer.normalize(INPUT, Normalizer.Form.NFKC).replace("[\\p{InCombiningDiacriticalMarks}]+", ""))
    println(Normalizer.normalize(INPUT, Normalizer.Form.NFKD).replace("[\\p{InCombiningDiacriticalMarks}]+", ""))
}

Output:

> áéíóůø
>
> áéíóůø
>
> áéíóůø
>
> áéíóůø

Kotlin playground: https://pl.kotl.in/62l6rUEUm

I've read a dozen questions here that say the way to strip accent marks in Java/Kotlin is to use java.text.Normalizer plus some minor variation of the above regular expression (sometimes without square brackets, sometimes without the plus). Even Apache Commons' stripAccents function uses Normalizer for its implementation (but apparently handles some special characters too).

What am I doing wrong?

Answer 1

Score: 3

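You did not make "[\\p{InCombiningDiacriticalMarks}]+" a Regex. When replace is given a plain String as its first argument, Kotlin searches for that exact text instead of treating it as a pattern, so nothing matches and the input comes back unchanged. Convert the pattern with toRegex() and normalize to a decomposed form (NFD or NFKD):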

println(
    Normalizer.normalize(INPUT, Normalizer.Form.NFD)
        .replace("[\\p{InCombiningDiacriticalMarks}]+".toRegex(), "")
)

This produces:

aeiouø
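To see why the original call did nothing, the two replace overloads can be compared side by side. This is a minimal sketch; the names MARKS, literalReplace and regexReplace are made up for illustration and are not part of any library.

```kotlin
import java.text.Normalizer

// Pattern text shared by both calls (illustrative name).
const val MARKS = "[\\p{InCombiningDiacriticalMarks}]+"

// String overload: searches for the pattern text literally.
fun literalReplace(s: String): String = s.replace(MARKS, "")

// Regex overload: treats the text as a regular expression.
fun regexReplace(s: String): String = s.replace(MARKS.toRegex(), "")

fun main() {
    val decomposed = Normalizer.normalize("áéíóůø", Normalizer.Form.NFD)

    // The literal text "[\p{InCombiningDiacriticalMarks}]+" never occurs
    // in the input, so the string is returned unchanged.
    println(literalReplace(decomposed) == decomposed) // true

    // The regex matches the combining marks that NFD split off.
    println(regexReplace(decomposed)) // aeiouø
}
```

Only the NFD and NFKD forms separate base letters from their marks; NFC and NFKC recompose them, so the regex has nothing to remove after those two.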

Notice that the stroke in ø is not a diacritic mark, so normalization leaves ø as a single character. It decomposes to neither of these:

"o" and U+0338 COMBINING LONG SOLIDUS OVERLAY
"o" and U+0337 COMBINING SHORT SOLIDUS OVERLAY

You can see that these three all look a bit different: o̸øo̷
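The difference between an accented letter and ø can be checked by printing the code points before and after NFD. This is a small sketch; the helper name codePoints is an assumption made for the example.

```kotlin
import java.text.Normalizer

// Formats every code point of a string as U+XXXX (illustrative helper).
fun codePoints(s: String): String =
    s.codePoints().toArray().joinToString(" ") { "U+%04X".format(it) }

fun main() {
    for (ch in listOf("\u00E1", "\u00F8")) { // á and ø
        val nfd = Normalizer.normalize(ch, Normalizer.Form.NFD)
        println("${codePoints(ch)} -> ${codePoints(nfd)}")
    }
    // U+00E1 -> U+0061 U+0301   (á splits into a + combining acute)
    // U+00F8 -> U+00F8          (ø has no decomposition)
}
```

If ø should end up as a plain o, that has to be handled with an explicit replacement, since no normalization form will split it.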

Also notice that Unicode has further blocks that contain combining diacritics, including "Combining Diacritical Marks Extended" and "Combining Diacritical Marks Supplement". Consider including those in your regex too.
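One way to cover the additional blocks is to list their code point ranges explicitly, which avoids depending on which block names the running JDK's regex engine recognizes. This is a sketch under that assumption; COMBINING_MARKS and stripMarks are hypothetical names.

```kotlin
import java.text.Normalizer

// Explicit ranges for three blocks:
//   U+0300..U+036F  Combining Diacritical Marks
//   U+1AB0..U+1AFF  Combining Diacritical Marks Extended
//   U+1DC0..U+1DFF  Combining Diacritical Marks Supplement
val COMBINING_MARKS = Regex("[\\u0300-\\u036F\\u1AB0-\\u1AFF\\u1DC0-\\u1DFF]+")

// Decomposes the input, then drops every combining mark in the ranges above.
fun stripMarks(input: String): String =
    Normalizer.normalize(input, Normalizer.Form.NFD).replace(COMBINING_MARKS, "")

fun main() {
    println(stripMarks("áéíóůø"))  // aeiouø
    println(stripMarks("a\u1DC4")) // a (U+1DC4 is in the Supplement block)
}
```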
