Normalizer not removing accents
Question
I'm using Normalizer followed by a regex to remove accents, but I'm getting back the same string with the accents still there.
import java.text.*
const val INPUT = "áéíóůø"
fun main() {
println(Normalizer.normalize(INPUT, Normalizer.Form.NFC).replace("[\\p{InCombiningDiacriticalMarks}]+", ""))
println(Normalizer.normalize(INPUT, Normalizer.Form.NFD).replace("[\\p{InCombiningDiacriticalMarks}]+", ""))
println(Normalizer.normalize(INPUT, Normalizer.Form.NFKC).replace("[\\p{InCombiningDiacriticalMarks}]+", ""))
println(Normalizer.normalize(INPUT, Normalizer.Form.NFKD).replace("[\\p{InCombiningDiacriticalMarks}]+", ""))
}
Output:
> áéíóůø
>
> áéíóůø
>
> áéíóůø
>
> áéíóůø
Kotlin playground: https://pl.kotl.in/62l6rUEUm
I've read a dozen questions here that say the way to strip accent marks in Java/Kotlin is to use java.text.Normalizer plus some minor variation of the above regular expression (sometimes without square brackets, sometimes without the plus). Even Apache Commons' stripAccents function uses Normalizer for its implementation (but apparently handles some special characters too).
What am I doing wrong?
Answer 1
Score: 3
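You did not make "[\\p{InCombiningDiacriticalMarks}]+" a Regex. Kotlin's replace(String, String) treats its first argument as literal text, so the pattern is never interpreted as a regular expression and nothing matches. Convert it with toRegex():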
println(
Normalizer.normalize(INPUT, Normalizer.Form.NFD)
.replace("[\\p{InCombiningDiacriticalMarks}]+".toRegex(), "")
)
This produces:
aeiouø
Notice that the stroke in ø is not a diacritic mark, so the normalizer leaves it untouched. ø cannot be decomposed into either
- "o" and U+0338 COMBINING LONG SOLIDUS OVERLAY, or
- "o" and U+0337 COMBINING SHORT SOLIDUS OVERLAY.
You can see that these three all look a bit different: o̸øo̷
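One way to check this yourself is to print the code points that NFD produces. This is only a small sketch; codePoints is a hypothetical helper written for this illustration, not something from the question.

```kotlin
import java.text.Normalizer

// Hypothetical helper: lists the code points of a string as U+XXXX,
// so the result of a normalization can be inspected.
fun codePoints(s: String): List<String> =
    s.codePoints().toArray().map { "U+%04X".format(it) }

fun main() {
    // "á" decomposes into the base letter plus COMBINING ACUTE ACCENT.
    println(codePoints(Normalizer.normalize("á", Normalizer.Form.NFD))) // [U+0061, U+0301]
    // "ø" has no decomposition, so NFD returns the single code point unchanged.
    println(codePoints(Normalizer.normalize("ø", Normalizer.Form.NFD))) // [U+00F8]
}
```

Since no combining mark is ever produced for ø, there is nothing for the regex to strip.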
Also notice that there are two more blocks in Unicode that contain combining diacritics, called "Combining Diacritical Marks Extended" and "Combining Diacritical Marks Supplement". Consider including those in your regex too.
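A minimal sketch of a regex covering all three blocks could look like this. It assumes you are fine with spelling the blocks out as explicit code point ranges, which avoids relying on which \p{In...} block names a given JDK recognizes. COMBINING_MARKS and stripAccents are names made up for this example.

```kotlin
import java.text.Normalizer

// Explicit code point ranges of the three blocks:
//   U+0300..U+036F  Combining Diacritical Marks
//   U+1AB0..U+1AFF  Combining Diacritical Marks Extended
//   U+1DC0..U+1DFF  Combining Diacritical Marks Supplement
val COMBINING_MARKS = Regex("[\\u0300-\\u036F\\u1AB0-\\u1AFF\\u1DC0-\\u1DFF]+")

// Hypothetical helper: decompose with NFD, then drop every combining mark in those ranges.
fun stripAccents(input: String): String =
    Normalizer.normalize(input, Normalizer.Form.NFD).replace(COMBINING_MARKS, "")

fun main() {
    println(stripAccents("áéíóůø")) // aeiouø
}
```

Because the pattern is built with Regex(...), replace uses the regex overload here, not the literal-string one.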