Normalizer not removing accents

Question

I'm using Normalizer followed by a regex to remove accents, but I'm getting back the same string with the accents still there.

import java.text.*

const val INPUT = "áéíóůø"
fun main() {
    println(Normalizer.normalize(INPUT, Normalizer.Form.NFC).replace("[\\p{InCombiningDiacriticalMarks}]+", ""))
    println(Normalizer.normalize(INPUT, Normalizer.Form.NFD).replace("[\\p{InCombiningDiacriticalMarks}]+", ""))
    println(Normalizer.normalize(INPUT, Normalizer.Form.NFKC).replace("[\\p{InCombiningDiacriticalMarks}]+", ""))
    println(Normalizer.normalize(INPUT, Normalizer.Form.NFKD).replace("[\\p{InCombiningDiacriticalMarks}]+", ""))
}

Output:

> áéíóůø
>
> áéíóůø
>
> áéíóůø
>
> áéíóůø

Kotlin playground: https://pl.kotl.in/62l6rUEUm

I've read a dozen questions here that say the way to strip accent marks in Java/Kotlin is to use java.text.Normalizer plus some minor variation of the above regular expression (sometimes without square brackets, sometimes without the plus). Even Apache Commons' stripAccents function uses Normalizer for its implementation (but apparently handles some special characters too).

What am I doing wrong?

Answer 1

Score: 3

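You did not make "[\\p{InCombiningDiacriticalMarks}]+" a Regex. Kotlin's String.replace(String, String) treats its first argument as a literal substring, so the pattern never matches anything in your input. Convert the pattern with toRegex() and apply it to the NFD-normalized string: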

println(
    Normalizer.normalize(INPUT, Normalizer.Form.NFD)
        .replace("[\\p{InCombiningDiacriticalMarks}]+".toRegex(), "")
)

This produces:

aeiouø
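
The difference between the two replace overloads can be seen in isolation with a minimal sketch. The helper names replaceLiteral and replaceRegex and the digit pattern are made up for illustration; they are not from the original post.

```kotlin
// Hypothetical helpers contrasting Kotlin's two String.replace overloads.

// String overload: "[0-9]+" is searched for as literal text.
fun replaceLiteral(s: String): String = s.replace("[0-9]+", "#")

// Regex overload: "[0-9]+" is compiled and matched as a pattern.
fun replaceRegex(s: String): String = s.replace("[0-9]+".toRegex(), "#")

fun main() {
    println(replaceLiteral("a1b22"))    // a1b22 (no literal "[0-9]+" in the input)
    println(replaceLiteral("x[0-9]+y")) // x#y   (the literal text was found)
    println(replaceRegex("a1b22"))      // a#b#  (each run of digits replaced)
}
```

This is the same thing that happens in the question: the accents are still present after NFD normalization because the literal text "[\\p{InCombiningDiacriticalMarks}]+" does not occur in the string.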

Notice that the stroke in ø is not a diacritical mark, so ø has no canonical decomposition. In particular, it does not decompose into either of the following:

  • "o" and U+0338 COMBINING LONG SOLIDUS OVERLAY
  • "o" and U+0337 COMBINING SHORT SOLIDUS OVERLAY

You can see that these three all look a bit different: o̸øo̷
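
If you do want ø folded to a plain "o", it has to be mapped explicitly, since normalization will not do it. The following is only a sketch under that assumption: the strokeFallbacks table and the stripAccentsWithFallback function are hypothetical names, and the table covers just ø and Ø.

```kotlin
import java.text.Normalizer

// Hypothetical fallback table for letters that NFD leaves intact.
// Only ø (U+00F8) and Ø (U+00D8) are listed; extend as needed.
val strokeFallbacks = mapOf('\u00F8' to "o", '\u00D8' to "O")

fun stripAccentsWithFallback(input: String): String {
    // Split precomposed letters into base letter + combining marks.
    val decomposed = Normalizer.normalize(input, Normalizer.Form.NFD)
    // Drop the combining marks from the basic block.
    val withoutMarks = decomposed.replace("\\p{InCombiningDiacriticalMarks}+".toRegex(), "")
    // Replace the remaining letters that have no decomposition.
    return withoutMarks.map { strokeFallbacks[it] ?: it.toString() }.joinToString("")
}

fun main() {
    // á (U+00E1) decomposes into two code units, ø (U+00F8) stays as one.
    println(Normalizer.normalize("\u00E1", Normalizer.Form.NFD).length) // 2
    println(Normalizer.normalize("\u00F8", Normalizer.Form.NFD).length) // 1
    println(stripAccentsWithFallback("\u00E1\u00E9\u00ED\u00F3\u016F\u00F8")) // aeiouo
}
```

Whether folding ø to "o" is acceptable depends on the language and the use case, so treat the table as a deliberate choice rather than a general rule.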

Also notice that Unicode has other blocks that contain combining diacritics, including "Combining Diacritical Marks Extended" and "Combining Diacritical Marks Supplement". Consider including those in your regex too.
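
One way to cover those extra blocks is to list their code-point ranges directly in the character class. This is a sketch, not the only option: the names combiningMarks and stripCombiningMarks are made up here, and explicit ranges are used instead of \p{In...} block names on the assumption that an older JDK may not recognize the names of newer blocks.

```kotlin
import java.text.Normalizer

// U+0300..U+036F  Combining Diacritical Marks
// U+1AB0..U+1AFF  Combining Diacritical Marks Extended
// U+1DC0..U+1DFF  Combining Diacritical Marks Supplement
val combiningMarks = Regex("[\\u0300-\\u036F\\u1AB0-\\u1AFF\\u1DC0-\\u1DFF]+")

// Decompose first, then remove every mark in the three ranges.
fun stripCombiningMarks(input: String): String =
    Normalizer.normalize(input, Normalizer.Form.NFD).replace(combiningMarks, "")

fun main() {
    println(stripCombiningMarks("\u00E1\u00E9\u00ED\u00F3\u016F")) // aeiou
    // U+1DC4 is in the Supplement block, which the basic block pattern would miss.
    println(stripCombiningMarks("a\u1DC4") == "a") // true
}
```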

  • Posted on February 16, 2023, 10:56:38
  • When reposting, please keep the link to this article: https://go.coder-hub.com/75467396.html