Conversion of Japanese "semi-voiced" characters

Question

I am trying to compare two Spark DataFrames that contain Japanese characters. Some characters look the same on screen but are different to the program, such as プ vs プ.

If you run them through a UTF-8 encoder:

プ utf-8 = \xE3\x83\x97

プ utf-8 = \xE3\x83\x95\xE3\x82\x9A

It looks like フ (\xE3\x83\x95) + the small circle semi-voiced sign (\xE3\x82\x9A) = プ

What are these differences called, and is there any way to convert between them in Java/Scala?

Thank you.

Answer 1

Score: 3

プ aka \xE3\x83\x97 (UTF-8) is \u30d7 aka 'KATAKANA LETTER PU' (U+30D7).

プ aka \xE3\x83\x95\xE3\x82\x9A (UTF-8) is \u30d5\u309a aka 'KATAKANA LETTER HU' (U+30D5) and 'COMBINING KATAKANA-HIRAGANA SEMI-VOICED SOUND MARK' (U+309A).

As you can see, the second is a base character followed by a combining character. This is similar to how diacritical marks (accent marks) work for Latin characters, e.g. ñ = n + ̃ aka \u00f1 = \u006e + \u0303.
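To see the difference for yourself, here is a minimal sketch that prints the code points and UTF-8 bytes of both forms. The class name `KanaCodePoints` and its helper methods are made up for illustration; only standard JDK calls are used.

```java
import java.nio.charset.StandardCharsets;

public class KanaCodePoints {
    // Formats each code point of s as U+XXXX, separated by spaces.
    public static String codePoints(String s) {
        StringBuilder sb = new StringBuilder();
        s.codePoints().forEach(cp -> {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("U+%04X", cp));
        });
        return sb.toString();
    }

    // Formats the UTF-8 bytes of s as backslash-x hex escapes.
    public static String utf8Hex(String s) {
        StringBuilder sb = new StringBuilder();
        for (byte b : s.getBytes(StandardCharsets.UTF_8)) {
            sb.append(String.format("\\x%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String precomposed = "\u30d7";       // KATAKANA LETTER PU
        String decomposed = "\u30d5\u309a";  // KATAKANA LETTER HU + combining mark
        System.out.println(codePoints(precomposed) + "  " + utf8Hex(precomposed));
        System.out.println(codePoints(decomposed) + "  " + utf8Hex(decomposed));
        // Plain String.equals compares code units, so the two forms differ.
        System.out.println(precomposed.equals(decomposed));
    }
}
```

The two strings render the same, but `equals` returns false because one is a single code point and the other is two.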

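These differences are Unicode normalization forms: the two strings are canonically equivalent but encoded differently. You can convert between the two forms using the java.text.Normalizer class, where NFC composes the pair into the single code point and NFD decomposes it again. See: javadoc.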
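As a rough sketch of that conversion, the following uses `java.text.Normalizer` with the NFC and NFD forms. The class name `KanaNormalize` and the wrappers `toNfc` and `toNfd` are hypothetical helpers, not part of the JDK.

```java
import java.text.Normalizer;

public class KanaNormalize {
    // NFC: composes base + combining mark into one code point where one exists.
    public static String toNfc(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFC);
    }

    // NFD: decomposes precomposed characters into base + combining marks.
    public static String toNfd(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFD);
    }

    public static void main(String[] args) {
        String precomposed = "\u30d7";       // KATAKANA LETTER PU
        String decomposed = "\u30d5\u309a";  // KATAKANA LETTER HU + combining mark

        // After normalizing both sides to the same form, the strings compare equal.
        System.out.println(toNfc(decomposed).equals(precomposed));
        System.out.println(toNfd(precomposed).equals(decomposed));

        // isNormalized reports whether a string is already in the given form.
        System.out.println(Normalizer.isNormalized(decomposed, Normalizer.Form.NFC));
    }
}
```

For a comparison like the one in the question, one option would be to normalize both columns to the same form (NFC is a common choice) before comparing, for example by wrapping `toNfc` in a user-defined function.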

See also: The Java™ Tutorials - Normalizing Text.
See also: Combining accent and character into one character in java 7

huangapple
  • Published on 2020-10-10 06:40:23
  • When reposting, please keep the link to this article: https://go.coder-hub.com/64288171.html