Conversion of Japanese "semi-voice" character
Question
I am trying to compare two Spark DataFrames that contain Japanese characters. Some characters look the same on screen but are actually different to the program, such as プ vs プ.
If you put them in utf-8 encoder:
プ utf-8 = \xE3\x83\x97
プ utf-8 = \xE3\x83\x95\xE3\x82\x9A
It looks like フ (\xE3\x83\x95) + the little circle semi-voiced sign (\xE3\x82\x9A) = プ
What is this difference called, and is there any way to convert between the two forms in Java/Scala?
Thank you.
Answer 1
Score: 3
プ, aka \xE3\x83\x97 (UTF-8), is \u30d7, aka 'KATAKANA LETTER PU' (U+30D7).
プ, aka \xE3\x83\x95\xE3\x82\x9A (UTF-8), is \u30d5\u309a, aka 'KATAKANA LETTER HU' (U+30D5) and 'COMBINING KATAKANA-HIRAGANA SEMI-VOICED SOUND MARK' (U+309A).
As you can see, the second is a base character followed by a combining character. This is similar to how diacritical marks (accent marks) work for Latin characters, e.g. ñ = n + ̃, aka \u00f1 = \u006e + \u0303.
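The two spellings are the composed and decomposed Unicode normalization forms of the same character. You can convert between the two forms using the java.text.Normalizer class: Normalizer.Form.NFC composes into the single code point, and Normalizer.Form.NFD decomposes into base plus combining mark. See: javadoc.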
See also: The Java™ Tutorials - Normalizing Text.
See also: Combining accent and character into one character in java 7
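As a rough illustration of the Normalizer approach described above, here is a minimal sketch. The class name KanaNormalizeDemo and the helper methods compose and decompose are made up for this example; only java.text.Normalizer and its Form constants come from the JDK.

```java
import java.text.Normalizer;

// Hypothetical demo class; the names here are illustrative, not from the original answer.
public class KanaNormalizeDemo {

    // NFC: combine a base character and its combining mark into one code point where possible.
    public static String compose(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFC);
    }

    // NFD: split a precomposed character into base character + combining mark.
    public static String decompose(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFD);
    }

    public static void main(String[] args) {
        String precomposed = "\u30d7";        // KATAKANA LETTER PU
        String decomposed = "\u30d5\u309a";   // KATAKANA LETTER HU + combining semi-voiced sound mark

        // The raw strings differ even though they render the same.
        System.out.println(precomposed.equals(decomposed));
        // After normalizing both sides to the same form, they compare equal.
        System.out.println(precomposed.equals(compose(decomposed)));
        System.out.println(decompose(precomposed).equals(decomposed));
        // isNormalized reports whether a string is already in the given form.
        System.out.println(Normalizer.isNormalized(decomposed, Normalizer.Form.NFC));
    }
}
```

For the DataFrame comparison in the question, one option (an assumption about your setup, not something tested here) would be to wrap a call like compose in a Spark UDF and apply it to the string columns on both sides before comparing, so both DataFrames use the same normalization form.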