2020-10-10 06:40:23

Conversion of Japanese "semi-voice" character

Question

I am trying to compare two Spark DataFrames that contain Japanese characters. Some characters look identical on screen but are different to the program, such as プ vs プ.

If you put them through a UTF-8 encoder:

プ utf-8 = \xE3\x83\x97

プ utf-8 = \xE3\x83\x95\xE3\x82\x9A

It looks like フ (\xE3\x83\x95) + the little circle semi-voiced sign (\xE3\x82\x9A) = プ

What is this difference called, and is there a way to convert between the two forms in Java/Scala?

Thank you.

Answer 1

Score: 3

プ aka \xE3\x83\x97 (UTF-8) is \u30d7 aka 'KATAKANA LETTER PU' (U+30D7).

プ aka \xE3\x83\x95\xE3\x82\x9A (UTF-8) is \u30d5\u309a aka 'KATAKANA LETTER HU' (U+30D5) and 'COMBINING KATAKANA-HIRAGANA SEMI-VOICED SOUND MARK' (U+309A).

As you can see, the second is a base character followed by a combining character. This is similar to how diacritical marks (accent marks) work for Latin characters, e.g. ñ = n + ̃, i.e. \u00f1 = \u006e + \u0303.
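To see the two forms side by side, here is a minimal sketch that prints the code points and UTF-8 bytes of each string. The class name `KanaInspect` and its helper methods are made up for illustration; only standard JDK calls are used.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper class for illustration only.
final class KanaInspect {

    // Lists the code points of a string as space-separated U+XXXX values.
    static String codePoints(String s) {
        StringBuilder sb = new StringBuilder();
        s.codePoints().forEach(cp -> {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("U+%04X", cp));
        });
        return sb.toString();
    }

    // Dumps the UTF-8 encoding of a string in the \xNN style used in the question.
    static String utf8Hex(String s) {
        StringBuilder sb = new StringBuilder();
        for (byte b : s.getBytes(StandardCharsets.UTF_8)) {
            sb.append(String.format("\\x%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String composed = "\u30D7";         // single code point
        String decomposed = "\u30D5\u309A"; // base character + combining mark

        System.out.println(codePoints(composed) + "  " + utf8Hex(composed));
        System.out.println(codePoints(decomposed) + "  " + utf8Hex(decomposed));

        // String.equals compares UTF-16 code units, so the two forms are not equal.
        System.out.println(composed.equals(decomposed));
    }
}
```

Because `String.equals` compares code units rather than rendered glyphs, the two strings compare as different even though they display the same.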

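The two forms are called precomposed and decomposed, and converting between them is Unicode normalization. You can do this with the java.text.Normalizer class: Normalizer.Form.NFC composes to the single code point, and Normalizer.Form.NFD decomposes into base character plus combining mark. See: javadoc.

See also: The Java™ Tutorials - Normalizing Text.
See also: Combining accent and character into one character (Java 7).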
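A minimal sketch of the Normalizer approach follows. The class name `KanaNormalize` and its method names are hypothetical; `Normalizer.normalize` and `Normalizer.Form` are the standard `java.text` API.

```java
import java.text.Normalizer;

// Hypothetical helper class for illustration only.
final class KanaNormalize {

    // NFC: canonical decomposition followed by composition (フ + mark -> プ).
    static String toComposed(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFC);
    }

    // NFD: canonical decomposition (プ -> フ + mark).
    static String toDecomposed(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFD);
    }

    // Compares two strings after bringing both to the same normal form.
    static boolean equalsNormalized(String a, String b) {
        return toComposed(a).equals(toComposed(b));
    }

    public static void main(String[] args) {
        String composed = "\u30D7";         // KATAKANA LETTER PU
        String decomposed = "\u30D5\u309A"; // KATAKANA LETTER HU + combining mark

        System.out.println(composed.equals(decomposed));              // false
        System.out.println(toComposed(decomposed).equals(composed));  // true
        System.out.println(toDecomposed(composed).equals(decomposed)); // true
        System.out.println(equalsNormalized(composed, decomposed));   // true
    }
}
```

For the DataFrame comparison in the question, one option (an assumption about the setup, not something stated in the answer) would be to apply the same normalization to the string columns on both sides, for example through a user-defined function, before comparing them.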
