Replace Unicode Characters in a String
Question
I need to replace diacritic characters (e.g. ä, ó, etc.) with their 'base' character. For most of the characters, this solution works:
StringUtils.stripAccents(tmpStr);
but this misses four characters: æ, œ, ø, and ß.
I took a look at the solution here: https://stackoverflow.com/questions/3322152/is-there-a-way-to-get-rid-of-accents-and-convert-a-whole-string-to-regular-lette. I figured the first solution would work, but it does not.
How can I replace these characters with their 'base' character (e.g. replace æ with a)?
Answer 1
Score: 2
The source code says (https://commons.apache.org/proper/commons-lang/apidocs/src-html/org/apache/commons/lang3/StringUtils.html),
public static String stripAccents(final String input) {
    if (input == null) {
        return null;
    }
    final StringBuilder decomposed = new StringBuilder(Normalizer.normalize(input, Normalizer.Form.NFD));
    convertRemainingAccentCharacters(decomposed);
    // Note that this doesn't correctly remove ligatures...
    return STRIP_ACCENTS_PATTERN.matcher(decomposed).replaceAll(EMPTY);
}
It has a comment that says,
// Note that this doesn't correctly remove ligatures...
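To see why these four characters survive, it may help to check what normalization does to them. The sketch below uses only java.text.Normalizer; the class name DecompositionCheck and the helper unchangedByNfd are made up for illustration. Unicode defines no decomposition for æ, œ, ø, or ß, so NFD returns them unchanged and there is no combining mark left to strip.

```java
import java.text.Normalizer;

public class DecompositionCheck {
    // Returns true when NFD normalization leaves the text unchanged,
    // meaning there is no combining mark that could be stripped afterwards.
    static boolean unchangedByNfd(String text) {
        return Normalizer.normalize(text, Normalizer.Form.NFD).equals(text);
    }

    public static void main(String[] args) {
        // Precomposed characters written as escapes to avoid source-encoding issues:
        // ä, ó, æ, œ, ø, ß
        String[] samples = {"\u00E4", "\u00F3", "\u00E6", "\u0153", "\u00F8", "\u00DF"};
        for (String s : samples) {
            System.out.println(s + " unchanged by NFD: " + unchangedByNfd(s));
        }
    }
}
```

The accented letters ä and ó split into a base letter plus a combining mark, while the other four come back as they went in.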
So you may need to replace those characters manually.
Something like:
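import java.text.Normalizer;

// NFKD splits accented letters into a base letter plus combining marks
String string = Normalizer.normalize("Tĥïŝ ĩš â fůňķŷ ß æ œ ø Šťŕĭńġ", Normalizer.Form.NFKD);
// Remove the combining marks
string = string.replaceAll("\\p{M}", "");
// These characters have no decomposition, so replace them explicitly
string = string.replace("ß", "s");
string = string.replace("ø", "o");
string = string.replace("œ", "o");
string = string.replace("æ", "a");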
Diacritical Character to ASCII Character Mapping
https://docs.oracle.com/cd/E29584_01/webhelp/mdex_basicDev/src/rbdv_chars_mapping.html
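The normalize-then-replace approach could be wrapped in a reusable method along these lines. This is only a sketch: the class BaseCharacters, the method toBase, and the table NO_DECOMPOSITION are hypothetical names, the uppercase entries are an addition not in the snippet above, and the single-letter replacements follow the question's request (æ becomes a) rather than the more common ae, oe, and ss. Map.of requires Java 9 or later.

```java
import java.text.Normalizer;
import java.util.Map;

public class BaseCharacters {
    // Hypothetical table for characters that normalization cannot decompose.
    // Extend it as needed, for example from a published mapping table.
    private static final Map<Character, String> NO_DECOMPOSITION = Map.of(
        '\u00DF', "s",  // ß
        '\u00F8', "o",  // ø
        '\u00D8', "O",  // Ø (assumed uppercase counterpart)
        '\u0153', "o",  // œ
        '\u0152', "O",  // Œ (assumed uppercase counterpart)
        '\u00E6', "a",  // æ
        '\u00C6', "A"   // Æ (assumed uppercase counterpart)
    );

    static String toBase(String input) {
        if (input == null) {
            return null;
        }
        // Split accented letters into base letter + combining marks, then drop the marks.
        String stripped = Normalizer.normalize(input, Normalizer.Form.NFKD)
                                    .replaceAll("\\p{M}", "");
        // Replace whatever is left over using the explicit table.
        StringBuilder out = new StringBuilder(stripped.length());
        for (int i = 0; i < stripped.length(); i++) {
            char c = stripped.charAt(i);
            String replacement = NO_DECOMPOSITION.get(c);
            if (replacement != null) {
                out.append(replacement);
            } else {
                out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(toBase("Tĥïŝ ĩš â fůňķŷ ß æ œ ø Šťŕĭńġ"));
    }
}
```

Keeping the exceptions in one table means a new special case is a one-line change instead of another replace call.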