Replace Unicode Characters in a String

Question

I need to replace characters that carry diacritics (e.g. ä, ó) with their 'base' characters. For most characters, this works:

StringUtils.stripAccents(tmpStr);

but this misses four characters: æ, œ, ø, and ß.

I looked at the answers here: https://stackoverflow.com/questions/3322152/is-there-a-way-to-get-rid-of-accents-and-convert-a-whole-string-to-regular-lette. I expected the first solution to work, but it does not handle these characters either.

How can I replace these characters with their 'base' character (e.g. replace æ with a)?

Answer 1

Score: 2

The source of StringUtils.stripAccents (https://commons.apache.org/proper/commons-lang/apidocs/src-html/org/apache/commons/lang3/StringUtils.html) reads:

public static String stripAccents(final String input) {
    if (input == null) {
        return null;
    }
    final StringBuilder decomposed = new StringBuilder(Normalizer.normalize(input, Normalizer.Form.NFD));
    convertRemainingAccentCharacters(decomposed);

    // Note that this doesn't correctly remove ligatures...

    return STRIP_ACCENTS_PATTERN.matcher(decomposed).replaceAll(EMPTY);
}

It has a comment that says,
// Note that this doesn't correctly remove ligatures...
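
To see why these four slip through, here is a small check, offered as a sketch rather than anything taken from Commons Lang. The class and method names (LigatureCheck, survivesNormalization) are made up for illustration. Normalization can only split a character that has a Unicode decomposition, and æ, œ, ø and ß have none, so there is no combining mark left to strip.

```java
import java.text.Normalizer;

public class LigatureCheck {
    // Hypothetical helper: true when NFKD normalization leaves the string
    // unchanged, i.e. it cannot be split into a base letter plus marks.
    public static boolean survivesNormalization(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFKD).equals(s);
    }

    public static void main(String[] args) {
        String[] samples = {
            "\u00E4", // ä, decomposes to a + combining diaeresis
            "\u00F3", // ó, decomposes to o + combining acute
            "\u00E6", // æ, no decomposition
            "\u0153", // œ, no decomposition
            "\u00F8", // ø, no decomposition
            "\u00DF"  // ß, no decomposition
        };
        for (String s : samples) {
            System.out.println(s + " unchanged by NFKD: " + survivesNormalization(s));
        }
    }
}
```

The first two samples report false and the last four report true, which matches the behaviour described in the question.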

So you may need to replace those characters manually, with something like:

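    import java.text.Normalizer;

    // Decompose accented characters, then strip the combining marks.
    String string = Normalizer.normalize("Tĥïŝ ĩš â fůňķŷ ß æ œ ø Šťŕĭńġ", Normalizer.Form.NFKD);
    string = string.replaceAll("\\p{M}", "");

    // These four have no Unicode decomposition, so map them by hand.
    string = string.replace("ß", "s");
    string = string.replace("ø", "o");
    string = string.replace("œ", "o");
    string = string.replace("æ", "a");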
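
If this is needed in more than one place, the same idea could be wrapped in a helper. The following is only a sketch: the class name AccentFolder, the method fold, and the table NO_DECOMPOSITION are hypothetical, and the upper-case entries (Æ, Œ, Ø) are an assumption added here, since the question only mentions the lower-case forms. It needs Java 9 or later for Map.of.

```java
import java.text.Normalizer;
import java.util.Map;

public class AccentFolder {
    // Hand-written table (an assumption, extend as needed) for characters
    // that have no Unicode decomposition and so survive normalization.
    private static final Map<Character, String> NO_DECOMPOSITION = Map.of(
        '\u00DF', "s",  // ß
        '\u00F8', "o",  // ø
        '\u00D8', "O",  // Ø
        '\u0153', "o",  // œ
        '\u0152', "O",  // Œ
        '\u00E6', "a",  // æ
        '\u00C6', "A"   // Æ
    );

    public static String fold(String input) {
        if (input == null) {
            return null;
        }
        // Step 1: decompose, then drop every combining mark.
        String stripped = Normalizer.normalize(input, Normalizer.Form.NFKD)
                .replaceAll("\\p{M}", "");
        // Step 2: map the leftovers from the table, copy everything else.
        StringBuilder out = new StringBuilder(stripped.length());
        for (int i = 0; i < stripped.length(); i++) {
            char c = stripped.charAt(i);
            String mapped = NO_DECOMPOSITION.get(c);
            out.append(mapped != null ? mapped : String.valueOf(c));
        }
        return out.toString();
    }
}
```

With this table, AccentFolder.fold("æ œ ø ß") returns "a o o s". Mapping ß to "s" follows the answer above; "ss" is the more common transliteration if a single base character is not required.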

For more characters, see this diacritical-character-to-ASCII mapping table:
https://docs.oracle.com/cd/E29584_01/webhelp/mdex_basicDev/src/rbdv_chars_mapping.html

huangapple
  • Posted on September 25, 2020, 00:27:39