Convert accent characters to English using Java.

Question

I have a requirement to search for users whose names contain accented characters; the users may be from Iceland or Japan. The code I wrote works for some accented characters but not all.
Examples:

À - returns a. Correct.
Â - returns a. Correct.
Ð - returns Ð. This is wrong. It should return e.
Õ - returns Õ. This is wrong. It should return o.

Here is my code:

String accentConvertStr = StringUtils.stripAccents(myKey);

I also tried this:

byte[] b = key.getBytes("Cp1252");
System.out.println("" + new String(b, StandardCharsets.UTF_8));

Please advise.

Answer 1

Score: 0

I would say it works as expected. StringUtils.stripAccents essentially does the following:

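import java.text.Normalizer;

String[] chars = new String[]{"À", "Â", "Ð", "Õ"};

for (String c : chars) {
  // NFD splits each character into a base letter plus combining marks
  String normalized = Normalizer.normalize(c, Normalizer.Form.NFD);
  // Remove the combining marks, keeping only the base letter
  System.out.println(normalized.replaceAll("\\p{InCombiningDiacriticalMarks}+", ""));
}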

This will output:
A
A
Ð
O

The answer at https://stackoverflow.com/a/5697575/9671280 explains why Ð is left unchanged:

Be aware that that will not remove what you might think of as “accent” marks from all characters! There are many it will not do this for. For example, you cannot convert Đ to D or ø to o that way. For that, you need to reduce code points to those that match the same primary collation strength in the Unicode Collation Table.
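
If the goal is only to match names during a search, rather than to produce a converted string, comparing at primary collation strength is one way to act on the quoted advice. The following is a minimal sketch using java.text.Collator from the JDK; the class and method names are made up for illustration, and whether letters such as Ð or ø are treated as equal to a plain letter depends on the collation rules in use, so check it against your own data.

```java
import java.text.Collator;
import java.util.Locale;

public class AccentInsensitiveMatch {

    // Builds a collator that ignores case and accent differences.
    public static Collator primaryCollator() {
        Collator collator = Collator.getInstance(Locale.ROOT);
        // PRIMARY strength compares base letters only.
        collator.setStrength(Collator.PRIMARY);
        // Decompose precomposed characters before comparing them.
        collator.setDecomposition(Collator.CANONICAL_DECOMPOSITION);
        return collator;
    }

    // Hypothetical helper: true when both strings have the same base letters.
    public static boolean sameBaseLetters(String a, String b) {
        return primaryCollator().compare(a, b) == 0;
    }

    public static void main(String[] args) {
        // "\u00C0" is A with grave, "\u00D5" is O with tilde.
        System.out.println(sameBaseLetters("\u00C0", "a"));
        System.out.println(sameBaseLetters("\u00D5", "o"));
        System.out.println(sameBaseLetters("a", "b"));
    }
}
```

This compares strings without rewriting them, so it suits lookups but not cases where the stripped text itself has to be stored.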

If you still want to use StringUtils.stripAccents, you could handle characters such as Ð separately.
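
Handling those characters separately could look like the sketch below. It repeats the NFD-and-strip step shown earlier using only the JDK (so it runs without Commons Lang), then applies a small lookup table for letters that have no canonical decomposition. The class name, method name, and the table contents are assumptions for illustration; in particular, Ð is mapped to D here, so change the entry if your search needs a different replacement.

```java
import java.text.Normalizer;
import java.util.Map;

public class AccentFolder {

    // Hypothetical fallback table for letters that NFD cannot decompose.
    // Written with Unicode escapes so the source file stays pure ASCII.
    private static final Map<Character, String> FALLBACK = Map.of(
        '\u00D0', "D",   // capital eth
        '\u00F0', "d",   // small eth
        '\u0110', "D",   // capital d with stroke
        '\u0111', "d",   // small d with stroke
        '\u00D8', "O",   // capital o with slash
        '\u00F8', "o",   // small o with slash
        '\u00DE', "Th",  // capital thorn
        '\u00FE', "th",  // small thorn
        '\u00C6', "AE",  // capital ae ligature
        '\u00E6', "ae"   // small ae ligature
    );

    public static String fold(String input) {
        // Step 1: split accented letters into base letter + combining marks.
        String decomposed = Normalizer.normalize(input, Normalizer.Form.NFD);
        // Step 2: drop the combining marks.
        String stripped = decomposed.replaceAll("\\p{InCombiningDiacriticalMarks}+", "");
        // Step 3: replace the remaining special letters from the table.
        StringBuilder out = new StringBuilder(stripped.length());
        for (int i = 0; i < stripped.length(); i++) {
            char ch = stripped.charAt(i);
            String replacement = FALLBACK.get(ch);
            out.append(replacement != null ? replacement : String.valueOf(ch));
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // The four characters from the question, as Unicode escapes.
        for (String s : new String[]{"\u00C0", "\u00C2", "\u00D0", "\u00D5"}) {
            System.out.println(fold(s)); // prints A, A, D, O on separate lines
        }
    }
}
```

Map.of needs Java 9 or later. A table like this only covers the letters you list, which is the trade-off against a transliteration library.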

Alternatively, try https://github.com/xuender/unidecode; it seems to work for your case.

 String normalized = Unidecode.decode(input);

huangapple
  • Posted on August 14, 2020, 16:13:01
  • When reposting, please keep the link to this article: https://go.coder-hub.com/63409010.html