Convert accented characters to English using Java
Question
I need to support searching with accented characters, for users who may be from Iceland or Japan. The code I wrote works for some accented characters, but not all.
Examples:
À - returns a. Correct.
Â - returns a. Correct.
Ð - returns Ð. This is wrong; it should return e.
Õ - returns Õ. This is wrong; it should return o.
Here is my code:
String accentConvertStr = StringUtils.stripAccents(myKey);
I also tried this:
byte[] b = key.getBytes("Cp1252");
System.out.println("" + new String(b, StandardCharsets.UTF_8));
Please advise.
Answer 1
Score: 0
I would say it works as expected. Under the hood, StringUtils.stripAccents does essentially the following:
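import java.text.Normalizer;

String[] chars = new String[]{"À", "Â", "Ð", "Õ"};
for (String c : chars) {
    // NFD splits each character into a base letter plus combining marks
    String normalized = Normalizer.normalize(c, Normalizer.Form.NFD);
    // Remove the combining marks, leaving only the base letter
    System.out.println(normalized.replaceAll("\\p{InCombiningDiacriticalMarks}+", ""));
}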
This will output:
A
A
Ð
O
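Ð is a letter in its own right with no canonical decomposition, so NFD leaves it unchanged and there is no combining mark to strip. The answer at https://stackoverflow.com/a/5697575/9671280 makes the same point: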
Be aware that that will not remove what you might think of as “accent” marks from all characters! There are many it will not do this for. For example, you cannot convert Đ to D or ø to o that way. For that, you need to reduce code points to those that match the same primary collation strength in the Unicode Collation Table.
If you still want to use StringUtils.stripAccents, you could handle such characters separately.
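One way to handle them separately is sketched below. This is a minimal sketch, not how StringUtils works: the class name AccentFolder, the method fold, and the contents of the EXTRA table are all hypothetical, and the ASCII targets in the table (for example Eth to "D") are my assumptions. Change them to whatever your search needs. The source is kept ASCII-only with \u escapes so it compiles regardless of file encoding. It needs Java 9+ for Map.of.

```java
import java.text.Normalizer;
import java.util.Map;

class AccentFolder {
    // Hypothetical, deliberately incomplete table of letters that NFD cannot
    // decompose. The replacement values are assumptions, not a standard.
    private static final Map<Character, String> EXTRA = Map.of(
            '\u00D0', "D",  '\u00F0', "d",   // Eth
            '\u0110', "D",  '\u0111', "d",   // D with stroke
            '\u00D8', "O",  '\u00F8', "o",   // O with stroke
            '\u00DE', "Th", '\u00FE', "th",  // Thorn
            '\u00C6', "AE", '\u00E6', "ae"); // AE ligature

    static String fold(String input) {
        // Step 1: decompose, then strip combining diacritical marks
        String stripped = Normalizer.normalize(input, Normalizer.Form.NFD)
                .replaceAll("\\p{InCombiningDiacriticalMarks}+", "");
        // Step 2: replace the letters that survived step 1 using the table
        StringBuilder out = new StringBuilder(stripped.length());
        for (int i = 0; i < stripped.length(); i++) {
            char ch = stripped.charAt(i);
            String mapped = EXTRA.get(ch);
            out.append(mapped != null ? mapped : String.valueOf(ch));
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // The four characters from the question: A-grave, A-circumflex, Eth, O-tilde
        System.out.println(fold("\u00C0\u00C2\u00D0\u00D5")); // prints AADO
    }
}
```

Characters that are neither decomposable nor in the table (Japanese text, for instance) pass through unchanged, so this only covers the Latin cases you list explicitly.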
Alternatively, try https://github.com/xuender/unidecode, which seems to work for your case.
String normalized = Unidecode.decode(input);