How to pad Strings with Unicode characters in Java
Question
I add right padding to a String to output it in a table format.
for (String[] tuple : testData) {
    System.out.format("%-32s -> %s\n", tuple[0], tuple[1]);
}
The result looks like this (random test data):
znZfmOEQ0Gb68taaNU6HY21lvo -> Xq2aGqLedQnTSXg6wmBNDVb
frKweMCH8Kvgyk0J -> lHJ5r7YDV0jTL
NxtHP -> odvPJklwIzZZ
NX2scXjl5dxWmer -> wPDlKCKllVKk
x2HKsSHCqDQ -> RMuWLZ2vaP9sOF0yHmjVysJ
b0hryXKd6b80xAI -> 05MHjvTOxlxq1bvQ8RGe
This approach does not work when there are multi-byte Unicode characters:
0OZot🇨🇳ivbyG🧷hZM1FI👡wNhn6r6cC -> OKDxDV1o2NMqXH3VvE7q3uONwEcY5V
fBHRCjU4K8OCdzACmQZSn6WO -> gvGBtUO5a4gPMKj9BKqBHFKx1iO7
cDUh🇲🇺b0cXkLWkS -> SZX
WtP9t -> Q0wWOeY3W66mM5rcQQYKpG
va4d🍷u8SS -> KI
a71?⚖TZ💣🧜♀🕓ws5J -> b8A
As you can see, the alignment is off.
My idea was to calculate the difference between the length of the String and the number of bytes used and use that to offset the padding, something like this:
int correction = tuple[0].getBytes().length - tuple[0].length();
And then, instead of padding to 32 chars, I would pad to 32 + correction. However, this didn't work either.
Here is my test code (using emoji-java, but the behaviour should be reproducible with any Unicode characters):
import java.util.Collection;
import org.apache.commons.lang3.RandomStringUtils;
import com.vdurmont.emoji.Emoji;
import com.vdurmont.emoji.EmojiManager;

public class Test {
    public static void main(String[] args) {
        // create random test data
        String[][] testData = new String[15][2];
        for (String[] tuple : testData) {
            tuple[0] = RandomStringUtils.randomAlphanumeric(2, 32);
            tuple[1] = RandomStringUtils.randomAlphanumeric(2, 32);
        }
        // add some emojis
        Collection<Emoji> all = EmojiManager.getAll();
        for (String[] tuple : testData) {
            for (int i = 1; i < tuple[0].length(); i++) {
                if (Math.random() > 0.90) {
                    Emoji emoji = all.stream().skip((int) (all.size() * Math.random())).findFirst().get();
                    tuple[0] = tuple[0].substring(0, i - 1) + emoji.getUnicode() + tuple[0].substring(i + 1);
                }
            }
        }
        // output
        for (String[] tuple : testData) {
            System.out.format("%-32s -> %s\n", tuple[0], tuple[1]);
        }
    }
}
Answer 1
Score: 2
There are actually a few issues here, other than that some fonts display the flag wider than the other characters. I assume that you want to count the Chinese flag as a single character (as it is drawn as a single element on the screen).
The String class reports an incorrect length
The String class works with chars, which are 16-bit UTF-16 code units rather than whole Unicode code points. The problem is that not all code points fit in 16 bits: only code points from the Basic Multilingual Plane (BMP) fit in a single char, and every other code point is stored as a pair of chars (a surrogate pair). String's length() method returns the number of chars, not the number of code points.
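To make the difference concrete, here is a minimal sketch that compares the three counts for one of the strings from the question. The class and method names are made up for illustration. It also hints at why the bytes-minus-length correction did not help: the byte count depends on the encoding, while the formatter appears to pad by the number of chars.

```java
import java.nio.charset.StandardCharsets;

public class LengthDemo {
    // UTF-16 code units: this is what String.length() returns.
    static int utf16Units(String s) {
        return s.length();
    }

    // Bytes in the UTF-8 encoding (charset named explicitly, so the
    // result does not depend on the platform default).
    static int utf8Bytes(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    // Unicode code points in the whole string.
    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        // "va4d🍷u8SS" from the question; the emoji (U+1F377) is
        // written here as its surrogate pair.
        String s = "va4d\uD83C\uDF77u8SS";
        System.out.println(utf16Units(s)); // 10
        System.out.println(utf8Bytes(s));  // 12
        System.out.println(codePoints(s)); // 9
    }
}
```

For this string, bytes minus length is 2, but the string is only one char longer than its code point count, so a byte-based correction overshoots.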
Now String's codePointCount method may help in this case: it counts the number of code points in the given index range. So passing 0 as the first argument and string.length() as the second returns the total number of code points.
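One way to apply this to the question's format string is to widen the field by the number of "extra" chars, that is, length() minus the code point count. The following is only a sketch: the class name PaddedRow and the method row are hypothetical, and the width of 32 is taken from the question.

```java
public class PaddedRow {
    // Target column width, counted in code points (32 as in the question).
    static final int WIDTH = 32;

    // Formats one table row, widening the field by one char for every
    // code point that occupies two chars (a surrogate pair).
    static String row(String key, String value) {
        int codePoints = key.codePointCount(0, key.length());
        int correction = key.length() - codePoints;
        return String.format("%-" + (WIDTH + correction) + "s -> %s", key, value);
    }

    public static void main(String[] args) {
        System.out.println(row("WtP9t", "Q0wWOeY3W66mM5rcQQYKpG"));
        System.out.println(row("va4d\uD83C\uDF77u8SS", "KI"));
    }
}
```

With this correction every key is padded to 32 code points. Whether the columns then line up on screen still depends on how wide the terminal or font draws each emoji.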
Combining characters
However, there's another problem. The 🇨🇳 Chinese flag, for example, consists of two Unicode code points: the Regional Indicator Symbol Letters C (🇨, U+1F1E8) and N (🇳, U+1F1F3). Those two code points are combined into a flag of China. This is a problem you are not going to solve with the codePointCount method.
The Regional Indicator Symbol Letters seem to be a special case: two of those characters can be combined into a national flag. I am not aware of a standard way to achieve what you want. You may have to take that into account manually.
I've written a small program to get the length of a string.
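import java.util.regex.Matcher;
import java.util.regex.Pattern;

static int length(String str) {
    // Regional Indicator Symbol Letters A (U+1F1E6) and Z (U+1F1FF), as surrogate pairs
    String a = "\uD83C\uDDE6";
    String z = "\uD83C\uDDFF";
    // Two consecutive regional indicators are drawn as a single flag
    Pattern p = Pattern.compile("[" + a + "-" + z + "]{2}");
    Matcher m = p.matcher(str);
    int count = 0;
    while (m.find()) {
        count++;
    }
    // Each flag was counted as two code points, so subtract one per flag
    return str.codePointCount(0, str.length()) - count;
}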
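The same idea can also be written without a regular expression and combined with padding. The sketch below is an assumption-laden illustration, not a complete solution: the names FlagAwarePad, displayLength and padRight are hypothetical, and only pairs of regional indicators are merged. Other multi-code-point emoji (for example sequences joined with a zero-width joiner, or ones followed by a variation selector) are still counted per code point.

```java
public class FlagAwarePad {
    // Regional Indicator Symbol Letters occupy U+1F1E6..U+1F1FF.
    static boolean isRegionalIndicator(int codePoint) {
        return codePoint >= 0x1F1E6 && codePoint <= 0x1F1FF;
    }

    // Counts code points, treating each pair of consecutive regional
    // indicators (a flag) as a single unit.
    static int displayLength(String s) {
        int count = 0;
        int i = 0;
        while (i < s.length()) {
            int cp = s.codePointAt(i);
            i += Character.charCount(cp);
            if (isRegionalIndicator(cp) && i < s.length()) {
                int next = s.codePointAt(i);
                if (isRegionalIndicator(next)) {
                    // Consume the second half of the flag as well.
                    i += Character.charCount(next);
                }
            }
            count++;
        }
        return count;
    }

    // Appends spaces until the string reaches the requested display length.
    static String padRight(String s, int width) {
        StringBuilder sb = new StringBuilder(s);
        for (int n = displayLength(s); n < width; n++) {
            sb.append(' ');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // "cDUh🇲🇺b0cXkLWkS" from the question, flag written as surrogate pairs
        String key = "cDUh\uD83C\uDDF2\uD83C\uDDFAb0cXkLWkS";
        System.out.println(padRight(key, 32) + " -> SZX");
    }
}
```

Padding by hand with a StringBuilder avoids relying on how the formatter measures width, at the cost of a few more lines.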
Answer 2
Score: 1
As discussed in the comments on the question linked by @Xehpuk, in this discussion on kotlinlang.org, and in this blog post by Daniel Lemire, the following seems to be correct:
> The problem is that the Java String class represents characters as
> UTF-16 characters. This means any Unicode character that is
> represented by more than 16 bits is saved as 2 separate Char values.
> This fact is ignored by many of the functions within String, e.g.
> String.length does not return the number of Unicode characters, it
> returns the number of 16-bit characters within the String, some emoji
> counting for 2 characters.
The behaviour, however, seems to be implementation-specific.
As David mentions in his post, you could try the following to get the number of code points instead:
tuple[0].codePointCount(0, tuple[0].length())
See the code point methods in the Java SE docs.