How to pad Strings with Unicode characters in Java

Question

I add right padding to a String to output it in a table format.

for (String[] tuple : testData) {
  System.out.format("%-32s -> %s\n", tuple[0], tuple[1]);
}

The result looks like this (random test data):

znZfmOEQ0Gb68taaNU6HY21lvo       -> Xq2aGqLedQnTSXg6wmBNDVb
frKweMCH8Kvgyk0J                 -> lHJ5r7YDV0jTL
NxtHP                            -> odvPJklwIzZZ
NX2scXjl5dxWmer                  -> wPDlKCKllVKk
x2HKsSHCqDQ                      -> RMuWLZ2vaP9sOF0yHmjVysJ
b0hryXKd6b80xAI                  -> 05MHjvTOxlxq1bvQ8RGe

This approach does not work when there are multi-byte Unicode characters:

0OZot🇨🇳ivbyG🧷hZM1FI👡wNhn6r6cC -> OKDxDV1o2NMqXH3VvE7q3uONwEcY5V
fBHRCjU4K8OCdzACmQZSn6WO         -> gvGBtUO5a4gPMKj9BKqBHFKx1iO7
cDUh🇲🇺b0cXkLWkS                -> SZX
WtP9t                            -> Q0wWOeY3W66mM5rcQQYKpG
va4d🍷u8SS                       -> KI
a71?⚖TZ💣🧜‍♀🕓ws5J              -> b8A

As you can see, the alignment is off.

My idea was to calculate the difference between the length of the String and the number of bytes used and use that to offset the padding, something like this:

int correction = tuple[0].getBytes().length - tuple[0].length();

And then instead of padding to 32 chars, I would pad to 32 + correction. However, this didn't work either.

Here is my test code (using [emoji-java](https://github.com/vdurmont/emoji-java), but the behaviour should be reproducible with any Unicode characters):

import java.util.Collection;
import org.apache.commons.lang3.RandomStringUtils;
import com.vdurmont.emoji.Emoji;
import com.vdurmont.emoji.EmojiManager;

public class Test {

  public static void main(String[] args) {
    // create random test data
    String[][] testData = new String[15][2];
    for (String[] tuple : testData) {
      tuple[0] = RandomStringUtils.randomAlphanumeric(2, 32);
      tuple[1] = RandomStringUtils.randomAlphanumeric(2, 32);
    }

    // add some emojis
    Collection<Emoji> all = EmojiManager.getAll();
    for (String[] tuple : testData) {
      for (int i = 1; i < tuple[0].length(); i++) {
        if (Math.random() > 0.90) {
          Emoji emoji = all.stream().skip((int) (all.size() * Math.random())).findFirst().get();
          tuple[0] = tuple[0].substring(0, i - 1) + emoji.getUnicode() + tuple[0].substring(i + 1);
        }
      }
    }

    // output
    for (String[] tuple : testData) {
      System.out.format("%-32s -> %s\n", tuple[0], tuple[1]);
    }
  }
}

Answer 1

Score: 2

There are actually a few issues here, other than that some fonts display the flag wider than the other characters. I assume that you want to count the Chinese flag as a single character (as it is drawn as a single element on the screen).

The String class reports an incorrect length

The String class works with chars, which are 16-bit UTF-16 code units rather than whole Unicode code points. Not all code points fit in 16 bits: only those from the Basic Multilingual Plane (BMP) fit in a single char, and every other code point is stored as a pair of chars (a surrogate pair). String's length() method returns the number of chars, not the number of code points.

Now String's codePointCount method may help in this case: it counts the number of code points in the given index range. So providing string.length() as second argument to the method returns the total count of code points.
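
To make the difference concrete, here is a minimal sketch comparing the three counts for one of the rows from the question. The class and method names (`StringLengths`, `formatWidth`) are made up for illustration. It assumes that `%-32s` pads according to `length()`, which is how the formatter behaves as far as I can tell, so the width has to grow by one for every surrogate pair. Subtracting `length()` from the UTF-8 byte count overshoots, because an emoji outside the BMP takes four bytes but only two chars.

```java
import java.nio.charset.StandardCharsets;

class StringLengths {
    /** Number of UTF-16 code units, i.e. what String.length() reports. */
    static int charCount(String s) {
        return s.length();
    }

    /** Number of Unicode code points in the whole string. */
    static int codePointCount(String s) {
        return s.codePointCount(0, s.length());
    }

    /** Number of bytes when the string is encoded as UTF-8. */
    static int utf8ByteCount(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    /** Hypothetical helper: format width that yields `columns` code points after padding. */
    static int formatWidth(String s, int columns) {
        return columns + (charCount(s) - codePointCount(s));
    }

    public static void main(String[] args) {
        String s = "va4d\uD83C\uDF77u8SS"; // "va4d" + U+1F377 (wine glass) + "u8SS"
        System.out.println(charCount(s));      // 10
        System.out.println(codePointCount(s)); // 9
        System.out.println(utf8ByteCount(s));  // 12
        System.out.format("%-" + formatWidth(s, 32) + "s -> %s%n", s, "KI");
    }
}
```

This only corrects for surrogate pairs. Flags and other multi-code-point sequences still need the handling described in the next section.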

Combining characters

However, there's another problem. The 🇨🇳 Chinese flag, for example, consists of two Unicode code points: the Regional Indicator Symbol Letters C (🇨, U+1F1E8) and N (🇳, U+1F1F3). Those two code points are combined into a flag of China. This is a problem you are not going to solve with the codePointCount method.

The Regional Indicator Symbol Letters seem to be a special case. Two of those characters can be combined into a national flag. I am not aware of a standard way to achieve what you want. You may have to take that into account manually.

I've written a small program to get the length of a string.

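// Requires java.util.regex.Matcher and java.util.regex.Pattern.
static int length(String str) {
    // First and last Regional Indicator Symbol Letters (U+1F1E6 and U+1F1FF) as surrogate pairs.
    String a = "\uD83C\uDDE6";
    String z = "\uD83C\uDDFF";

    // Two consecutive regional indicators are drawn as a single flag.
    Pattern p = Pattern.compile("[" + a + "-" + z + "]{2}");
    Matcher m = p.matcher(str);
    int count = 0;
    while (m.find()) {
        count++;
    }
    // Count each flag once instead of twice.
    return str.codePointCount(0, str.length()) - count;
}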
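
A length function like this can then drive the padding directly instead of going through `%-32s`. The following is a rough sketch of that idea, not a complete solution. `UnicodePadding`, `displayLength` and `padRight` are hypothetical names, and the flag detection is done by walking the code points rather than with a regex. It assumes every code point occupies one column and ignores zero-width joiners, variation selectors and fonts that draw emoji wider than a letter.

```java
class UnicodePadding {
    /** True for Regional Indicator Symbol Letters A to Z (U+1F1E6 to U+1F1FF). */
    static boolean isRegionalIndicator(int codePoint) {
        return codePoint >= 0x1F1E6 && codePoint <= 0x1F1FF;
    }

    /** Counts code points, treating two consecutive regional indicators as one flag. */
    static int displayLength(String str) {
        int[] cps = str.codePoints().toArray();
        int count = 0;
        for (int i = 0; i < cps.length; i++) {
            count++;
            if (isRegionalIndicator(cps[i])
                    && i + 1 < cps.length
                    && isRegionalIndicator(cps[i + 1])) {
                i++; // skip the second half of the flag
            }
        }
        return count;
    }

    /** Appends spaces until the string spans `width` display positions. */
    static String padRight(String str, int width) {
        StringBuilder sb = new StringBuilder(str);
        for (int i = displayLength(str); i < width; i++) {
            sb.append(' ');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String flagRow = "cDUh\uD83C\uDDF2\uD83C\uDDFAb0cXkLWkS"; // contains the flag of Mauritius
        System.out.println(padRight(flagRow, 32) + " -> SZX");
        System.out.println(padRight("WtP9t", 32) + " -> Q0wWOeY3W66mM5rcQQYKpG");
    }
}
```

Whether the columns line up on screen still depends on the terminal and font, as noted at the start of this answer.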

Answer 2

Score: 1

As discussed in the comments on the question linked by @Xehpuk, in this discussion on kotlinlang.org, and in this blog post by Daniel Lemire, the following seems to be correct:

> The problem is that the Java String class represents characters as
> UTF-16 characters. This means any Unicode character that is
> represented by more than 16 bits is saved as 2 separate char values.
> This fact is ignored by many of the functions within String, e.g.
> String.length does not return the number of Unicode characters, it
> returns the number of 16-bit characters within the String, some emoji
> counting for 2 characters.

The behaviour, however, seems to be implementation-specific.

As David mentions in his post, you could try the following to get the length in code points:

tuple[0].codePointCount(0, tuple[0].length())

See code point methods from Java SE docs
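
Since `codePointCount` still counts a flag as two characters, it may also be worth looking at `java.text.BreakIterator`, which iterates over user-perceived characters. The sketch below is only an illustration with made-up names (`GraphemeCount`, `graphemeCount`). How it treats flags and other emoji sequences depends on the JDK version, so that part is an assumption to verify on your own runtime.

```java
import java.text.BreakIterator;
import java.util.Locale;

class GraphemeCount {
    /** Counts character boundaries as reported by BreakIterator. */
    static int graphemeCount(String s) {
        BreakIterator it = BreakIterator.getCharacterInstance(Locale.ROOT);
        it.setText(s);
        int count = 0;
        while (it.next() != BreakIterator.DONE) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        System.out.println(graphemeCount("abc"));     // 3
        System.out.println(graphemeCount("e\u0301")); // 1: 'e' plus combining acute accent
        // Result for the flag below varies between JDK versions.
        System.out.println(graphemeCount("\uD83C\uDDE8\uD83C\uDDF3"));
    }
}
```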

huangapple
  • Published on October 17, 2020, 07:10:54
  • When reposting, please keep the link to this article: https://go.coder-hub.com/64397528.html