PowerShell: Split a string into an array of sub-strings such that none, when encoded to UTF-8, has a byte length > n

Question

How can I split a string into an array of sub-strings such that no sub-string, when encoded to UTF-8 with a trailing null, has a byte length > n?

n is always >= 5, so a 4-byte encoded character + null will fit.

The only ways I can think of at the moment are:

  • Split, encode, test the length, and if too big repeat with a smaller split (a rough sketch of this idea follows below).
  • Encode, walk, mask and count the bytes keeping track of boundaries, and split manually.

Is there a better way?

The output array could also be an array of UTF-8 bytes rather than strings, if that is easier.

I already know about codepoints, encodings, surrogates and all that. This is specifically about the byte length of the UTF-8 encoding.
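
For illustration, a rough sketch of the first idea (the sample string and variable names are placeholders): measure each candidate substring with [System.Text.Encoding]::UTF8.GetByteCount and shrink it until the encoding plus the trailing null fits, taking care not to cut through a surrogate pair:

# illustrative sketch of idea 1: split, encode, test, shrink
# assumes $n -ge 5 (as stated above), so a single character or surrogate pair always fits
$str = 'abc€def👍'  # placeholder sample input
$n = 5              # max byte count, including the trailing null
$parts = @()
$start = 0
while ($start -lt $str.Length) {
  $len = [Math]::Min($str.Length - $start, $n - 1)  # optimistic guess: 1 byte per char
  while (([System.Text.Encoding]::UTF8.GetByteCount($str.Substring($start, $len)) + 1 -gt $n) -or
         ($len -gt 0 -and [char]::IsHighSurrogate($str[$start + $len - 1]))) {
    $len--  # too many bytes, or the cut would split a surrogate pair: shrink
  }
  $parts += $str.Substring($start, $len)
  $start += $len
}
$parts  # abc, €d, ef, 👍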

Answer 1

Score: 4

Note:

  • This answer creates a list of (.NET) substrings of the input string, by partitioning the input string into substrings whose UTF-8 byte representation doesn't exceed a given byte count.
  • To create a list of these UTF-8 byte representations instead, i.e. byte arrays, see mclayton's helpful answer.

The following:

  • iterates over the input string character by character or surrogate pair by surrogate pair

    • Note that .NET [char] instances, which .NET [string] instances are composed of, are unsigned 16-bit Unicode code units, which can therefore only directly encode Unicode characters with code points up to U+FFFF (those in the so-called BMP, the Basic Multilingual Plane); a (surrogate) pair of code units is required to encode characters outside that plane, i.e. those with code points greater than U+FFFF, a range that notably includes emoji, such as 👍.
  • extracts substrings based on their UTF-8 byte-sequence length, inferred from each single character's Unicode code point / whether a character and the subsequent one form a surrogate pair (which implies a code point greater than U+FFFF, which in turn implies 4 bytes), based on the code point ↔ byte count table in the UTF-8 Wikipedia article.
    (Tip of the hat to mclayton for pointing out that keeping track of indices in the original string, combined with .Substring() calls, is sufficient to extract chunks - no need for a string builder.)

  • uses a System.Collections.Generic.List`1 instance to collect all chunks in a list.

# Sample string
$str = '1234abc€défg👍'

$maxChunkLen = 4 # the max. *byte* count, excluding the trailing NUL

$chunks = [System.Collections.Generic.List[string]]::new()
$chunkByteCount = 0
$chunkStartNdx = 0
for ($i = 0; $i -lt $str.Length; ++$i) {
  $codePoint = [int] $str[$i]
  $isSurrogatePair = [char]::IsSurrogatePair($str, $i)
  # Note: A surrogate pair encoded as UTF-8 is a *single* non-BMP
  #       character encoded in 4 bytes, not 2 3-byte *surrogates*.
  $thisByteCount = 
    if ($isSurrogatePair) { 4 } 
    elseif ($codePoint -ge 0x800) { 3 } 
    elseif ($codePoint -ge 0x80) { 2 } 
    else { 1 }
  if ($chunkByteCount + $thisByteCount -gt $maxChunkLen) {
    # Including this char. / surrogate pair would make the chunk too long.
    # Add the current chunk plus a trailing NUL to the list...
    $chunks.Add($str.Substring($chunkStartNdx, ($i - $chunkStartNdx)) + "`0")
    # ... and start a new chunk with this char. / surrogate pair.
    $chunkStartNdx = $i
    $chunkByteCount = $thisByteCount
  }
  else {
    # Still fits into the current chunk.
    $chunkByteCount += $thisByteCount
  }
  if ($isSurrogatePair) { ++$i }
}
# Add a final chunk to the list, if present.
if ($chunkStartNdx -lt $str.Length) { $chunks.Add($str.Substring($chunkStartNdx) + "`0") }

# Output the resulting chunks
$chunks

Output (annotated):

1234 # 1 + 1 + 1 + 1 bytes (+ NUL)
abc  # 1 + 1 + 1 bytes (+ NUL)
€d   # 3 + 1 bytes (+ NUL)
éfg  # 2 + 1 + 1 bytes (+ NUL)
👍   # 4 bytes (+ NUL)

Note that all chunks except the first one have fewer than 4 characters (the byte-count limit, excluding the NUL), due to containing multi-byte-as-UTF-8 characters and/or due to the next character being such a character and therefore not fitting into the current chunk.
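
As a quick sanity check, one can confirm with [System.Text.Encoding]::UTF8.GetByteCount that each chunk, including its trailing NUL, stays within $maxChunkLen + 1 bytes:

$chunks | ForEach-Object { [System.Text.Encoding]::UTF8.GetByteCount($_) }
# 5, 4, 5, 5, 5 - each at most $maxChunkLen (4) + 1 for the NUL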

Answer 2

Score: 4

Ok so first off, everything I know about the binary encoding process for UTF-8 I learned just now from the UTF-8 Wikipedia page, so caveat emptor :-). That said, the code below seems to give the correct result for a bunch of test data compared to @mklement0's answer (which I have no doubt is correct), so maybe there's some mileage in it...

Happy to hear if there are any howlers in it though - I think it should work in principle even if the implementation below is wrong somewhere :-).

In any case, the core function is the one below, which returns the positions of the chunks in a utf-8 encoded byte array. I figured once you know the positions you might want to use the bytes in-place (e.g. with Stream.Write(byte[] buffer, int offset, int count), or similar) so I've avoided extracting a copy of the chunks into a new list, but it's pretty easy to use the output to do that if required (see further down).

# extracts the *positions* of chunks of bytes in a utf8 byte array
# such that no multi-byte codepoints are split across chunks, and
# all chunks are a maximum of $MaxLen bytes
#
# note - assumes $Utf8Bytes is a *valid* utf8 byte array, so may
# need error handling if you expect invalid data to be passed in.
function Get-UTF8ChunkPositions
{
    param( [byte[]] $Utf8Bytes, [int] $MaxLen )

    # from https://en.wikipedia.org/wiki/UTF-8
    #
    # Code point ↔ UTF-8 conversion
    # -----------------------------
    # First code point  Last code point  Byte 1    Byte 2    Byte 3    Byte 4    Code points
    # U+0000            U+007F           0xxxxxxx                                        128
    # U+0080            U+07FF           110xxxxx  10xxxxxx                             1920
    # U+0800            U+FFFF           1110xxxx  10xxxxxx  10xxxxxx                  61440
    # U+10000           U+10FFFF         11110xxx  10xxxxxx  10xxxxxx  10xxxxxx      1048576

    # stores the start position of each chunk
    $startPositions = [System.Collections.Generic.List[int]]::new();

    $i = 0;
    while( $i -lt $Utf8Bytes.Length )
    {

        # remember the start position for the current chunk
        $startPositions.Add($i);

        # jump past the end of the current chunk, optimistically assuming we won't land in
        # the middle of a multi-byte codepoint (but we'll deal with that in a minute)
        $i += $MaxLen;

        # if we've gone past the end of the array then we're done, as there are no more
        # chunks after the current one, so there are no more start positions to record
        if( $i -ge $Utf8Bytes.Length )
        {
            break;
        }

        # if we're in the middle of a multi-byte codepoint, backtrack until we're not.
        # we're then at the start of the *next* chunk, and the chunk length is definitely
        # smaller than $MaxLen
        #
        # 0xC0 = [Convert]::ToInt32("11000000", 2);
        # 0x80 = [Convert]::ToInt32("10000000", 2);
        while( ($Utf8Bytes[$i] -band 0xC0) -eq 0x80 )
        {
            $i -= 1;
        }

    }

    # we know all the start positions now, so turn them into ranges.
    # (we'll add a dummy item to help build the last range)
    $startPositions.Add($Utf8Bytes.Length);
    for( $i = 1; $i -lt $startPositions.Count; $i++ )
    {
        [PSCustomObject] @{
            "Start"  = $startPositions[$i-1];
            "Length" = $startPositions[$i] - $startPositions[$i-1];
        };
    }

}

This function takes advantage of the fact that a valid utf8 byte stream is self-synchronizing, which means in practice that we don't have to walk through every single byte in the stream - we can jump anywhere we like and find the start of an encoded codepoint by finding the nearest byte that starts 00xxxxxx, 01xxxxxx, or 11xxxxxx (or equivalently, doesn't start 10xxxxxx), and that's the start of the next chunk.

For example, if we jumped n bytes from the start of the current chunk and found a byte starting with 10xxxxxx:

      n-2       n-1        n        n+1
... 11110xxx  10xxxxxx  10xxxxxx  10xxxxxx ...
                        ^^^^^^^^

then we backtrack to n-2 as the start of the next chunk (so the end of the current chunk is logically one byte before that at n-3):

      n-2       n-1        n        n+1
... 11110xxx  10xxxxxx  10xxxxxx  10xxxxxx ...
    ^^^^^^^^
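
To make the backtracking rule concrete, here is a tiny standalone illustration (the helper name Get-CodePointStart is made up for this example), assuming valid UTF-8:

# illustrative only - finds the first byte of the codepoint containing $Index
function Get-CodePointStart
{
    param( [byte[]] $Utf8Bytes, [int] $Index )
    # continuation bytes look like 10xxxxxx, so back up while we're sitting on one
    while( ($Utf8Bytes[$Index] -band 0xC0) -eq 0x80 )
    {
        $Index -= 1;
    }
    return $Index;
}

$euro = [System.Text.Encoding]::UTF8.GetBytes('€'); # 226 130 172
Get-CodePointStart -Utf8Bytes $euro -Index 2;       # 0 - all three bytes belong to one codepoint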

Example:

# set up some test data
$str  = "1234abc€défg👍abüb";

$utf8 = [System.Text.Encoding]::UTF8.GetBytes($str);
"$utf8"
# 49 50 51 52 97 98 99 226 130 172 100 195 169 102 103 240 159 145 141 97 98 195 188 98

$maxLen = 5 - 1; # exclude NUL terminator
$positions = Get-UTF8ChunkPositions -Utf8Bytes $utf8 -MaxLen $maxLen;
$positions
# Start Length
# ----- ------
#     0      4
#     4      3
#     7      4
#    11      4
#    15      4
#    19      4
#    23      1
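
As a sketch of the in-place usage mentioned above (MemoryStream is just a stand-in for whatever stream you are writing to), each chunk plus its NUL terminator can be written straight from $utf8 with Stream.Write, without copying the chunks out first:

$stream = [System.IO.MemoryStream]::new();
foreach( $position in $positions )
{
    $stream.Write($utf8, $position.Start, $position.Length); # write the chunk bytes in place
    $stream.WriteByte(0);                                    # followed by the NUL terminator
}
$stream.Length
# 31 (24 payload bytes + 7 NUL terminators)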

Chunks

If you actually want the individual chunked byte arrays with a null terminator you can convert the positions into chunks like this:

$chunks = $positions | foreach-object {
    $chunk = new-object byte[] ($_.Length + 1); # allow room for NUL terminator
    [Array]::Copy($utf8, $_.Start, $chunk, 0, $_.Length);
    $chunk[$chunk.Length - 1] = 0; # utf-8 encoded NUL is 0
    @(, $chunk); # send the chunk to the pipeline
};
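
A quick check on the result (using the $chunks built above): the longest chunk should be $maxLen + 1 bytes, i.e. 4 payload bytes plus the NUL terminator:

($chunks | ForEach-Object { $_.Length } | Measure-Object -Maximum).Maximum
# 5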

Strings

Once you've got the individual null-terminated utf-8 encoded chunks, you can revert them back to the null-terminated strings like this if you really want them:

$strings = $chunks | foreach-object { [System.Text.Encoding]::UTF8.GetString($_) }
$strings
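
And a final sanity check (a sketch; it relies on the chunks being in order, which they are here): stripping the embedded NULs and joining the pieces should reproduce the original string:

(-join ($strings -replace "`0", '')) -eq $str
# True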

Performance

In terms of performance, this seems to scale reasonably well, even if you have to specifically encode the string to a utf8 byte array just to call it. It also performs a whole lot faster with large numbers for $MaxLen as it gets to jump longer distances for each chunk...

A rough test on my machine:

$sample = "1234abc€défg👍abüb" * 1000000;
$sample.Length / 1mb;
# 17

Measure-Command {
    $utf8 = [System.Text.Encoding]::UTF8.GetBytes($sample);
    $maxlen = 1024;
    $positions = Get-UTF8ChunkPositions $utf8 $maxlen;
    $chunks = $positions | foreach-object {
        $chunk = [byte[]]::new($_.Length + 1); # allow room for NUL terminator
        [Array]::Copy($utf8, $_.Start, $chunk, 0, $_.Length);
        $chunk[$chunk.Length - 1] = 0; # utf-8 encoded NUL is 0
        @(, $chunk); # send the chunk to the pipeline
    };
};

# TotalMilliseconds : 7986.131

Although if performance is your main concern then maybe PowerShell isn't the right choice for you :-)...
