PowerShell: Split a string into an array of sub-strings such that none when encoded to UTF-8 has a byte length > n

Question
How can I split a string into an array of sub-strings such that no sub-string, when encoded to UTF-8 with a trailing null, has a byte length > n?
n is always >= 5, so a 4 byte encoded character + null will fit.
The only ways I can think of at the moment are:
- Split, encode, test the length, and if too big repeat with a smaller split (a rough sketch of this appears at the end of the question).
- Encode, walk, mask and count the bytes keeping track of boundaries, and split manually.
Is there a better way?
The output array could also be an array of utf8 bytes rather than strings if that is easier.
I already know about codepoints, encodings, surrogates and all that. This is specifically about the byte length of the utf8 encoding.
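For reference, here is a rough sketch of the first approach (greedy split, encode, shrink the candidate until it fits). $maxLen is the byte budget excluding the trailing null, i.e. n - 1, and the shrink loop relies on n >= 5 so that a single character (at most 4 UTF-8 bytes) always fits:
# Rough sketch only: greedy split, encode, shrink until the chunk fits.
# $maxLen = n - 1, i.e. the byte budget excluding the trailing null.
$str    = '1234abc€défg👍'   # sample input
$maxLen = 4
$utf8   = [System.Text.Encoding]::UTF8
$chunks = [System.Collections.Generic.List[string]]::new()
$pos = 0
while ($pos -lt $str.Length) {
    # start with the largest candidate that could possibly fit ($maxLen chars)
    $len = [Math]::Min($maxLen, $str.Length - $pos)
    # shrink while the candidate encodes to too many bytes, or would end
    # in the middle of a surrogate pair
    while ($utf8.GetByteCount($str.Substring($pos, $len)) -gt $maxLen -or
           [char]::IsHighSurrogate($str[$pos + $len - 1])) {
        $len--
    }
    $chunks.Add($str.Substring($pos, $len) + "`0")
    $pos += $len
}
$chunks   # 1234, abc, €d, éfg, 👍 (each with a trailing null)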
Answer 1
Score: 4
Note:

- This answer creates a list of (.NET) substrings of the input string, by partitioning the input string into substrings whose UTF-8 byte representation doesn't exceed a given byte count.
- To create a list of these UTF-8 byte representations instead, i.e. byte arrays, see mclayton's helpful answer.

The following:

- iterates over the input string character by character or surrogate pair by surrogate pair.
  - Note that .NET [char] instances, which .NET [string] instances are composed of, are unsigned 16-bit Unicode code units, which can therefore only directly encode Unicode glyphs with code points up to U+FFFF (glyphs in the so-called BMP (Basic Multilingual Plane)), and require a (surrogate) pair of code units to encode Unicode characters outside that plane, i.e. those with code points greater than U+FFFF, a range of characters that notably includes emoji, such as 👍.
- extracts substrings based on their UTF-8 byte-sequence length, inferred from each single character's Unicode code point / whether a character and the subsequent one form a surrogate pair (which implies a code point greater than U+FFFF, which in turn implies 4 bytes), based on the code point to UTF-8 conversion table on Wikipedia.
  Tip of the hat to mclayton for pointing out that keeping track of indices in the original string combined with .Substring() calls is sufficient to extract chunks - no need for a string builder.
- uses a System.Collections.Generic.List`1 instance to collect all chunks in a list.
# Sample string
$str = '1234abc€défg👍'
$maxChunkLen = 4 # the max. *byte* count, excluding the trailing NUL
$chunks = [System.Collections.Generic.List[string]]::new()
$chunkByteCount = 0
$chunkStartNdx = 0
for ($i = 0; $i -lt $str.Length; ++$i) {
$codePoint = [int] $str[$i]
$isSurrogatePair = [char]::IsSurrogatePair($str, $i)
# Note: A surrogate pair encoded as UTF-8 is a *single* non-BMP
# character encoded in 4 bytes, not 2 3-byte *surrogates*.
$thisByteCount =
if ($isSurrogatePair) { 4 }
elseif ($codePoint -ge 0x800) { 3 }
elseif ($codePoint -ge 0x80) { 2 }
else { 1 }
if ($chunkByteCount + $thisByteCount -gt $maxChunkLen) {
# Including this char. / surrogate pair would make the chunk too long.
# Add the current chunk plus a trailing NUL to the list...
$chunks.Add($str.Substring($chunkStartNdx, ($i - $chunkStartNdx)) + "`0")
# ... and start a new chunk with this char. / surrogate pair.
$chunkStartNdx = $i
$chunkByteCount = $thisByteCount
}
else {
# Still fits into the current chunk.
$chunkByteCount += $thisByteCount
}
if ($isSurrogatePair) { ++$i }
}
# Add a final chunk to the list, if present.
if ($chunkStartNdx -lt $str.Length) { $chunks.Add($str.Substring($chunkStartNdx) + "`0") }
# Output the resulting chunks
$chunks
Output (annotated):
1234 # 1 + 1 + 1 + 1 bytes (+ NUL)
abc # 1 + 1 + 1 bytes (+ NUL)
€d # 3 + 1 bytes (+ NUL)
éfg # 2 + 1 + 1 bytes (+ NUL)
👍 # 4 bytes (+ NUL)
Note that all chunks except the first one have fewer than 4 characters (the byte-count limit, excluding the NUL), due to containing multi-byte-as-UTF-8 characters and/or due to the next character being such a character and therefore not fitting into the current chunk.
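As a quick sanity check (and as a way to get the UTF-8 byte arrays rather than strings, per the note at the top), you can run each chunk back through the encoder; the byte count here includes the trailing NUL, so it should never exceed $maxChunkLen + 1:
$chunks | ForEach-Object {
    [pscustomobject] @{
        Chunk     = $_
        ByteCount = [System.Text.Encoding]::UTF8.GetByteCount($_)   # includes the trailing NUL
        Bytes     = [System.Text.Encoding]::UTF8.GetBytes($_) -join ' '
    }
}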
Answer 2
Score: 4
Ok so first off, everything I know about the binary encoding process for UTF-8 I learned just now from the UTF-8 Wikipedia page, so caveat emptor :-). That said, the code below seems to give the correct result for a bunch of test data compared to @mklement0's answer (which I have no doubt is correct), so maybe there's some mileage in it...
Happy to hear if there's any howlers in it though - I think it should work in principle even if the implementation below is wrong somewhere :-).
In any case, the core function is the one below, which returns the positions of the chunks in a utf-8 encoded byte array. I figured once you know the positions you might want to use the bytes in-place (e.g. with Stream.Write(byte[] buffer, int offset, int count), or similar) so I've avoided extracting a copy of the chunks into a new list, but it's pretty easy to use the output to do that if required (see further down).
# extracts the *positions* of chunks of bytes in a utf-8 byte array
# such that no multi-byte codepoints are split across chunks, and
# all chunks are a maximum of $MaxLen bytes
#
# note - assumes $Utf8Bytes is a *valid* utf8 byte array, so may
# need error handling if you expect invalid data to be passed in.
function Get-UTF8ChunkPositions
{
param( [byte[]] $Utf8Bytes, [int] $MaxLen )
# from https://en.wikipedia.org/wiki/UTF-8
#
# Code point ↔ UTF-8 conversion
# -----------------------------
# First code point Last code point Byte 1 Byte 2 Byte 3 Byte 4 Code points
# U+0000 U+007F 0xxxxxxx 128
# U+0080 U+07FF 110xxxxx 10xxxxxx 1920
# U+0800 U+FFFF 1110xxxx 10xxxxxx 10xxxxxx 61440
# U+10000 U+10FFFF 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx 1048576
# stores the start position of each chunk
$startPositions = [System.Collections.Generic.List[int]]::new();
$i = 0;
while( $i -lt $Utf8Bytes.Length )
{
# remember the start position for the current chunk
$startPositions.Add($i);
# jump past the end of the current chunk, optimistically assuming we won't land in
# the middle of a multi-byte codepoint (but we'll deal with that in a minute)
$i += $MaxLen;
# if we've gone past the end of the array then we're done as there's no more
# chunks after the current one, so there's no more start positions to record
if( $i -ge $Utf8Bytes.Length )
{
break;
}
# if we're in the middle of a multi-byte codepoint, backtrack until we're not.
# we're then at the start of the *next* chunk, and the chunk length is definitely
# smaller than $MaxLen
#
# 0xC0 = [Convert]::ToInt32("11000000", 2);
# 0x80 = [Convert]::ToInt32("10000000", 2);
while( ($Utf8Bytes[$i] -band 0xC0) -eq 0x80 )
{
$i -= 1;
}
}
# we know all the start positions now, so turn them into ranges.
# (we'll add a dummy item to help build the last range)
$startPositions.Add($utf8Bytes.Length);
for( $i = 1; $i -lt $startPositions.Count; $i++ )
{
[PSCustomObject] @{
"Start" = $startPositions[$i-1];
"Length" = $startPositions[$i] - $startPositions[$i-1];
};
}
}
This function takes advantage of the fact that a valid utf8 byte stream is self-synchronizing, which means in practice that we don't have to walk through every single byte in the stream - we can jump anywhere we like and find the start of an encoded codepoint by finding the nearest byte that starts 00xxxxxx, 01xxxxxx, or 11xxxxxx (or equivalently, doesn't start 10xxxxxx), and that's the start of the next chunk.
For example, if we jumped n bytes from the start of the current chunk and found a byte starting with 10xxxxxx:

    n-2      n-1      n        n+1
... 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx ...
                      ^^^^^^^^

then we backtrack to n-2 as the start of the next chunk (so the end of the current chunk is logically one byte before that, at n-3):

    n-2      n-1      n        n+1
... 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx ...
    ^^^^^^^^
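To see that mask test on its own, here's a small optional illustration: classify each byte of the UTF-8 encoding of '€d' with the same 0xC0 / 0x80 check the backtracking loop uses:
[System.Text.Encoding]::UTF8.GetBytes('€d') | ForEach-Object {
    # continuation bytes are the ones that start 10xxxxxx
    $kind = if( ($_ -band 0xC0) -eq 0x80 ) { "continuation" } else { "lead/ASCII" }
    "0x{0:X2} {1} -> {2}" -f $_, [Convert]::ToString($_, 2).PadLeft(8, "0"), $kind
}
# 0xE2 11100010 -> lead/ASCII
# 0x82 10000010 -> continuation
# 0xAC 10101100 -> continuation
# 0x64 01100100 -> lead/ASCII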
Example:
# set up some test data
$str = "1234abc€défg👍abüb";
$utf8 = [System.Text.Encoding]::UTF8.GetBytes($str);
"$utf8"
# 49 50 51 52 97 98 99 226 130 172 100 195 169 102 103 240 159 145 141 97 98 195 188 98
$maxLen = 5 - 1; # exclude NUL terminator
$positions = Get-UTF8ChunkPositions -Utf8Bytes $utf8 -MaxLen $maxLen;
$positions
# Start Length
# ----- ------
#     0      4
#     4      3
#     7      4
#    11      4
#    15      4
#    19      4
#    23      1
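And to use the bytes in place as mentioned above, rather than copying them into per-chunk arrays, a minimal sketch might look like this (the output path is just a placeholder):
$stream = [System.IO.File]::OpenWrite("C:\temp\chunks.bin");   # placeholder path
try
{
    $nul = [byte[]] @( 0 );
    foreach( $position in $positions )
    {
        # write the chunk straight from the original buffer, then a trailing NUL
        $stream.Write($utf8, $position.Start, $position.Length);
        $stream.Write($nul, 0, 1);
    }
}
finally
{
    $stream.Dispose();
}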
Chunks
If you actually want the individual chunked byte arrays with a null terminator you can convert the positions into chunks like this:
$chunks = $positions | foreach-object {
$chunk = new-object byte[] ($_.Length + 1); # allow room for NUL terminator
[Array]::Copy($utf8, $_.Start, $chunk, 0, $_.Length);
$chunk[$chunk.Length - 1] = 0; # utf-8 encoded NUL is 0
@(, $chunk); # send the chunk to the pipeline
};
Strings
Once you've got the individual null-terminated utf-8 encoded chunks, you can revert them back to the null-terminated strings like this if you really want them:
$strings = $chunks | foreach-object { [System.Text.Encoding]::UTF8.GetString($_) }
$strings
Performance
In terms of performance, this seems to scale reasonably well, even if you have to specifically encode the string to a utf8 byte array just to call it. It also performs a whole lot faster with large numbers for $MaxLen as it gets to jump longer distances for each chunk...
A rough test on my machine:
$sample = "1234abc€défg👍abüb" * 1000000;
$sample.Length / 1mb;
# 17
Measure-Command {
$utf8 = [System.Text.Encoding]::UTF8.GetBytes($sample);
$maxlen = 1024;
$positions = Get-UTF8ChunkPositions $utf8 $maxlen;
$chunks = $positions | foreach-object {
$chunk = [byte[]]::new($_.Length + 1); # allow room for NUL terminator
[Array]::Copy($utf8, $_.Start, $chunk, 0, $_.Length);
$chunk[$chunk.Length - 1] = 0; # utf-8 encoded NUL is 0
@(, $chunk); # send the chunk to the pipeline
};
};
# TotalMilliseconds : 7986.131
Although if performance is your main concern then maybe PowerShell isn't the right choice for you :-)...