Python: how to unpack variable-length data from byte string?

Question

There's a byte-string like this:

[length1][sequence1][len2][seq2][len3][seq3][len1][seq1]...

where lengthX is the length of the sequenceX that immediately follows it. Note that there are no separators at all, and the "len-data" pairs come in groups of three (after seq3, the len1 of the next group follows immediately).

I'm trying to extract all the sequences, but using struct.unpack() looks very cumbersome (or I don't know how to use it properly):

 loop_start:
   my_len = unpack("<B", content[:1])[0]
   content = content[1:]
   ..get sequence1
   ..shift byte-string
   ..repeat two times...

Is there any simpler way?

P.S. seqX is in fact a multi-byte string, if that matters.

Answer 1

Score: 3

This data structure is very useful when, for example, sending arbitrary data over a socket. Using separators can be problematic due to ambiguity - e.g., if you use STX/ETX, there may be an issue if the [real] data contains the equivalent of either of those markers.

Sending data as length/data pairs removes that ambiguity. All that needs to happen is that the client and server agree on the format of the length value being transmitted (its size, and whether it is in native, little-endian, or big-endian byte order).
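
To illustrate why the byte order has to be agreed on, here is a small sketch that packs the same length with the three byte-order prefixes struct offers. The value 41 is just an arbitrary example, not something from the question.

```python
import sys
from struct import pack

length = 41  # arbitrary example value (0x29)

little = pack('<I', length)  # least significant byte first
big = pack('>I', length)     # most significant byte first
native = pack('=I', length)  # byte order of this machine, standard size

print(little)  # b')\x00\x00\x00'
print(big)     # b'\x00\x00\x00)'
# The native result matches whichever of the two the host CPU uses
print(native == (little if sys.byteorder == 'little' else big))  # True
```

A receiver that reads the prefix with the wrong byte order would see 687865856 instead of 41, which is why both sides have to fix the format up front.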

This is best explained by example.

We have a list of strings and we build a bytearray of length/data pairs. We'll agree on an unsigned int in native byte order for the preamble. With the '=' prefix, struct uses standard sizes, so the packed value always consists of 4 bytes.

So...

from struct import pack, unpack

strings = [
    'To be, or not to be: that is the question',
    'All the world\'s a stage, and all the men and women merely players',
    'We are such stuff as dreams are made on',
    'The course of true love never did run smooth',
    'If music be the food of love, play on',
    'Friends, Romans, countrymen, lend me your ears',
    'A horse! a horse! my kingdom for a horse!',
    'Once more unto the breach, dear friends, once more',
    'To thine own self be true',
    'Parting is such sweet sorrow'
]

FMT = '=I' # unsigned int, native byte order, standard size
FMTL = 4 # standard size of 'I' in bytes

b = bytearray()

for string in strings:
    bs = string.encode()
    b += pack(FMT, len(bs)) + bs

# at this point we have a bytearray comprised of length/data pairs
# now let's unravel it

while b:
    # read the length prefix, print the data that follows, then drop both
    length, *_ = unpack(FMT, b[:FMTL])
    print(b[FMTL:length+FMTL].decode())
    b = b[length+FMTL:]

This code is easily adapted to any integer type by setting FMT and FMTL appropriately. Type 'c' is hinted at in OP's question; that has to be dealt with in a slightly different manner.
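
As a rough sketch of that adaptation for the layout in the question (a one-byte "<B" length prefix, pairs grouped in threes), something like the following might work. The names iter_sequences and iter_groups and the sample bytes are made up for illustration. It walks an offset with struct.unpack_from instead of re-slicing the buffer on every step.

```python
from itertools import islice
from struct import calcsize, unpack_from


def iter_sequences(data, fmt="<B"):
    """Yield every length-prefixed sequence found in data (hypothetical helper)."""
    size = calcsize(fmt)  # width of the length prefix in bytes
    offset = 0
    while offset < len(data):
        (length,) = unpack_from(fmt, data, offset)
        offset += size
        if offset + length > len(data):
            raise ValueError("truncated sequence at offset %d" % offset)
        yield bytes(data[offset:offset + length])
        offset += length


def iter_groups(data, fmt="<B", group_size=3):
    """Yield the sequences in lists of group_size (three in the question)."""
    sequences = iter_sequences(data, fmt)
    while True:
        group = list(islice(sequences, group_size))
        if not group:
            return
        yield group


# Made-up sample: two groups of three len/data pairs, one sequence being empty
content = (b"\x03" + b"abc" + b"\x02" + b"de" + b"\x01" + b"f"
           + b"\x04" + b"ghij" + b"\x00" + b"\x02" + b"kl")

groups = list(iter_groups(content))
print(groups)
# [[b'abc', b'de', b'f'], [b'ghij', b'', b'kl']]
```

Switching to a wider prefix should only need a different fmt, e.g. "<H" or ">I", since the prefix width is taken from calcsize().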

Answer 2

Score: 0

Since there's no example of what the data could look like, I provided my own sample (the length is always an integer, and therefore 4 bytes). Whether your bytes come in big- or little-endian order is something you have to check for yourself. I decided to use big-endian since it's a bit easier to read and understand. To change that, you just have to change the > to <.
The code works by padding the number and using the biggest available type (unsigned long long). Since adding zeroes on the left side of a big-endian number has no effect, this approach should work as long as no sequence is longer than 8 bytes. For little-endian, you have to add the padding to the right side.
To prevent problems with different architectures, where data types may differ, the size of the needed padding is calculated using struct.calcsize().

import struct

content = b'\x00\x00\x00\x04\x12\x4a\x13\x10\x00\x00\x00\x03\x01\x00\x01\x00\x00\x00\x05\x00\x00\x00\x00\xff'
out = []

while content:
    # 4-byte big-endian length prefix
    length = struct.unpack('>I', content[:4])[0]
    # left-pad the value to the size of '>Q' so it unpacks as unsigned long long
    pad_len = struct.calcsize('>Q') - length
    padding = b'\x00' * pad_len
    value = struct.unpack('>Q', padding + content[4:4+length])[0]
    out.append(value)
    # drop the pair that was just consumed
    content = content[4+length:]

print(out)

output:

[306844432, 65537, 255]
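
If the sequences really are plain unsigned integers, a possible variant that avoids the padding step is int.from_bytes, which accepts any length and takes the byte order as an argument. This is only a sketch under that assumption; read_ints and its parameters are illustrative names, and it assumes the prefix and the value share the same byte order.

```python
def read_ints(content, byteorder="big", prefix_size=4):
    """Decode length-prefixed unsigned integers (hypothetical helper)."""
    out = []
    offset = 0
    while offset < len(content):
        # read the length prefix
        length = int.from_bytes(content[offset:offset + prefix_size], byteorder)
        offset += prefix_size
        # read the value itself; no padding needed, any length works
        out.append(int.from_bytes(content[offset:offset + length], byteorder))
        offset += length
    return out


content = b'\x00\x00\x00\x04\x12\x4a\x13\x10\x00\x00\x00\x03\x01\x00\x01\x00\x00\x00\x05\x00\x00\x00\x00\xff'
print(read_ints(content))
# [306844432, 65537, 255]
```

For little-endian data, passing byteorder="little" should be enough, with no need to move padding from one side to the other.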
