2023-02-10 07:07:29

Python equivalent of Ruby's Array#pack, how to pack unknown string length and bytes together

Question

I am working my way through the book "Building Git", which walks through building Git in Ruby. I decided to write it in Python while still following along with the book.

The author uses Ruby's Array#pack to pack a Git tree object. Git stores the 40-character blob hash in binary form, which reduces it to 20 bytes. In the author's words:

> Putting everything together, this generates a string for each entry consisting of the mode 100644,
a space, the filename, a null byte, and then twenty bytes for the object ID. Ruby’s Array#pack
supports many more data encodings and is very useful for generating binary representations of
values. If you wanted to, you could implement all the maths for reading pairs of digits from
the object ID and turning each pair into a single byte, but Array#pack is so convenient that I
usually reach for that first.

He uses the following code to implement this:

def to_s
    entries = @entries.sort_by(&:name).map do |entry|
      ["#{ MODE } #{ entry.name }", entry.oid].pack(ENTRY_FORMAT)
    end
    entries.join
end

with ENTRY_FORMAT = "Z*H40" and MODE = "100644".
entry is an instance of a class with :name and :oid attributes, representing a file's name and its SHA-1 hash.

And the format "Z*H40" means the following:

> Our usage here consists of two separate encoding instructions:
> - Z*: this encodes the first string, "#{ MODE } #{ entry.name }", as an arbitrary-length null-padded string, that is, it represents the string as-is with a null byte appended to the end
> - H40: this encodes a string of forty hexadecimal digits, entry.oid, by packing each pair of digits into a single byte as we saw in Section 2.3.3, “Trees on disk”

I have tried for many hours to replicate this in Python using struct.pack and various other methods, but either I am not getting the format right or I am missing something very obvious. In any case, this is what I currently have:

def to_s(self):
    entries = sorted(self.entries, key=lambda x: x.name)

    entries = [f"{self.MODE} {entry.name}" + entry.oid.encode() for entry in entries]
    packed_entries = b"".join(pack("!Z*40s", entry) for entry in entries)

    return packed_entries

But obviously this fails, because a str cannot be concatenated with bytes:

Traceback (most recent call last):
  File "jit.py", line 67, in <module>
    database.store(tree)
  File "/home/maslin/jit/pyJit/database.py", line 12, in store
    string = obj.to_s()
  File "/home/maslin/jit/pyJit/tree.py", line 40, in to_s
    entries = [f"{self.MODE} {entry.name}" + entry.oid.encode() for entry in entries]
  File "/home/maslin/jit/pyJit/tree.py", line 40, in <listcomp>
    entries = [f"{self.MODE} {entry.name}" + entry.oid.encode() for entry in entries]
TypeError: can only concatenate str (not "bytes") to str

So then I tried to keep everything as a string, and tried using struct.pack to format it for me, but it gave me a struct.error: bad char in struct format error.

def to_s(self):
    entries = sorted(self.entries, key=lambda x: x.name)

    entries = [f"{self.MODE} {entry.name}" + entry.oid for entry in entries]
    packed_entries = b"".join(pack("!Z*40s", entry) for entry in entries)

    return packed_entries

And the traceback:

Traceback (most recent call last):
  File "jit.py", line 67, in <module>
    database.store(tree)
  File "/home/maslin/jit/pyJit/database.py", line 12, in store
    string = obj.to_s()
  File "/home/maslin/jit/pyJit/tree.py", line 41, in to_s
    packed_entries = b"".join(pack("!Z*40s", entry) for entry in entries)
  File "/home/maslin/jit/pyJit/tree.py", line 41, in <genexpr>
    packed_entries = b"".join(pack("!Z*40s", entry) for entry in entries)
struct.error: bad char in struct format

How can I pack a string for each entry consisting of the mode 100644,
a space, the filename, a null byte, and then twenty bytes for the object ID?

The author notes above that this can be done by "implementing all the maths for reading pairs of digits from
the object ID and turning each pair into a single byte", so if your solution involves this method, that is also ok.

P.S. this question did not help me nor did this.

P.P.S. ChatGPT was no help either.

Answer 1

Score: 1

So, I had to look this up. The binary format is simple:

- the mode as an ASCII byte string
- an ASCII space
- the filename as a byte string
- a null byte
- the SHA digest in binary format

So,

mode = b"100644"

Note, mode is a bytes object. You should probably just have it as a bytes object, but if it is a string, you can just .encode it and it should work with utf-8 since it will only be in the ascii range.

Now, your filename is probably a string, e.g.:

filename = "foo.py"

Now, you didn't say exactly, but I presume your oid is the sha1 hexdigest, i.e. a length-40 string of the digest in hexadecimal. However, you should probably just work with the raw digest. Assuming you hashed some content like this:

>>> import hashlib
>>> sha = hashlib.sha1(b"print('hello, world')")
>>> sha.hexdigest()
'da8b53bb595a2bd0161f6470a4c3a82f6aa1dc9e'
>>> sha.digest()
b'\xda\x8bS\xbbYZ+\xd0\x16\x1fdp\xa4\xc3\xa8/j\xa1\xdc\x9e'

You want the .digest() directly. You should probably just keep the hash object around and get whatever you need from it. You can also convert back and forth, so if you only have the hexdigest, you can get to the binary form using:

>>> oid = sha.hexdigest()
>>> oid
'da8b53bb595a2bd0161f6470a4c3a82f6aa1dc9e'
>>> int(oid, 16).to_bytes(20, "big")  # byteorder is required before Python 3.11
b'\xda\x8bS\xbbYZ+\xd0\x16\x1fdp\xa4\xc3\xa8/j\xa1\xdc\x9e'
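If all you need is hex string to raw bytes, `bytes.fromhex` does it in one call, and is probably the closest built-in counterpart to what Ruby's `H40` directive does here. Below is a minimal sketch that also spells out the "pairs of digits" arithmetic by hand for comparison. The helper name `pack_hex_pairs` is made up for illustration.

```python
import hashlib


def pack_hex_pairs(hex_string: str) -> bytes:
    """Turn each pair of hex digits into one byte, by hand (hypothetical helper)."""
    if len(hex_string) % 2:
        raise ValueError("expected an even number of hex digits")
    return bytes(
        int(hex_string[i:i + 2], 16)  # two hex digits -> one value in 0..255
        for i in range(0, len(hex_string), 2)
    )


sha = hashlib.sha1(b"print('hello, world')")
oid_hex = sha.hexdigest()

raw_builtin = bytes.fromhex(oid_hex)   # built-in conversion
raw_manual = pack_hex_pairs(oid_hex)   # same result, done manually
```

Both values should be equal to `sha.digest()` and 20 bytes long.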

But really, if you are only going to keep one of them around, I'd keep the binary form. Going the other way, it seems more natural to me to convert to an int and then format that in hex:

>>> oid = sha.digest()
>>> oid
b'\xda\x8bS\xbbYZ+\xd0\x16\x1fdp\xa4\xc3\xa8/j\xa1\xdc\x9e'
>>> int.from_bytes(oid, "big")  # byteorder is required before Python 3.11
1247667085693497210187506196029418989550863244446
>>> f"{int.from_bytes(oid, 'big'):x}"
'da8b53bb595a2bd0161f6470a4c3a82f6aa1dc9e'
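One caveat with the int route: formatting with `:x` drops leading zeros, so a digest that begins with a zero byte would come out shorter than 40 characters unless you pad it (`:040x`). `bytes.hex()` avoids the issue. A small sketch, using a made-up 20-byte value that starts with a zero byte rather than a real hash:

```python
# Hypothetical 20-byte id whose first byte is zero (not a real digest).
oid = bytes([0x00, 0x0A]) + bytes(range(18))

via_hex = oid.hex()                                      # keeps leading zeros
via_int = f"{int.from_bytes(oid, 'big'):x}"              # leading zeros are lost
via_int_padded = f"{int.from_bytes(oid, 'big'):040x}"    # zero-padded to 40 digits
```

Here `via_hex` and `via_int_padded` agree, while `via_int` is shorter than 40 characters.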

So, I'm going to assume you have:

>>> import hashlib
>>> mode = b"100644"
>>> filename = "foo.py"
>>> sha = hashlib.sha1(b"print('hello, world')")
>>> oid = sha.digest()

Now, there is no f-string-like interpolation for bytes-literals, but you can use the old-school % based formatting:

>>> entry = b"%s %s\x00%s" % (mode, filename.encode(), oid)
>>> entry
b'100644 foo.py\x00\xda\x8bS\xbbYZ+\xd0\x16\x1fdp\xa4\xc3\xa8/j\xa1\xdc\x9e'

Or since this is so simple, just concatenation:

>>> entry = mode + b" " + filename.encode() + b"\x00" + oid
>>> entry
b'100644 foo.py\x00\xda\x8bS\xbbYZ+\xd0\x16\x1fdp\xa4\xc3\xa8/j\xa1\xdc\x9e'
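Applied to the tree class from the question, a sketch might look like the following. The `Entry` and `Tree` classes are stand-ins written to mirror the question, so their names and attributes are assumptions, and `oid` is assumed to be the 40-character hex digest. The method is named `to_bytes` here only to make the return type obvious.

```python
import hashlib
from dataclasses import dataclass


@dataclass
class Entry:
    """Stand-in for the question's entry class (assumed shape)."""
    name: str
    oid: str  # assumed to be a 40-character hex digest


class Tree:
    MODE = b"100644"

    def __init__(self, entries):
        self.entries = entries

    def to_bytes(self) -> bytes:
        # Sort by name, then emit: mode, space, name, null byte, 20 raw bytes.
        entries = sorted(self.entries, key=lambda e: e.name)
        return b"".join(
            self.MODE + b" " + e.name.encode() + b"\x00" + bytes.fromhex(e.oid)
            for e in entries
        )


tree = Tree([
    Entry("world.txt", hashlib.sha1(b"world").hexdigest()),
    Entry("hello.txt", hashlib.sha1(b"hello").hexdigest()),
])
data = tree.to_bytes()
```

Each entry here is 37 bytes (6 for the mode, 1 space, 9 for the name, 1 null byte, 20 for the id), so `data` is 74 bytes and begins with the `hello.txt` entry.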

Now, you could use struct.pack here, but it's a bit unwieldy. There's no good way to add a space except as a single character. Also, you'd have to build the format string dynamically, since there is no format for "arbitrary-sized, null-terminated byte string". But you can use an f-string and len(filename.encode()) + 1; the "s" format pads with null bytes, so the extra byte becomes the terminator. So it would need to be something like:

>>> import struct
>>> struct.pack(f">6sc{len(filename.encode())+1}s20s", mode, b" ", filename.encode(), oid)
b'100644 foo.py\x00\xda\x8bS\xbbYZ+\xd0\x16\x1fdp\xa4\xc3\xa8/j\xa1\xdc\x9e'
>>> struct.pack(f">6sc{len(filename.encode())+1}s20s", mode, b" ", filename.encode(), oid) == entry
True
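If you do go the struct route, it may be easier to wrap the dynamic format string in a function. The sketch below does that and, as a sanity check on the layout, splits a single entry back into its parts. Both function names are hypothetical, and `unpack_entry` assumes it is given exactly one entry and that the mode and filename contain no null byte.

```python
import hashlib
import struct


def pack_entry(mode: bytes, filename: str, oid: bytes) -> bytes:
    """Pack one tree entry with struct (hypothetical helper)."""
    name = filename.encode()
    # The name field is one byte longer than the name; struct pads it with \x00.
    fmt = f">{len(mode)}sc{len(name) + 1}s{len(oid)}s"
    return struct.pack(fmt, mode, b" ", name, oid)


def unpack_entry(entry: bytes):
    """Split a single packed entry back into (mode, filename, oid)."""
    header, _, oid = entry.partition(b"\x00")  # first null byte ends the name
    mode, _, name = header.partition(b" ")     # first space ends the mode
    return mode, name.decode(), oid


digest = hashlib.sha1(b"print('hello, world')").digest()
packed = pack_entry(b"100644", "foo.py", digest)
unpacked = unpack_entry(packed)
```

Plain concatenation is still the simpler choice; this only shows that the struct version produces the same bytes.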
