June 5, 2023 23:44:20

Do failures seeking backwards in a gzip.GzipFile mean it's broken?

Question

I have files with a small header (8 bytes, say zrxxxxxx), followed by a gzipped stream of data. Reading such files works fine most of the time. However, in very specific cases, seeking backwards fails. This is a simple way to reproduce the problem:

from gzip import GzipFile
f = open('test.bin', 'rb')
f.read(8)  # Read zrxxxxxx
h = GzipFile(fileobj=f, mode='rb')
h.seek(8192)
h.seek(8191)  # gzip.BadGzipFile: Not a gzipped file (b'zr')

Unfortunately I cannot share my file, but it looks like any similar file will do.
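Since the original file cannot be shared, here is a minimal sketch of how a similar one could be generated. The helper name `make_test_file`, the payload contents, and the payload size are assumptions made for illustration; only the 8-byte `zrxxxxxx` header followed by a gzip stream comes from the description above.

```python
import gzip


def make_test_file(path, header=b"zrxxxxxx", payload_size=20000):
    """Write an 8-byte header followed by a gzip stream; return the payload.

    Hypothetical helper: the payload is arbitrary, it only needs to
    decompress to more than 8192 bytes so the seeks in the reproduction
    land inside the data.
    """
    payload = bytes(i % 251 for i in range(payload_size))
    with open(path, "wb") as f:
        f.write(header)                  # custom, non-gzip header
        f.write(gzip.compress(payload))  # gzip stream starts at offset 8
    return payload
```

Calling `make_test_file('test.bin')` should produce a file that the reproduction above can open.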

Debugging the situation, I noticed that DecompressReader.seek (in Lib/_compression.py) sometimes rewinds the original file, which I suspect causes the issue:

#...
# Rewind the file to the beginning of the data stream.
def _rewind(self):
    self._fp.seek(0)
    #...
def seek(self, offset, whence=io.SEEK_SET):
    #...
    # Make it so that offset is the number of bytes to skip forward.
    if offset < self._pos:
        self._rewind()
    else:
        offset -= self._pos
    #...

Is this a bug? Or is it me doing it wrong?

Any simple workaround?

Answer 1

Score: 3

Looks like a bug in Python. When you ask it to seek backwards, it has to go all the way back to the start of the gzip stream and start over. However, the library does not take note of the offset of the file object it was given, so instead of rewinding to the start of the gzip stream, it rewinds to the start of the file.
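The effect of rewinding to offset 0 can be seen without GzipFile at all. The sketch below uses an in-memory `io.BytesIO` in place of the real file, and the payload is made up; it only shows which two bytes sit at the start of the file versus the start of the gzip stream, which is why the error message mentions `b'zr'`.

```python
import gzip
import io

header = b"zrxxxxxx"
blob = header + gzip.compress(b"hello world" * 1000)  # made-up payload

f = io.BytesIO(blob)
f.read(len(header))   # skip the custom header, as in the question
start = f.tell()      # offset where the gzip stream really begins

# A gzip stream always starts with the magic bytes 0x1f 0x8b.
magic_at_start = blob[start:start + 2]

# Seeking to 0 lands on the custom header instead of the gzip magic.
f.seek(0)
first_two = f.read(2)

print(start, magic_at_start, first_two)
```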

As for a workaround, you would need to give GzipFile a custom file object with a replaced seek() operation, such that seek(0) goes to the right place. This seemed to work:

from gzip import GzipFile
f = open('test.bin', 'rb')
f.read(8)  # Read zrxxxxxx
class shift():
    # Wraps f so that offset 0 maps to the position f had when wrapped.
    def __init__(self, f):
        self.f = f
        self.to = f.tell()  # start of the gzip stream within the file
    def seek(self, offset):
        # Shift absolute seeks past the header.
        return self.f.seek(self.to + offset)
    def read(self, size=-1):
        return self.f.read(size)
s = shift(f)
h = GzipFile(fileobj=s, mode='rb')
h.seek(8192)
h.seek(8191)  # no longer raises

(I don't really know Python, so I'm sure there's a better way. I tried to subclass file so that I would only need to intercept seek(), but somehow file is not actually a class.)
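A slightly fuller variant of the same idea might look like the sketch below. The names `OffsetFile` and `open_gzip_after_header` are hypothetical, and the set of methods (`seek` with `whence`, `tell`, `read`, `seekable`) is an assumption about what a file-like wrapper may reasonably be asked for; the gzip module is not documented to require exactly these.

```python
import gzip
import io


class OffsetFile:
    """Hypothetical helper: presents f as if it began at its current position."""

    def __init__(self, f):
        self._f = f
        self._base = f.tell()  # absolute offset where the gzip stream starts

    def seek(self, offset, whence=io.SEEK_SET):
        if whence == io.SEEK_SET:
            # Absolute seeks are shifted past the header.
            self._f.seek(self._base + offset)
        else:
            # SEEK_CUR and SEEK_END are already relative to a real position.
            self._f.seek(offset, whence)
        return self.tell()

    def tell(self):
        # Position relative to the start of the gzip stream.
        return self._f.tell() - self._base

    def read(self, size=-1):
        return self._f.read(size)

    def seekable(self):
        return self._f.seekable()


def open_gzip_after_header(f):
    """Wrap an already-positioned binary file in a GzipFile."""
    return gzip.GzipFile(fileobj=OffsetFile(f), mode="rb")
```

With this in place, `h = open_gzip_after_header(f)` followed by `h.seek(8192)` and `h.seek(8191)` should behave like the workaround above. If the file is small enough to fit in memory, passing `io.BytesIO(f.read())` as `fileobj` is another option, since the in-memory buffer then starts at the gzip stream.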