Do failures seeking backwards in a gzip.GzipFile mean it's broken?

Question

I have files with a small header (8 bytes, say zrxxxxxx), followed by a gzipped stream of data. Reading such files works fine most of the time. However, in very specific cases, seeking backwards fails. Here is a simple way to reproduce it:

from gzip import GzipFile

f = open('test.bin', 'rb')
f.read(8)  # Read zrxxxxxx

h = GzipFile(fileobj=f, mode='rb')
h.seek(8192)
h.seek(8191)  # gzip.BadGzipFile: Not a gzipped file (b'zr')

Unfortunately I cannot share my file, but it looks like any similar file will do.
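Since the original file cannot be shared, a stand-in can be generated. The sketch below is an assumption about what "any similar file" looks like: an 8-byte header followed by a single gzip stream with more than 8192 bytes of payload. The helper names `make_test_file` and `try_backward_seek` are hypothetical, and whether the backward seek fails may depend on the Python version, so the result is only printed.

```python
import gzip
import os
import tempfile


def make_test_file(path, header=b"zrxxxxxx", payload_size=20000):
    """Write `header` followed by a gzip stream; return the uncompressed payload."""
    payload = bytes(i % 251 for i in range(payload_size))
    with open(path, "wb") as out:
        out.write(header)
        out.write(gzip.compress(payload))
    return payload


def try_backward_seek(path, header_size=8):
    """Return None if the backward seek works, else the exception it raised."""
    with open(path, "rb") as f:
        f.read(header_size)  # skip the custom header
        with gzip.GzipFile(fileobj=f, mode="rb") as h:
            try:
                h.seek(8192)
                h.seek(8191)
            except OSError as exc:  # gzip.BadGzipFile is a subclass of OSError
                return exc
    return None


path = os.path.join(tempfile.mkdtemp(), "test.bin")
payload = make_test_file(path)
print("backward seek result:", try_backward_seek(path))
```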

Debugging the situation, I noticed that DecompressReader.seek (in Lib/_compression.py) sometimes rewinds the original file, which I suspect causes the issue:

#...
# Rewind the file to the beginning of the data stream.
def _rewind(self):
    self._fp.seek(0)
    #...

def seek(self, offset, whence=io.SEEK_SET):
    #...
    # Make it so that offset is the number of bytes to skip forward.
    if offset < self._pos:
        self._rewind()
    else:
        offset -= self._pos
    #...

Is this a bug? Or is it me doing it wrong?

Any simple workaround?

Answer 1

Score: 3

Looks like a bug in Python. When you ask it to seek backwards, it has to go all the way back to the start of the gzip stream and start over. However, the library does not record the offset the file object was at when it was handed over, so instead of rewinding to the start of the gzip stream, it rewinds to the start of the file.
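To see why rewinding to the wrong place produces that particular error, here is a small in-memory sketch. It assumes a layout like the one in the question (8-byte header, then a gzip stream) and compares the two bytes found at each position against the gzip magic number.

```python
import gzip
import io

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip member (RFC 1952)

header = b"zrxxxxxx"
raw = io.BytesIO(header + gzip.compress(b"hello" * 4000))

# Where the rewind should land: the start of the gzip stream.
raw.seek(len(header))
at_stream_start = raw.read(2)

# Where seek(0) actually lands: the start of the file.
raw.seek(0)
at_file_start = raw.read(2)

print(at_stream_start == GZIP_MAGIC)  # True
print(at_file_start)                  # b'zr'
```

The `b'zr'` read at offset 0 matches the bytes quoted in the `BadGzipFile` message from the question.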

As for a workaround, you would need to give GzipFile a custom file object with a replaced seek() operation, such that seek(0) goes to the right place. This seemed to work:

from gzip import GzipFile

class shift:
    """File-object wrapper whose offset 0 is the position f had when wrapped."""
    def __init__(self, f):
        self.f = f
        self.to = f.tell()  # offset of the gzip stream within the file
    def seek(self, offset):
        # Shift absolute seeks so that seek(0) lands on the gzip stream.
        return self.f.seek(self.to + offset)
    def read(self, size=-1):
        return self.f.read(size)

f = open('test.bin', 'rb')
f.read(8)  # Read zrxxxxxx
s = shift(f)
h = GzipFile(fileobj=s, mode='rb')
h.seek(8192)
h.seek(8191)

(I don't really know Python, so I'm sure there's a better way. I tried to subclass file so that I would only need to intercept seek(), but somehow file is not actually a class.)
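On the subclassing remark: Python 3 has no built-in `file` class, and file objects are instances of classes from the `io` module. One possible variant of the wrapper above subclasses `io.RawIOBase` instead. This is only a sketch; the class name `OffsetFile` is hypothetical, and it assumes the wrapped object supports `read`, `seek` and `tell`.

```python
import gzip
import io


class OffsetFile(io.RawIOBase):
    """Read-only view of `raw` in which offset 0 is raw's position when wrapped."""

    def __init__(self, raw):
        super().__init__()
        self._raw = raw
        self._base = raw.tell()  # start of the embedded gzip stream

    def readable(self):
        return True

    def seekable(self):
        return True

    def readinto(self, b):
        # RawIOBase.read() is implemented in terms of readinto().
        data = self._raw.read(len(b))
        b[:len(data)] = data
        return len(data)

    def seek(self, offset, whence=io.SEEK_SET):
        # Only absolute seeks need shifting; relative ones pass through.
        if whence == io.SEEK_SET:
            offset += self._base
        return self._raw.seek(offset, whence) - self._base

    def tell(self):
        return self._raw.tell() - self._base


# In-memory stand-in for the file in the question: 8-byte header + gzip stream.
payload = bytes(i % 251 for i in range(20000))
raw = io.BytesIO(b"zrxxxxxx" + gzip.compress(payload))
raw.read(8)  # skip the header

h = gzip.GzipFile(fileobj=OffsetFile(raw), mode="rb")
h.seek(8192)
h.seek(8191)  # rewinds to the start of the gzip stream, not of the file
byte = h.read(1)
print(byte == payload[8191:8192])
```

If the compressed data fits in memory, a simpler option may be to read the remainder of the file into an `io.BytesIO` and pass that to `GzipFile`, since offset 0 of the buffer is then the start of the gzip stream.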

huangapple
  • Published on June 5, 2023 at 23:44:20
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76408051.html