Python's zlib doesn't work on CommonCrawl file

Question

I was trying to decompress a file using Python's zlib and it doesn't seem to work. The file is 100MB from Common Crawl and I downloaded it as wet.gz. When I decompress it on the terminal with gunzip, everything works fine, and here are the first few lines of the output:

WARC/1.0
WARC-Type: warcinfo
WARC-Date: 2022-08-20T09:26:35Z
WARC-Filename: CC-MAIN-20220807150925-20220807180925-00000.warc.wet.gz
WARC-Record-ID: <urn:uuid:3f9035e8-8038-4239-a566-c9410b93956d>
Content-Type: application/warc-fields
Content-Length: 371

Software-Info: ia-web-commons.1.1.10-SNAPSHOT-20220804021208
Extracted-Date: Sat, 20 Aug 2022 09:26:35 GMT
robots: checked via crawler-commons 1.4-SNAPSHOT (https://github.com/crawler-commons/crawler-commons)
isPartOf: CC-MAIN-2022-33
operator: Common Crawl Admin (info@commoncrawl.org)
description: Wide crawl of the web for August 2022
publisher: Common Crawl



WARC/1.0
WARC-Type: conversion
WARC-Target-URI: http://100bravert.main.jp/public_html/wiki/index.php?cmd=backup&action=nowdiff&page=Game_log%2F%EF%BC%A7%EF%BC%AD%E6%9F%98&age=53
WARC-Date: 2022-08-07T15:32:56Z
WARC-Record-ID: <urn:uuid:8dd329bf-6717-4d0c-ae05-93445c59fd50>
WARC-Refers-To: <urn:uuid:1e2e972b-4273-468a-953f-28b0e45fb117>
WARC-Block-Digest: sha1:GTEJAN2GXLWBXDRNUEI3LLEHDIPJDPTU
WARC-Identified-Content-Language: jpn
Content-Type: text/plain
Content-Length: 12482

Game_log/GM柘 のバックアップの現在との差分(No.53) - PukiWiki
Game_log/GM柘 のバックアップの現在との差分(No.53)
[ トップ ] [ 新規 | 一覧 | 単語検索 | 最終更新 | ヘルプ ]
バックアップ一覧

However, when I try to use Python's gzip or zlib library, using these code examples:

import gzip
import zlib

# using gzip
fh = gzip.open('wet.gz', 'rb')
data = fh.read(); fh.close()

# using zlib
o = zlib.decompressobj(zlib.MAX_WBITS|16)
result = [o.decompress(open("wet.gz", "rb").read()), o.flush()]

Both of them return this:

WARC/1.0
WARC-Type: warcinfo
WARC-Date: 2022-08-20T09:26:35Z
WARC-Filename: CC-MAIN-20220807150925-20220807180925-00000.warc.wet.gz
WARC-Record-ID: <urn:uuid:3f9035e8-8038-4239-a566-c9410b93956d>
Content-Type: application/warc-fields
Content-Length: 371

Software-Info: ia-web-commons.1.1.10-SNAPSHOT-20220804021208
Extracted-Date: Sat, 20 Aug 2022 09:26:35 GMT
robots: checked via crawler-commons 1.4-SNAPSHOT (https://github.com/crawler-commons/crawler-commons)
isPartOf: CC-MAIN-2022-33
operator: Common Crawl Admin (info@commoncrawl.org)
description: Wide crawl of the web for August 2022
publisher: Common Crawl

So apparently, they can decompress the first record just fine, but every record after it is lost. Is this a bug in Python's zlib/gzip library?

Edit for future readers: I've integrated the accepted answer into my Python package, in case you don't want to mess around:

pip install k1lib

from k1lib.imports import *
lines = cat("wet.gz", text=False, chunks=True) | unzip(text=True)
for line in lines:
    print(line)

This reads the file in binary mode chunk by chunk, decompresses the chunks incrementally, splits the result into lines, and converts them to strings.

Answer 1

Score: 4

Your wet.gz consists of 31,849 gzip members, concatenated. Per the gzip standard, a concatenation of valid gzip streams is itself a valid gzip stream.

Python's decompressobj() does not automatically continue to read and decompress the gzip members after the first. Yes, I would consider this to be a bug, since it does not comply with the gzip standard, though this is a common failure to comply.
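To see this behavior on something smaller than wet.gz, here is a minimal sketch that builds a two-member stream in memory. The payloads are made up for illustration; only standard-library calls (gzip.compress, gzip.decompress, zlib.decompressobj) are used.

```python
import gzip
import zlib

# Two independent gzip members, concatenated into a single stream.
first = gzip.compress(b"first member\n")
second = gzip.compress(b"second member\n")
stream = first + second

# A single decompressobj stops at the end of the first member.
o = zlib.decompressobj(zlib.MAX_WBITS | 16)
partial = o.decompress(stream)

print(partial)                  # data from the first member only
print(o.eof)                    # set once the first member's trailer is read
print(o.unused_data == second)  # the second member is left untouched

# gzip.decompress, by contrast, is documented to handle multi-member data.
whole = gzip.decompress(stream)
print(whole)
```

The second member is not lost, it is simply parked in o.unused_data, which is what makes a loop-based workaround possible.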

The workaround is simple. Put the Python decompression in a loop, continuing to decompress until the input is consumed. After a member has been fully decompressed, o.eof is true and o.unused_data holds the input left over after that member, which is then fed to a fresh decompression object for the next member.

import zlib

data = left = b''
# MAX_WBITS + 16 tells zlib to expect the gzip container format
o = zlib.decompressobj(zlib.MAX_WBITS + 16)
with open("wet.gz", "rb") as f:
    while True:
        got = f.read(32768)
        data += o.decompress(left + got)
        left = b''
        if o.eof:
            # end of one gzip member: keep the leftover input and
            # start a fresh decompressor for the next member
            left = o.unused_data
            o = zlib.decompressobj(zlib.MAX_WBITS + 16)
        if len(got) == 0 and len(left) == 0:
            break

(That also avoids loading the entire input into memory. For illustration, it accumulates the entire output in memory, but if possible that data should be processed as it arrives instead.)
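As a rough sketch of processing the data as it arrives, the same loop can be wrapped in a generator that yields each decompressed chunk instead of accumulating it. The name iter_decompressed and its chunk_size parameter are hypothetical, introduced here for illustration, and the demo runs on made-up in-memory data rather than wet.gz.

```python
import gzip
import io
import zlib

def iter_decompressed(fileobj, chunk_size=32768):
    """Yield decompressed chunks from a possibly multi-member gzip stream."""
    o = zlib.decompressobj(zlib.MAX_WBITS | 16)
    left = b""
    while True:
        got = fileobj.read(chunk_size)
        out = o.decompress(left + got)
        left = b""
        if out:
            yield out
        if o.eof:
            # One member finished: carry the leftover input over to a
            # fresh decompressor for the next member.
            left = o.unused_data
            o = zlib.decompressobj(zlib.MAX_WBITS | 16)
        if not got and not left:
            break

# Demo on an in-memory stream of three members (made-up data).
parts = [b"alpha\n" * 1000, b"beta\n" * 1000, b"gamma\n"]
stream = io.BytesIO(b"".join(gzip.compress(p) for p in parts))
total = sum(len(chunk) for chunk in iter_decompressed(stream, chunk_size=64))
print(total)
```

With a real file, the caller would pass open("wet.gz", "rb") and handle each chunk inside the for loop, so memory use stays bounded by the chunk size rather than by the size of the output.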

Reading with Python's gzip module (gzip.open() followed by read()) works for me on wet.gz, decompressing the whole thing. Perhaps you have an older version of Python.
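If the gzip module on your Python does read past the first member, as reported above, it can also iterate over the file line by line. The sketch below assumes exactly that, and uses a small temporary two-member file as a stand-in for wet.gz (the file name and record contents are invented).

```python
import gzip
import os
import tempfile

# Build a small multi-member file as a stand-in for wet.gz.
path = os.path.join(tempfile.mkdtemp(), "sample.gz")
with open(path, "wb") as f:
    f.write(gzip.compress(b"WARC/1.0\nrecord one\n"))
    f.write(gzip.compress(b"WARC/1.0\nrecord two\n"))

# In text mode, gzip.open yields decoded lines across member boundaries.
with gzip.open(path, "rt", encoding="utf-8") as fh:
    lines = [line.rstrip("\n") for line in fh]

print(lines)
```

If only the first record shows up when you run this, your gzip module is stopping at the first member and the zlib loop is the safer route.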

Posted by huangapple on June 12, 2023, 04:54:07
Source: https://go.coder-hub.com/76452480.html