Python's zlib doesn't work on CommonCrawl file

Question


I was trying to unzip a file using Python's zlib and it doesn't seem to work. The file is 100MB, from Common Crawl, and I downloaded it as wet.gz. When I unzip it in the terminal with gunzip, everything works fine, and here are the first few lines of the output:

  WARC/1.0
  WARC-Type: warcinfo
  WARC-Date: 2022-08-20T09:26:35Z
  WARC-Filename: CC-MAIN-20220807150925-20220807180925-00000.warc.wet.gz
  WARC-Record-ID: <urn:uuid:3f9035e8-8038-4239-a566-c9410b93956d>
  Content-Type: application/warc-fields
  Content-Length: 371
  Software-Info: ia-web-commons.1.1.10-SNAPSHOT-20220804021208
  Extracted-Date: Sat, 20 Aug 2022 09:26:35 GMT
  robots: checked via crawler-commons 1.4-SNAPSHOT (https://github.com/crawler-commons/crawler-commons)
  isPartOf: CC-MAIN-2022-33
  operator: Common Crawl Admin (info@commoncrawl.org)
  description: Wide crawl of the web for August 2022
  publisher: Common Crawl

  WARC/1.0
  WARC-Type: conversion
  WARC-Target-URI: http://100bravert.main.jp/public_html/wiki/index.php?cmd=backup&action=nowdiff&page=Game_log%2F%EF%BC%A7%EF%BC%AD%E6%9F%98&age=53
  WARC-Date: 2022-08-07T15:32:56Z
  WARC-Record-ID: <urn:uuid:8dd329bf-6717-4d0c-ae05-93445c59fd50>
  WARC-Refers-To: <urn:uuid:1e2e972b-4273-468a-953f-28b0e45fb117>
  WARC-Block-Digest: sha1:GTEJAN2GXLWBXDRNUEI3LLEHDIPJDPTU
  WARC-Identified-Content-Language: jpn
  Content-Type: text/plain
  Content-Length: 12482

  Game_log/GM柘 のバックアップの現在との差分(No.53) - PukiWiki
  Game_log/GM柘 のバックアップの現在との差分(No.53)
  [ トップ ] [ 新規 | 一覧 | 単語検索 | 最終更新 | ヘルプ ]
  バックアップ一覧
However, when I try to use Python's gzip or zlib libraries with these code examples:

  import gzip
  import zlib

  # using gzip
  fh = gzip.open('wet.gz', 'rb')
  data = fh.read()
  fh.close()

  # using zlib
  o = zlib.decompressobj(zlib.MAX_WBITS | 16)
  result = [o.decompress(open("wet.gz", "rb").read()), o.flush()]

Both of them return this:

  WARC/1.0
  WARC-Type: warcinfo
  WARC-Date: 2022-08-20T09:26:35Z
  WARC-Filename: CC-MAIN-20220807150925-20220807180925-00000.warc.wet.gz
  WARC-Record-ID: <urn:uuid:3f9035e8-8038-4239-a566-c9410b93956d>
  Content-Type: application/warc-fields
  Content-Length: 371
  Software-Info: ia-web-commons.1.1.10-SNAPSHOT-20220804021208
  Extracted-Date: Sat, 20 Aug 2022 09:26:35 GMT
  robots: checked via crawler-commons 1.4-SNAPSHOT (https://github.com/crawler-commons/crawler-commons)
  isPartOf: CC-MAIN-2022-33
  operator: Common Crawl Admin (info@commoncrawl.org)
  description: Wide crawl of the web for August 2022
  publisher: Common Crawl

So apparently they can decompress the first record just fine, but everything after it is lost. Is this a bug in Python's zlib/gzip library?

Edit for future readers: I've integrated the accepted answer into my Python package if you don't want to mess around:

  pip install k1lib

  from k1lib.imports import *
  lines = cat("wet.gz", text=False, chunks=True) | unzip(text=True)
  for line in lines:
      print(line)

This reads the file in binary mode chunk by chunk, decompresses the chunks incrementally, splits the result into lines, and converts them to strings.
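
For comparison, here is a dependency-free sketch that does roughly the same thing with the standard library alone; it assumes a Python version whose gzip module handles multi-member files, which the answer below reports is the case for recent versions:

  import gzip

  # Text mode gives streamed reads, incremental decompression, line
  # splitting, and bytes-to-str decoding in one go.
  with gzip.open('wet.gz', 'rt', encoding='utf-8', errors='replace') as f:
      for line in f:
          print(line, end='')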

Answer 1

Score: 4


Your wet.gz consists of 31,849 gzip members, concatenated. Per the gzip standard, the concatenation of valid gzip streams is itself a valid gzip stream.

Python's decompressobj() is not automatically continuing to read and decompress the gzip members after the first. Yes, I would consider this to be a bug, since it is not complying with the gzip standard. Though this is a common failure to comply.

The workaround is simple. Put the Python decompression in a loop, continuing to decompress until the input is consumed. o.unused_data will return the unused input leftover after decompressing the last member, for use in decompressing the next member.

  import zlib

  f = open("wet.gz", "rb")
  o = zlib.decompressobj(zlib.MAX_WBITS + 16)
  data = left = b''
  while True:
      got = f.read(32768)
      data += o.decompress(left + got)
      left = b''
      if o.eof:
          # Reached the end of a gzip member: save any leftover input and
          # start a fresh decompressor for the next member.
          left = o.unused_data
          o = zlib.decompressobj(zlib.MAX_WBITS + 16)
      if len(got) == 0 and len(left) == 0:
          break
  f.close()

(That also avoids loading the entire input into memory. For illustration, it accumulates the entire output in memory, but if possible that data should be processed as it arrives instead.)
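
If you do want to process the data as it arrives, the same loop can be restructured as a generator that yields decompressed chunks instead of accumulating them; a minimal sketch (the function name and chunk size are arbitrary choices):

  import zlib

  def decompress_members(f, chunk_size=32768):
      # Yield decompressed chunks from a possibly multi-member gzip stream.
      o = zlib.decompressobj(zlib.MAX_WBITS + 16)
      left = b''
      while True:
          got = f.read(chunk_size)
          chunk = o.decompress(left + got)
          if chunk:
              yield chunk
          left = b''
          if o.eof:
              left = o.unused_data
              o = zlib.decompressobj(zlib.MAX_WBITS + 16)
          if len(got) == 0 and len(left) == 0:
              break

  with open("wet.gz", "rb") as f:
      for chunk in decompress_members(f):
          ...  # process each chunk here instead of storing it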

Python's gzip.read() works for me on wet.gz, decompressing the whole thing. Perhaps you have an older version of Python.
