Python's zlib doesn't work on CommonCrawl file

Question


I was trying to unzip a file using Python's zlib and it doesn't seem to work. The file is 100MB, from Common Crawl, and I downloaded it as wet.gz. When I unzip it in the terminal with gunzip, everything works fine, and here are the first few lines of the output:

  WARC/1.0
  WARC-Type: warcinfo
  WARC-Date: 2022-08-20T09:26:35Z
  WARC-Filename: CC-MAIN-20220807150925-20220807180925-00000.warc.wet.gz
  WARC-Record-ID: <urn:uuid:3f9035e8-8038-4239-a566-c9410b93956d>
  Content-Type: application/warc-fields
  Content-Length: 371
  Software-Info: ia-web-commons.1.1.10-SNAPSHOT-20220804021208
  Extracted-Date: Sat, 20 Aug 2022 09:26:35 GMT
  robots: checked via crawler-commons 1.4-SNAPSHOT (https://github.com/crawler-commons/crawler-commons)
  isPartOf: CC-MAIN-2022-33
  operator: Common Crawl Admin (info@commoncrawl.org)
  description: Wide crawl of the web for August 2022
  publisher: Common Crawl

  WARC/1.0
  WARC-Type: conversion
  WARC-Target-URI: http://100bravert.main.jp/public_html/wiki/index.php?cmd=backup&action=nowdiff&page=Game_log%2F%EF%BC%A7%EF%BC%AD%E6%9F%98&age=53
  WARC-Date: 2022-08-07T15:32:56Z
  WARC-Record-ID: <urn:uuid:8dd329bf-6717-4d0c-ae05-93445c59fd50>
  WARC-Refers-To: <urn:uuid:1e2e972b-4273-468a-953f-28b0e45fb117>
  WARC-Block-Digest: sha1:GTEJAN2GXLWBXDRNUEI3LLEHDIPJDPTU
  WARC-Identified-Content-Language: jpn
  Content-Type: text/plain
  Content-Length: 12482

  Game_log/GM柘 のバックアップの現在との差分(No.53) - PukiWiki
  Game_log/GM柘 のバックアップの現在との差分(No.53)
  [ トップ ] [ 新規 | 一覧 | 単語検索 | 最終更新 | ヘルプ ]
  バックアップ一覧
However, when I try to use Python's gzip or zlib libraries with these code examples:

  import gzip
  import zlib

  # using gzip
  fh = gzip.open('wet.gz', 'rb')
  data = fh.read()
  fh.close()

  # using zlib
  o = zlib.decompressobj(zlib.MAX_WBITS | 16)
  result = [o.decompress(open("wet.gz", "rb").read()), o.flush()]

Both of them return this:

  WARC/1.0
  WARC-Type: warcinfo
  WARC-Date: 2022-08-20T09:26:35Z
  WARC-Filename: CC-MAIN-20220807150925-20220807180925-00000.warc.wet.gz
  WARC-Record-ID: <urn:uuid:3f9035e8-8038-4239-a566-c9410b93956d>
  Content-Type: application/warc-fields
  Content-Length: 371
  Software-Info: ia-web-commons.1.1.10-SNAPSHOT-20220804021208
  Extracted-Date: Sat, 20 Aug 2022 09:26:35 GMT
  robots: checked via crawler-commons 1.4-SNAPSHOT (https://github.com/crawler-commons/crawler-commons)
  isPartOf: CC-MAIN-2022-33
  operator: Common Crawl Admin (info@commoncrawl.org)
  description: Wide crawl of the web for August 2022
  publisher: Common Crawl

So apparently they can decompress the first record just fine, but everything after it is lost. Is this a bug in Python's zlib/gzip library?

Edit for future readers: I've integrated the accepted answer into my Python package if you don't want to mess around:

  pip install k1lib

  from k1lib.imports import *
  lines = cat("wet.gz", text=False, chunks=True) | unzip(text=True)
  for line in lines:
      print(line)

This reads the file in binary mode chunk by chunk, decompresses the chunks incrementally, splits the result into lines, and converts them to strings.
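
For comparison, here is a dependency-free sketch that does roughly the same thing with the standard library alone; it assumes a Python version whose gzip module handles multi-member files, which the answer below reports is the case for recent versions:

  import gzip

  # Text mode gives streamed reads, incremental decompression, line
  # splitting, and bytes-to-str decoding in one go.
  with gzip.open('wet.gz', 'rt', encoding='utf-8', errors='replace') as f:
      for line in f:
          print(line, end='')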

Answer 1

Score: 4


Your wet.gz consists of 31,849 gzip members, concatenated. Per the gzip standard, the concatenation of valid gzip streams is itself a valid gzip stream.

Python's decompressobj() is not automatically continuing to read and decompress the gzip members after the first. Yes, I would consider this to be a bug, since it is not complying with the gzip standard. Though this is a common failure to comply.

The workaround is simple. Put the Python decompression in a loop, continuing to decompress until the input is consumed. o.unused_data will return the unused input leftover after decompressing the last member, for use in decompressing the next member.

  import zlib

  f = open("wet.gz", "rb")
  o = zlib.decompressobj(zlib.MAX_WBITS + 16)
  data = left = b''
  while True:
      got = f.read(32768)
      data += o.decompress(left + got)
      left = b''
      if o.eof:
          # Reached the end of a gzip member: save any leftover input and
          # start a fresh decompressor for the next member.
          left = o.unused_data
          o = zlib.decompressobj(zlib.MAX_WBITS + 16)
      if len(got) == 0 and len(left) == 0:
          break
  f.close()

(That also avoids loading the entire input into memory. For illustration, it accumulates the entire output in memory, but if possible that data should be processed as it arrives instead.)
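
If you do want to process the data as it arrives, the same loop can be restructured as a generator that yields decompressed chunks instead of accumulating them; a minimal sketch (the function name and chunk size are arbitrary choices):

  import zlib

  def decompress_members(f, chunk_size=32768):
      # Yield decompressed chunks from a possibly multi-member gzip stream.
      o = zlib.decompressobj(zlib.MAX_WBITS + 16)
      left = b''
      while True:
          got = f.read(chunk_size)
          chunk = o.decompress(left + got)
          if chunk:
              yield chunk
          left = b''
          if o.eof:
              left = o.unused_data
              o = zlib.decompressobj(zlib.MAX_WBITS + 16)
          if len(got) == 0 and len(left) == 0:
              break

  with open("wet.gz", "rb") as f:
      for chunk in decompress_members(f):
          ...  # process each chunk here instead of storing it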

Python's gzip.read() works for me on wet.gz, decompressing the whole thing. Perhaps you have an older version of Python.
