Python - Pandas "BadGzipFile" Error When Reading in ".json.gz" File

Question

I am trying to read data from a ".json.gz" file into a dataframe. I keep getting a "BadGzipFile" error. However, when I decompress the file manually (i.e., by double-clicking it in Finder), I can open the resulting JSON file without problems. This leads me to believe that the file is fine, but when I run the code below in Python, I receive the "BadGzipFile" error.

I am very new to gzip files and have done a fair bit of research trying to figure out what the issue is. So far, I have been unsuccessful. Any help would be greatly appreciated!

Here is my code:

import os
import json
import gzip
import pandas as pd

file_path = '/data/data_0_0_0.json.gz'

with gzip.open(file_path, 'rb') as f:
    df = pd.read_json(f, compression='gzip', lines=True)

And here is the error I am receiving:

BadGzipFile: Not a gzipped file (b'{"')

Answer 1

Score: 3

What's happening with your code here:

with gzip.open(file_path, 'rb') as f:
    df = pd.read_json(f, compression='gzip', lines=True)

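You open the gzip file at file_path with gzip.open, which already decompresses it, so f yields plain JSON. Passing compression='gzip' then tells pandas that the thing you opened (f) is itself another gzip file, and pandas tries to decompress it a second time. It isn't gzip; it's JSON. The b'{"' in the BadGzipFile message shows the first two bytes the decompressor found: an opening brace and a quote, the start of a JSON object, where it expected the gzip magic number (the bytes 0x1f 0x8b).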
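To see the double decompression in isolation, here is a minimal sketch that uses only the standard library. The sample bytes and the helper names looks_like_gzip and try_second_decompress are made up for illustration; they are not part of the asker's code or of pandas.

```python
import gzip
import io

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip stream


def looks_like_gzip(data: bytes) -> bool:
    """Return True if the bytes start with the gzip magic number."""
    return data[:2] == GZIP_MAGIC


def try_second_decompress(data: bytes) -> str:
    """Try to gunzip data; return 'ok' or the BadGzipFile message."""
    try:
        gzip.GzipFile(fileobj=io.BytesIO(data)).read()
        return "ok"
    except gzip.BadGzipFile as exc:
        return str(exc)


# Hypothetical sample: two JSON Lines records, compressed once.
raw = b'{"a": 1}\n{"a": 2}\n'
compressed = gzip.compress(raw)

# This is roughly what gzip.open() hands over: already-decompressed JSON.
once = gzip.decompress(compressed)

print(looks_like_gzip(compressed))  # the .gz payload starts with the magic number
print(looks_like_gzip(once))        # the decompressed JSON does not
print(try_second_decompress(once))  # decompressing again fails
```

The last call fails for the same reason as the original code: the data was already decompressed, so its first bytes are JSON rather than a gzip header.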


You should either decompress the file yourself with gzip.open and pass the resulting stream to pandas without a compression argument, or let pandas handle the decompression. Don't do both.

The first would be:

with gzip.open(file_path, 'rb') as f:
    df = pd.read_json(f, lines=True)

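The second is actually easier. pd.read_json defaults to compression='infer', which picks the compression format from the file extension. Since your file ends with .gz, you can pass the path directly (keeping lines=True, assuming, as your original call suggests, that the file is in JSON Lines format):

df = pd.read_json(file_path, lines=True)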
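As a quick check that the two approaches are equivalent, the sketch below writes a small gzip-compressed JSON Lines file to a temporary directory and reads it back both ways. The records and the file name sample.json.gz are invented for the example, and it assumes a reasonably recent pandas is installed.

```python
import gzip
import os
import tempfile

import pandas as pd

# Hypothetical sample data standing in for the real file.
records = '{"id": 1, "name": "a"}\n{"id": 2, "name": "b"}\n'

tmp_dir = tempfile.mkdtemp()
sample_path = os.path.join(tmp_dir, "sample.json.gz")
with gzip.open(sample_path, "wt", encoding="utf-8") as f:
    f.write(records)

# Option 1: decompress with gzip and hand pandas the plain JSON Lines stream.
with gzip.open(sample_path, "rb") as f:
    df_manual = pd.read_json(f, lines=True)

# Option 2: pass the path; compression is inferred from the .gz suffix.
df_inferred = pd.read_json(sample_path, lines=True)

print(df_manual)
print(df_manual.equals(df_inferred))
```

In both options the file is decompressed exactly once, which is the point of the fix.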

huangapple
  • Posted on February 13, 2023, 23:33:58
  • When reposting, please keep the link to this article: https://go.coder-hub.com/75438003.html