How to efficiently parse large JSON files in Python?

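Question

Summary:
I am working on a project where I need to parse very large JSON files (over 10 GB) in Python, and I am looking for ways to speed up my parsing code. I have tried the built-in json module, but loading the entire file into memory takes too long. Are there alternative libraries or techniques that experienced developers use to handle JSON files of this size in Python?

Explanation:
I need to analyze and extract data from very large JSON files. The files are too large to load into memory all at once, so I need an efficient way to parse them. I have tried the built-in json module, but it takes a long time to load the file into memory. I have also tried ijson and jsonlines, but the performance is still not satisfactory. I am looking for suggestions on alternative libraries or techniques that could help me optimize my parsing code and speed up processing.

Example of the JSON: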

{
  "orders": [
    {
      "order_id": "1234",
      "date": "2022-05-10",
      "total_amount": 245.50,
      "customer": {
        "name": "John Doe",
        "email": "johndoe@example.com",
        "address": {
          "street": "123 Main St",
          "city": "Anytown",
          "state": "CA",
          "zip": "12345"
        }
      },
      "items": [
        {
          "product_id": "6789",
          "name": "Widget",
          "price": 20.00,
          "quantity": 5
        },
        {
          "product_id": "2345",
          "name": "Gizmo",
          "price": 15.50,
          "quantity": 4
        }
      ]
    },
    {
      "order_id": "5678",
      "date": "2022-05-09",
      "total_amount": 175.00,
      "customer": {
        "name": "Jane Smith",
        "email": "janesmith@example.com",
        "address": {
          "street": "456 Main St",
          "city": "Anytown",
          "state": "CA",
          "zip": "12345"
        },
        "phone": "555-555-1212"
      },
      "items": [
        {
          "product_id": "9876",
          "name": "Thingamajig",
          "price": 25.00,
          "quantity": 3
        },
        {
          "product_id": "3456",
          "name": "Doodad",
          "price": 10.00,
          "quantity": 10
        }
      ]
    },
    {
      "order_id": "9012",
      "date": "2022-05-08",
      "total_amount": 150.25,
      "customer": {
        "name": "Bob Johnson",
        "email": "bjohnson@example.com",
        "address": {
          "street": "789 Main St",
          "city": "Anytown",
          "state": "CA",
          "zip": "12345"
        },
        "company": "ABC Inc."
      },
      "items": [
        {
          "product_id": "1234",
          "name": "Whatchamacallit",
          "price": 12.50,
          "quantity": 5
        },
        {
          "product_id": "5678",
          "name": "Doohickey",
          "price": 7.25,
          "quantity": 15
        }
      ]
    }
  ]
}

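Version:
Python 3.8

Here's what I tried:

# Attempt 1: load the whole file at once with the built-in json module
import json

with open('large_file.json') as f:
    data = json.load(f)

# Attempt 2: walk the parse events with ijson and print every "name" value
import ijson

filename = 'large_file.json'
with open(filename, 'r') as f:
    parser = ijson.parse(f)
    for prefix, event, value in parser:
        if prefix.endswith('.name'):
            print(value)

# Attempt 3: read the file with jsonlines (which expects one JSON document per line)
import jsonlines

filename = 'large_file.json'
with open(filename, 'r') as f:
    reader = jsonlines.Reader(f)
    for obj in reader:
        print(obj)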
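
For comparison with the attempts above, here is a minimal standard-library sketch of incremental parsing that yields one order at a time instead of loading the whole document. It is an illustration under assumptions, not a tuned solution: the helper name iter_array_items and its parameters are hypothetical, it assumes the array elements are JSON objects (as in the sample), and it finds the key with a plain text search, so it would be fooled by the same text appearing earlier inside a string value. It bounds memory use; it is not claimed to be faster than ijson.

```python
import io
import json


def iter_array_items(fp, key, chunk_size=65536):
    """Yield the objects of the array stored under `key`, one at a time.

    Hypothetical helper: assumes array elements are JSON objects and that
    the first occurrence of '"key"' in the text is the actual key.
    """
    decoder = json.JSONDecoder()
    marker = '"%s"' % key
    buf = ""

    # Phase 1: read chunks until the '[' that opens the target array is buffered.
    while True:
        pos = buf.find(marker)
        start = buf.find("[", pos + len(marker)) if pos != -1 else -1
        if start != -1:
            buf = buf[start + 1:]
            break
        chunk = fp.read(chunk_size)
        if not chunk:
            return  # key not found
        buf += chunk

    # Phase 2: decode one element at a time, refilling the buffer when an
    # element is still incomplete.
    idx = 0
    while True:
        while idx < len(buf) and buf[idx] in " \t\r\n,":
            idx += 1  # skip separators between elements
        if idx < len(buf) and buf[idx] == "]":
            return  # end of the array
        try:
            item, idx = decoder.raw_decode(buf, idx)
        except json.JSONDecodeError:
            chunk = fp.read(chunk_size)
            if not chunk:
                raise  # truncated or malformed input
            buf = buf[idx:] + chunk  # drop consumed text, append new data
            idx = 0
            continue
        yield item


# Tiny in-memory stand-in for large_file.json; a small chunk size forces refills.
sample = io.StringIO('{"orders": [{"order_id": "1234"}, {"order_id": "5678"}]}')
order_ids = [order["order_id"] for order in iter_array_items(sample, "orders", chunk_size=16)]
print(order_ids)
```

With a real file, the same generator would be fed an open text-mode file object, for example open('large_file.json', encoding='utf-8'), and each order could be processed and discarded inside the loop.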

Answer 1

Score: 0

You could try pandas, which in theory can also handle JSON, or even SQLite, which can parse JSON, store it in columns, and query it. I would recommend pandas, as it is easier to use and has more documentation online. In pandas you could do it like this:

import pandas as pd
file = pd.read_json("your-filename.json")
print(file)
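
The answer also mentions that SQLite can store and query JSON. Here is a minimal sketch of that idea, assuming the SQLite library bundled with your Python build includes the JSON functions (json_extract). The table name orders, the column doc, and the sample rows are made up for illustration. For a file that does not fit in memory, the rows would have to come from an incremental parser rather than from an in-memory list.

```python
import json
import sqlite3

# Illustrative stand-ins for orders read from the large file.
orders = [
    {"order_id": "1234", "total_amount": 245.50, "customer": {"name": "John Doe"}},
    {"order_id": "5678", "total_amount": 175.00, "customer": {"name": "Jane Smith"}},
]

conn = sqlite3.connect(":memory:")  # use a file path to keep the data on disk
conn.execute("CREATE TABLE orders (doc TEXT)")

# Store each order as one JSON document per row.
conn.executemany(
    "INSERT INTO orders (doc) VALUES (?)",
    [(json.dumps(order),) for order in orders],
)

# Query inside the stored JSON with json_extract and JSON path expressions.
rows = conn.execute(
    "SELECT json_extract(doc, '$.order_id'), json_extract(doc, '$.customer.name') "
    "FROM orders WHERE json_extract(doc, '$.total_amount') > ?",
    (200,),
).fetchall()
print(rows)
conn.close()
```

Once the orders are in a table, repeated queries no longer require re-parsing the original file.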

huangapple
  • Published on May 10, 2023, 12:10:32
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76214829.html