February 26, 2023, 20:39:53

JSONDecodeError when trying to read and format multiple json files in a directory in Python

Question

I am trying to read and format multiple json files in a directory using Python. I have created a function load_json_to_dataframe to load and format the json data into a pandas dataframe, and another function read_json_files to read and append each dataframe to a list. However, I keep getting a JSONDecodeError when I run the code.

Here is the code I am using:

import os
import pandas as pd
import json

def load_json_to_dataframe(json_file_path):
    with open(json_file_path, 'r') as json_file:
        doc = json.load(json_file)
        return pd.json_normalize(doc)

def read_json_files(folder_path):
    dataframes = []
    json_files = os.listdir(folder_path)
    for json_file in json_files:
        if json_file.endswith('.json'):
            df = load_json_to_dataframe(os.path.join(folder_path, json_file))
            dataframes.append(df)
    return pd.concat(dataframes, ignore_index=True)

folder_path = 'path/to/json/files'
combined_dataframe = read_json_files(folder_path)

And this is the error message I am receiving:

JSONDecodeError: Expecting value: line 1 column 1 (char 0)

I am not sure what is causing this error or how to fix it. Can anyone help me figure out what I am doing wrong and how to fix it? Thanks in advance.

Here is an example of my data: https://drive.google.com/file/d/1h2J-e0cF9IbbWVO8ugrXMGdQTn-dGtsA/view?usp=sharing

Update:
One file had a different format from the others, so it could not be parsed. I have deleted it, and now I get a different error:

---------------------------------------------------------------------------
MemoryError                               Traceback (most recent call last)
Cell In[1], line 20
     17     return pd.concat(dataframes, ignore_index=True)
     19 folder_path = 'C:/Users/gusta/Desktop/business/Emprendimiento'
---> 20 combined_dataframe = read_json_files(folder_path)
Cell In[1], line 17, in read_json_files(folder_path)
     15         df = load_json_to_dataframe(os.path.join(folder_path, json_file))
     16         dataframes.append(df)
---> 17 return pd.concat(dataframes, ignore_index=True)
File c:\Users\gusta\AppData\Local\Programs\Python\Python311\Lib\site-packages\pandas\util\_decorators.py:331, in deprecate_nonkeyword_arguments.<locals>.decorate.<locals>.wrapper(*args, **kwargs)
    325 if len(args) > num_allow_args:
    326     warnings.warn(
    327         msg.format(arguments=_format_argument_list(allow_args)),
    328         FutureWarning,
    329         stacklevel=find_stack_level(),
    330     )
--> 331 return func(*args, **kwargs)
File c:\Users\gusta\AppData\Local\Programs\Python\Python311\Lib\site-packages\pandas\core\reshape\concat.py:381, in concat(objs, axis, join, ignore_index, keys, levels, names, verify_integrity, sort, copy)
    159 """
    160 Concatenate pandas objects along a particular axis.
    161 
...
    186 return self._blknos
File c:\Users\gusta\AppData\Local\Programs\Python\Python311\Lib\site-packages\pandas\_libs\internals.pyx:718, in pandas._libs.internals.BlockManager._rebuild_blknos_and_blklocs()
MemoryError: Unable to allocate 966. KiB for an array with shape (123696,) and data type int64

Answer 1

Score: 0

I eventually solved this with the code below. There were two main issues:

I was not specifying which levels of the .json file to extract, so loading each file in full consumed too much memory.
Because of that, I changed the code to process one product category at a time, which worked.

In short, the fix was to extract only the required levels from each .json file instead of extracting the entire file and filtering afterwards, and to process a single product category per run to reduce memory usage.

import pandas as pd
import json
import os

# Function to load a JSON file into a Pandas DataFrame
def load_json_to_dataframe(json_file_path):
    with open(json_file_path, 'r') as json_file:
        # Load the JSON file into a Python object
        doc = json.load(json_file)
        # Get the creation time of the file (seconds since the epoch)
        file_creation_time = os.path.getctime(json_file_path)
        # Convert the creation time to a datetime object
        file_creation_time = pd.to_datetime(file_creation_time, unit='s')
        # Normalize the JSON data into a flat table
        df = pd.json_normalize(doc, meta=['id', 'title', 'condition', 'permalink',
                                          'category_id', 'domain_id', 'thumbnail',
                                          'currency_id', 'price', 'sold_quantity',
                                          'available_quantity', ['seller', 'id'],
                                          ['seller', 'nickname'], ['seller', 'permalink'],
                                          ['address', 'state_name'], ['address', 'city_name']])
        # Keep only the columns of interest, in this order
        df = df[['id', 'title', 'condition', 'permalink', 'category_id', 'domain_id',
                 'thumbnail', 'currency_id', 'price', 'sold_quantity', 'available_quantity',
                 'seller.id', 'seller.nickname', 'seller.permalink', 'address.state_name',
                 'address.city_name']]
        # Add the file creation time as the last column
        df['file_creation_time'] = file_creation_time
        return df

# Function to read multiple JSON files into a single Pandas DataFrame
def read_json_files(folder_path, categories=None, batch_size=1000):
    if categories is None:
        # If no categories are specified, read all files that end in '.json'
        json_files = [f for f in os.listdir(folder_path) if f.endswith('.json')]
    else:
        # If categories are specified, read only files whose names contain one of them
        json_files = [f for f in os.listdir(folder_path) if f.endswith('.json') and any(category in f for category in categories)]
    # Split the list of files into batches of a given size
    batches = [json_files[i:i+batch_size] for i in range(0, len(json_files), batch_size)]
    # Read each batch of files and concatenate it into one DataFrame per batch
    dfs = []
    for batch in batches:
        batch_dfs = [load_json_to_dataframe(os.path.join(folder_path, f)) for f in batch]
        dfs.append(pd.concat(batch_dfs, ignore_index=True))
    # Concatenate all batch DataFrames into a single DataFrame
    return pd.concat(dfs, ignore_index=True)

# Specify the categories of files to read and the folder path
categories = ['MLC4922.json']
folder_path = 'C:/path/to/folder/files'
# Read the JSON files into a single DataFrame
combined_dataframe = read_json_files(folder_path, categories)
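
For anyone hitting the original JSONDecodeError, the sketch below shows one possible way to find the file that fails to parse instead of deleting files by trial and error, and to keep only the needed fields from each record before building a DataFrame. It is an illustration, not the code used above: the helper names load_valid_json and pick_fields are hypothetical, UTF-8 encoding is an assumption, and the field names in the usage note are taken from the column list in the answer.

```python
import json
import os


def load_valid_json(folder_path):
    """Parse every .json file in folder_path and return (docs, skipped).

    docs maps file name -> parsed object.
    skipped maps file name -> the error text for files that failed to parse.
    """
    docs, skipped = {}, {}
    for name in sorted(os.listdir(folder_path)):
        if not name.endswith(".json"):
            continue
        path = os.path.join(folder_path, name)
        try:
            # Assumption: the files are UTF-8 encoded.
            with open(path, "r", encoding="utf-8") as fh:
                docs[name] = json.load(fh)
        except (json.JSONDecodeError, UnicodeDecodeError) as exc:
            # An empty or non-JSON file fails here with
            # "Expecting value: line 1 column 1 (char 0)".
            skipped[name] = str(exc)
    return docs, skipped


def pick_fields(record, fields):
    """Copy only the dotted paths listed in fields out of a nested dict.

    Missing paths are filled with None, so every row has the same keys.
    """
    row = {}
    for field in fields:
        value = record
        for key in field.split("."):
            value = value.get(key) if isinstance(value, dict) else None
        row[field] = value
    return row
```

Printing skipped shows which files need attention. Each row returned by pick_fields is a flat dict, for example pick_fields(record, ['id', 'price', 'seller.id']), so a list of such rows can be passed to pd.DataFrame(rows) without carrying the unused parts of each record.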
