JSONDecodeError when trying to read and format multiple json files in a directory in Python

Question

I am trying to read and format multiple json files in a directory using Python. I have created a function load_json_to_dataframe to load and format the json data into a pandas dataframe, and another function read_json_files to read and append each dataframe to a list. However, I keep getting a JSONDecodeError when I run the code.

Here is the code I am using:

    import os
    import pandas as pd
    import json

    def load_json_to_dataframe(json_file_path):
        with open(json_file_path, 'r') as json_file:
            doc = json.load(json_file)
            return pd.json_normalize(doc)

    def read_json_files(folder_path):
        dataframes = []
        json_files = os.listdir(folder_path)
        for json_file in json_files:
            if json_file.endswith('.json'):
                df = load_json_to_dataframe(os.path.join(folder_path, json_file))
                dataframes.append(df)
        return pd.concat(dataframes, ignore_index=True)

    folder_path = 'path/to/json/files'
    combined_dataframe = read_json_files(folder_path)

And this is the error message I am receiving:

    JSONDecodeError: Expecting value: line 1 column 1 (char 0)

I am not sure what is causing this error or how to fix it. Can anyone help me figure out what I am doing wrong and how to fix it? Thanks in advance.

Here is an example of my data: https://drive.google.com/file/d/1h2J-e0cF9IbbWVO8ugrXMGdQTn-dGtsA/view?usp=sharing
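
A JSONDecodeError at line 1, column 1 usually means one of the files is empty or is not actually JSON. A minimal debugging sketch (assuming the same folder layout as above) that reports which files fail instead of aborting on the first bad one:

    import json
    import os

    folder_path = 'path/to/json/files'  # same placeholder path as above

    # Try to parse every .json file and report the ones that fail,
    # so one malformed file doesn't abort the whole run.
    for name in sorted(os.listdir(folder_path)):
        if name.endswith('.json'):
            try:
                with open(os.path.join(folder_path, name), 'r') as f:
                    json.load(f)
            except json.JSONDecodeError as e:
                print(f'{name}: not valid JSON ({e})')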

Update:
There was a file with a different format than the others, so it was not read correctly; I have deleted it. Now I get a different error:

    ---------------------------------------------------------------------------
    MemoryError                               Traceback (most recent call last)
    Cell In[1], line 20
         17     return pd.concat(dataframes, ignore_index=True)
         19 folder_path = 'C:/Users/gusta/Desktop/business/Emprendimiento'
    ---> 20 combined_dataframe = read_json_files(folder_path)

    Cell In[1], line 17, in read_json_files(folder_path)
         15         df = load_json_to_dataframe(os.path.join(folder_path, json_file))
         16         dataframes.append(df)
    ---> 17     return pd.concat(dataframes, ignore_index=True)

    File c:\Users\gusta\AppData\Local\Programs\Python\Python311\Lib\site-packages\pandas\util\_decorators.py:331, in deprecate_nonkeyword_arguments.<locals>.decorate.<locals>.wrapper(*args, **kwargs)
        325 if len(args) > num_allow_args:
        326     warnings.warn(
        327         msg.format(arguments=_format_argument_list(allow_args)),
        328         FutureWarning,
        329         stacklevel=find_stack_level(),
        330     )
    --> 331 return func(*args, **kwargs)

    File c:\Users\gusta\AppData\Local\Programs\Python\Python311\Lib\site-packages\pandas\core\reshape\concat.py:381, in concat(objs, axis, join, ignore_index, keys, levels, names, verify_integrity, sort, copy)
        159 """
        160 Concatenate pandas objects along a particular axis.
        161
    ...
        186 return self._blknos

    File c:\Users\gusta\AppData\Local\Programs\Python\Python311\Lib\site-packages\pandas\_libs\internals.pyx:718, in pandas._libs.internals.BlockManager._rebuild_blknos_and_blklocs()

    MemoryError: Unable to allocate 966. KiB for an array with shape (123696,) and data type int64
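
As a general workaround for this kind of MemoryError (separate from the fix in the answer below), the combined result can be streamed to disk file by file instead of holding every DataFrame in memory for one large pd.concat. A rough sketch, reusing load_json_to_dataframe from above and writing to a made-up combined.csv:

    import os
    import pandas as pd

    def json_files_to_csv(folder_path, out_path='combined.csv'):
        # Append each file's rows to a CSV as soon as they are parsed,
        # so only one DataFrame is in memory at a time.
        # Assumes every file normalizes to the same set of columns.
        first = True
        for name in os.listdir(folder_path):
            if not name.endswith('.json'):
                continue
            df = load_json_to_dataframe(os.path.join(folder_path, name))
            df.to_csv(out_path, mode='w' if first else 'a', header=first, index=False)
            first = False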

Answer 1

Score: 0

Finally, I found the solution through the following code. The main issues were:

  1. I was not correctly specifying the layers of the .json file, so loading it completely was consuming too much memory.

  2. Because of that, I changed the code to process one product category at a time. This way it worked.

So, I was able to solve the problem by specifying the correct layers to extract from the .json file, rather than extracting the entire file and filtering later. Additionally, I modified the code to process only one product category at a time to improve memory usage.

    import pandas as pd
    import json
    import os

    # Function to load a JSON file into a Pandas DataFrame
    def load_json_to_dataframe(json_file_path):
        with open(json_file_path, 'r') as json_file:
            # Load the JSON file into a Python object
            doc = json.load(json_file)
        # Extract the creation time of the file
        file_creation_time = os.path.getctime(json_file_path)
        # Convert the creation time to a datetime object
        file_creation_time = pd.to_datetime(file_creation_time, unit='s')
        # Normalize the JSON data
        df = pd.json_normalize(doc, meta=['id', 'title', 'condition', 'permalink',
                                          'category_id', 'domain_id', 'thumbnail',
                                          'currency_id', 'price', 'sold_quantity',
                                          'available_quantity', ['seller', 'id'],
                                          ['seller', 'nickname'], ['seller', 'permalink'],
                                          ['address', 'state_name'], ['address', 'city_name']])
        # Keep only the columns of interest, in a fixed order
        df = df[['id', 'title', 'condition', 'permalink', 'category_id', 'domain_id',
                 'thumbnail', 'currency_id', 'price', 'sold_quantity', 'available_quantity',
                 'seller.id', 'seller.nickname', 'seller.permalink', 'address.state_name',
                 'address.city_name']]
        # Add the file creation time as a new column
        df['file_creation_time'] = file_creation_time
        return df

    # Function to read multiple JSON files into a single Pandas DataFrame
    def read_json_files(folder_path, categories=None, batch_size=1000):
        if categories is None:
            # If no categories are specified, read all files that end in '.json'
            json_files = [f for f in os.listdir(folder_path) if f.endswith('.json')]
        else:
            # If categories are specified, read only files that correspond to those categories
            json_files = [f for f in os.listdir(folder_path) if f.endswith('.json') and any(category in f for category in categories)]
        # Split the list of files into batches of a given size
        batches = [json_files[i:i+batch_size] for i in range(0, len(json_files), batch_size)]
        # Read each batch of files into a list of DataFrames
        dfs = []
        for batch in batches:
            batch_dfs = [load_json_to_dataframe(os.path.join(folder_path, f)) for f in batch]
            dfs.append(pd.concat(batch_dfs, ignore_index=True))
        # Concatenate all DataFrames into a single DataFrame
        return pd.concat(dfs, ignore_index=True)

    # Specify the categories of files to read and the folder path
    categories = ['MLC4922.json']
    folder_path = 'C:/path/to/folder/files'
    # Read the JSON files into a single DataFrame
    combined_dataframe = read_json_files(folder_path, categories)
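
One caveat about the call above: pandas applies the meta argument of json_normalize only when record_path is also given; with no record_path it flattens each document whole and meta has no effect. If each file had, say, a top-level 'results' list (a hypothetical structure, since the actual files are not shown here), extracting just that layer would look like the sketch below, with 'query' standing in for any top-level field to copy onto each row:

    import json
    import pandas as pd

    with open('MLC4922.json', 'r') as f:  # filename taken from the categories list above
        doc = json.load(f)

    # Hypothetical shape: {"query": ..., "results": [{...}, ...]}.
    # record_path extracts only the nested list; meta copies the named
    # top-level fields onto every resulting row.
    df = pd.json_normalize(doc, record_path='results', meta=['query'])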
