zipfile.BadZipFile even when I'm not reading a zip file using pandas
Question
The current path contains multiple folders, and each folder contains further folders or xlsx files. I want to walk through every folder and read the xlsx files until there are no more folders and all xlsx files have been read. There are 50+ folders and 2000+ Excel files. Below is my code:
    import os
    import pandas as pd

    current_path = os.getcwd()
    dfs = []

    def process_folder(path):
        for item in os.listdir(path):
            item_path = os.path.join(path, item)
            if os.path.isdir(item_path):
                process_folder(item_path)
            elif item.endswith('.xlsx'):
                df = pd.read_excel(item_path)
                dfs.append(df)

    process_folder(current_path)
    result_df = pd.concat(dfs, ignore_index=True)
    result_df.to_excel('result.xlsx')
When I run the code, it shows the error: "Excel file format cannot be determined, you must specify an engine manually." So I modified the read_excel call: `df = pd.read_excel(item_path, engine='openpyxl')`.
Then I get the error: "zipfile.BadZipFile: File is not a zip file".
However, I am not reading any zip files, so I'm not sure why this error shows up.
Answer 1
Score: 2
You probably have files with the extension .xlsx that are not real Excel files. To find them, you can use:
    import pathlib

    for filename in pathlib.Path.cwd().glob('*.xlsx'):
        with open(filename, 'rb') as xlsx:
            sig = xlsx.read(2)
            if sig != b'PK':
                print(f'"{filename}" does not appear to be a valid Excel file')
Checking for the zip file signature at the start of each file with the .xlsx extension should be sufficient for now.
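Note that `glob('*.xlsx')` only looks at the current folder, while the files in the question sit in nested folders. Below is a minimal sketch of the same signature check applied recursively with `Path.rglob`. The function name `find_suspect_xlsx` and the constant `ZIP_SIGNATURE` are made up for illustration; they are not part of pandas or openpyxl.

```python
from pathlib import Path

# A zip archive (and therefore a genuine .xlsx file) starts with the bytes "PK".
ZIP_SIGNATURE = b"PK"


def find_suspect_xlsx(root):
    """Return paths under root that end in .xlsx but lack the zip signature."""
    suspects = []
    # rglob descends into every subfolder, unlike glob('*.xlsx').
    for path in sorted(Path(root).rglob("*.xlsx")):
        if not path.is_file():
            continue  # skip directories that happen to be named *.xlsx
        with open(path, "rb") as fh:
            if fh.read(2) != ZIP_SIGNATURE:
                suspects.append(path)
    return suspects
```

Calling `find_suspect_xlsx(Path.cwd())` and printing the result should list the files to inspect by hand. One likely candidate is the temporary lock file whose name starts with `~$`, which Excel creates next to a workbook while it is open.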
Answer 2
Score: 1
That error happens because an XLSX file is, essentially, a zip file with a bunch of XML files inside it.
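To see this for yourself, the standard-library `zipfile` module can open a workbook directly. The sketch below uses a hypothetical helper, `list_workbook_parts`, that returns the archive member names, or `None` when the file is not a zip archive at all, which is the situation that makes openpyxl raise `BadZipFile`.

```python
import zipfile


def list_workbook_parts(path):
    """Return the archive member names of path, or None if it is not a zip file."""
    # is_zipfile looks for the zip end-of-central-directory record.
    if not zipfile.is_zipfile(path):
        return None
    with zipfile.ZipFile(path) as archive:
        return archive.namelist()
```

For a genuine workbook the returned list normally includes entries such as `[Content_Types].xml` and `xl/workbook.xml`.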
I would recommend finding out which of the XLSX files are causing the error, then checking whether the affected ones happen to be just renamed XLS files; open those in Excel, then use Save as... to save them as proper XLSX files.
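One way to find the offending files, sketched here under some assumptions, is to catch `zipfile.BadZipFile` while reading and then look at the first bytes of each failed file. The names `collect_failures` and `guess_container` are hypothetical. The `reader` argument stands in for whatever call does the reading, for example `lambda p: pd.read_excel(p, engine='openpyxl')`.

```python
import zipfile
from pathlib import Path

# First 8 bytes of an OLE2 compound file, the container used by legacy .xls.
OLE2_SIGNATURE = bytes.fromhex("D0CF11E0A1B11AE1")
# First bytes of a zip archive, the container used by .xlsx.
ZIP_SIGNATURE = b"PK"


def collect_failures(root, reader):
    """Apply reader to every .xlsx under root; return (results, failed_paths)."""
    results, failed = [], []
    for path in sorted(Path(root).rglob("*.xlsx")):
        try:
            results.append(reader(path))
        except zipfile.BadZipFile:
            # Record the path instead of aborting the whole run.
            failed.append(path)
    return results, failed


def guess_container(path):
    """Classify a file as 'zip', 'ole2' or 'unknown' from its leading bytes."""
    with open(path, "rb") as fh:
        head = fh.read(8)
    if head.startswith(ZIP_SIGNATURE):
        return "zip"
    if head == OLE2_SIGNATURE:
        return "ole2"
    return "unknown"
```

Failed files that report `'ole2'` are the most likely candidates for renamed XLS files and for the "Save as..." fix. Files that report `'unknown'` are probably something else entirely, such as text or HTML saved with the wrong extension.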