zipfile.BadZipFile even when I'm not reading a zip file using pandas

Question

The current path contains multiple folders, and each folder contains further folders or xlsx files. I want to iterate through every folder and read the xlsx files until there are no more folders and all xlsx files have been read. There are 50+ folders and 2000+ Excel files. Below is my code:

import os
import pandas as pd

current_path=os.getcwd()
dfs = []

def process_folder(path):
    for item in os.listdir(path):
        item_path=os.path.join(path, item)

        if os.path.isdir(item_path):
            process_folder(item_path)
        
        elif item.endswith('.xlsx'):
            df = pd.read_excel(item_path)
            dfs.append(df)

process_folder(current_path)
result_df = pd.concat(dfs, ignore_index=True)
result_df.to_excel('result.xlsx')

When I run the code, it shows the error "Excel file cannot be determined, you must specify an engine manually". So I modified the read_excel call: df = pd.read_excel(item_path, engine='openpyxl').
Then I get the error "zipfile.BadZipFile: File is not a zip file".
However, I am not reading any zip files, so I am not sure why this error shows up.

Answer 1

Score: 2

You probably have files with the .xlsx extension that are not real Excel files. To find them, you can use:

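import pathlib

# rglob also searches subfolders, matching the nested layout in the question
for filename in pathlib.Path.cwd().rglob('*.xlsx'):
    with open(filename, 'rb') as xlsx:
        # A real .xlsx file is a zip archive, and zip archives start with b'PK'
        sig = xlsx.read(2)
        if sig != b'PK':
            print(f'"{filename}" does not appear to be a valid Excel file')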

Checking for the zip signature at the start of each .xlsx file should be sufficient for now.
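
If a two-byte check feels too loose, a slightly stricter variant could rely on the standard library's zipfile.is_zipfile instead. The following is only a sketch under assumptions: the function name find_invalid_xlsx and its root parameter are made up for illustration, and it searches subfolders because the question describes nested folders.

```python
import zipfile
from pathlib import Path


def find_invalid_xlsx(root):
    """Return .xlsx paths under root (searched recursively) that are not zip archives.

    Hypothetical helper for illustration; not part of pandas or openpyxl.
    """
    invalid = []
    for path in sorted(Path(root).rglob("*.xlsx")):
        # zipfile.is_zipfile returns False for a non-zip file instead of raising.
        if path.is_file() and not zipfile.is_zipfile(path):
            invalid.append(path)
    return invalid


if __name__ == "__main__":
    for bad in find_invalid_xlsx(Path.cwd()):
        print(f'"{bad}" does not appear to be a valid Excel file')
```

The returned list can then be used to skip or repair those files before calling pd.read_excel on the rest.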

Answer 2

Score: 1

That error happens because an XLSX file is, essentially, a zip file with a bunch of XML files inside it.
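
To see where the exception comes from, it may help to reproduce it with the standard library alone. The sketch below assumes nothing about pandas or openpyxl internals beyond the fact that an XLSX reader has to open the file as a zip archive; the helper name opens_as_zip is made up for illustration.

```python
import io
import zipfile


def opens_as_zip(data: bytes) -> bool:
    """Return True if the given bytes can be opened as a zip archive."""
    try:
        with zipfile.ZipFile(io.BytesIO(data)):
            return True
    except zipfile.BadZipFile:
        # This is the same exception type reported in the question.
        return False


if __name__ == "__main__":
    # Plain text (for example a CSV saved with an .xlsx name) is not a zip archive.
    print(opens_as_zip(b"a,b,c\n1,2,3\n"))
```

So a file that merely carries the .xlsx extension, but holds CSV, HTML, or legacy XLS data, fails as soon as the reader tries to open it as a zip archive.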

I would recommend finding out which of the XLSX files are causing the error, then checking whether the affected ones are just renamed XLS files. If so, open them in Excel and use Save as... to save them as proper XLSX files.
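
One way to check for renamed XLS files without opening each one in Excel is to look at the first bytes of the file. This is a rough sketch, not a full validator: the function name sniff_excel_container and its return labels are made up for illustration. It relies on two well-known signatures, the zip local file header used by XLSX and the OLE2 compound document header used by legacy XLS.

```python
ZIP_MAGIC = b"PK\x03\x04"                         # start of a non-empty zip archive
OLE2_MAGIC = b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1"  # OLE2 compound document header


def sniff_excel_container(path):
    """Classify a file by its leading bytes: 'zip', 'ole2', or 'unknown'.

    Hypothetical helper for illustration; it only inspects the signature
    and does not prove that the file is a readable workbook.
    """
    with open(path, "rb") as fh:
        head = fh.read(8)
    if head.startswith(ZIP_MAGIC):
        return "zip"      # consistent with a real .xlsx file
    if head.startswith(OLE2_MAGIC):
        return "ole2"     # likely a legacy .xls file that was renamed
    return "unknown"      # e.g. CSV or HTML saved with an .xlsx name
```

Files reported as 'ole2' are the candidates for the Save as... step described above, while 'unknown' files need a closer look.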

huangapple
  • Posted on July 3, 2023, 18:17:23
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76603818.html