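Question

The current path contains multiple folders, and each folder contains further folders or xlsx files. I want to iterate through every folder and read the xlsx files until there are no more folders and all xlsx files have been read. There are 50+ folders and 2000+ Excel files. Below is my code: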

import os
import pandas as pd

current_path=os.getcwd()
dfs = []

def process_folder(path):
    for item in os.listdir(path):
        item_path=os.path.join(path, item)

        if os.path.isdir(item_path):
            process_folder(item_path)
        
        elif item.endswith('.xlsx'):
            df = pd.read_excel(item_path)
            dfs.append(df)

process_folder(current_path)
result_df = pd.concat(dfs, ignore_index=True)
result_df.to_excel('result.xlsx')

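When I run the code, it raises the error: "Excel file format cannot be determined, you must specify an engine manually". So I modified read_excel: `df = pd.read_excel(item_path, engine='openpyxl')`.
Then I get the error: "zipfile.BadZipFile: File is not a zip file".
However, I am not reading any zip files, so I am not sure why this error shows up.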


Answer 1

Score: 2

You probably have files with the .xlsx extension that are not real Excel files. To find them, you can use:

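import pathlib

# rglob() also descends into subfolders, which the nested layout in the question needs
for filename in pathlib.Path.cwd().rglob('*.xlsx'):
    with open(filename, 'rb') as xlsx:
        # a zip archive, and therefore a real .xlsx file, starts with b'PK'
        sig = xlsx.read(2)
        if sig != b'PK':
            print(f'"{filename}" does not appear to be a valid Excel file')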

Checking files with the .xlsx extension for the zip signature should be sufficient for now.
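As a slightly stricter variant of the same check, the sketch below walks the tree recursively and separates files that look like zip archives from those that do not. The function name `split_xlsx_by_validity` is hypothetical, and treating names that start with `~$` as suspect is an assumption: Excel creates such temporary lock files next to a workbook while it is open, and they are a common cause of this error.

```python
import zipfile
from pathlib import Path


def split_xlsx_by_validity(root):
    """Return (valid, suspect) lists of .xlsx paths found under root, recursively."""
    valid, suspect = [], []
    for path in sorted(Path(root).rglob("*.xlsx")):
        # Assumption: "~$name.xlsx" is an Excel lock file, not a workbook.
        if path.name.startswith("~$"):
            suspect.append(path)
        # is_zipfile() looks for a readable zip structure, not only the first bytes.
        elif zipfile.is_zipfile(path):
            valid.append(path)
        else:
            suspect.append(path)
    return valid, suspect
```

Only the paths in `valid` would then be passed to `pd.read_excel(path, engine='openpyxl')`, while the paths in `suspect` can be printed and inspected by hand. Passing the zip check does not guarantee a well-formed workbook, only that the file is a zip archive.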


Answer 2

Score: 1

This error occurs because an XLSX file is essentially a zip archive containing a number of XML files.
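To see this for yourself, a minimal sketch using only the standard library is shown here; the helper name `list_workbook_parts` is hypothetical. For a workbook saved by Excel the returned list typically includes entries such as `[Content_Types].xml` and `xl/workbook.xml`.

```python
import zipfile


def list_workbook_parts(path):
    """Return the names of the archive members inside an .xlsx file."""
    # Raises zipfile.BadZipFile if the file is not a zip archive at all,
    # which is the same exception reported in the question.
    with zipfile.ZipFile(path) as archive:
        return archive.namelist()
```

Calling it on a file that only carries the .xlsx extension reproduces the error outside pandas, which shows that the exception comes from the file's contents and not from the reading code.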

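I would recommend finding out which of the XLSX files cause the error, then checking whether the affected ones are just renamed XLS files; if so, open them in Excel and use "Save As..." to save them as proper XLSX files.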
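One way to check for renamed XLS files without opening each one is to look at the first bytes of the file. This is a rough sketch, not a full format detector: the function name `sniff_spreadsheet_format` and its return labels are hypothetical. It relies on the fact that legacy .xls workbooks are stored in an OLE2 compound file, which begins with the bytes `D0 CF 11 E0 A1 B1 1A E1`, while zip archives begin with `PK`.

```python
ZIP_MAGIC = b"PK"
OLE2_MAGIC = bytes.fromhex("D0CF11E0A1B11AE1")


def sniff_spreadsheet_format(path):
    """Guess the container format of a file from its leading bytes."""
    with open(path, "rb") as handle:
        head = handle.read(8)
    if head.startswith(ZIP_MAGIC):
        return "zip"      # consistent with a real .xlsx file
    if head == OLE2_MAGIC:
        return "ole2"     # consistent with a legacy .xls file
    return "unknown"      # e.g. CSV or HTML saved under the wrong extension
```

Files reported as "ole2" are the candidates for re-saving in Excel. Files reported as "unknown" are worth opening in a text editor to see what they really contain.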

