January 9, 2023 09:18:39

Python, Loop to reading in Excel file sheets, change header row number

Question

I have a loop that counts the rows in each sheet of an .xls file. When I open the file itself, the counts do not match what Python returns.

The cause is that the header of the first sheet is in row 3. How can I change my code so that only the first sheet is read starting at row 3, ignoring the first two rows? The rest of my sheets always start at the top row and contain no header. I would like to count the length of the first sheet without the header.

However, when I open the Excel file and count the rows in each sheet, I get:

65522 (header starts in row 3; expecting a count of 65520)
65520
65520
65520
65520
65520
65520
65520
65520
65520
65520
25427

My full code:

from io import BytesIO
from pathlib import Path
from zipfile import ZipFile
import os
import pandas as pd
from os import walk

def process_files(files: list) -> pd.DataFrame:
    file_mapping = {}
    for file in files:
        # data_mapping = pd.read_excel(BytesIO(ZipFile(file).read(Path(file).stem)), sheet_name=None)

        archive = ZipFile(file)
        # find file names in the archive which end in `.xls`, `.xlsx`, `.xlsb`, ...
        files_in_archive = archive.namelist()
        excel_files_in_archive = [
            f for f in files_in_archive if Path(f).suffix[:4] == ".xls"
        ]
        # ensure we only have one file (otherwise, loop or choose one somehow)
        assert len(excel_files_in_archive) == 1
        # read in data
        data_mapping = pd.read_excel(
            BytesIO(archive.read(excel_files_in_archive[0])),
            sheet_name=None, header=None,
        )

        row_counts = []
        for sheet in list(data_mapping.keys()):
            if sheet == 'Sheet1':
                df = data_mapping.get(sheet)[3:]
            else:
                df = data_mapping.get(sheet)
            row_counts.append(len(df))
            print(len(data_mapping.get(sheet)))

        file_mapping.update({file: sum(row_counts)})
    frame = pd.DataFrame([file_mapping]).transpose().reset_index()
    frame.columns = ["file_name", "row_counts"]
    return frame

dir_path = r'D:\test22 - 10'
zip_files = []
for root, dirs, files in os.walk(dir_path):
    for file in files:
        if file.endswith('.zip'):
            zip_files.append(os.path.join(root, file))
df = process_files(zip_files)   # function

Does anyone have an idea what I'm doing wrong?

Answer 1

Score: 4

You just need to use the skiprows argument of pandas.read_excel:
https://pandas.pydata.org/docs/reference/api/pandas.read_excel.html

# read in data; note that with sheet_name=None, skiprows applies to every sheet
data_mapping = pd.read_excel(
    BytesIO(archive.read(excel_files_in_archive[0])),
    sheet_name=None, header=None, skiprows=2
)
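
Since `sheet_name=None` passes the same `skiprows` to every sheet, one option is to parse the sheets one at a time and skip rows only on the first. The sketch below assumes `book` is a `pandas.ExcelFile`; the helper name `read_sheets_skipping_first` and its `first_sheet_skiprows` parameter are made up for illustration and are not part of pandas.

```python
import pandas as pd


def read_sheets_skipping_first(book: pd.ExcelFile, first_sheet_skiprows: int = 2) -> dict:
    """Parse every sheet without a header row, skipping leading rows on the first sheet only.

    Hypothetical helper: `book` only needs the `sheet_names` attribute and the
    `parse` method that pandas.ExcelFile provides.
    """
    frames = {}
    for position, name in enumerate(book.sheet_names):
        # Only the first sheet (position 0) has extra rows above its header.
        skip = first_sheet_skiprows if position == 0 else None
        frames[name] = book.parse(name, header=None, skiprows=skip)
    return frames
```

In the question's code this could be called as `read_sheets_skipping_first(pd.ExcelFile(BytesIO(archive.read(excel_files_in_archive[0]))))`. With `header=None`, the header in row 3 is still read as an ordinary row, so it would need to be sliced off or subtracted if it should not be counted.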

Alternatively, don't use skiprows and slice the first sheet's DataFrame directly:

row_counts = []
for sheet in list(data_mapping.keys()):
    if sheet == 'name of first sheet':
        # drop the two leading rows plus the header row
        df = data_mapping.get(sheet)[3:]
    else:
        df = data_mapping.get(sheet)
    row_counts.append(len(df))
    print(len(df))  # print the sliced length, not the raw sheet length
# or, instead of the loop above, slice based on the sheet's position;
# you don't need to call list() on .keys()
for i, sheet in enumerate(data_mapping.keys()):  # enumerate yields (index, key)
    if i == 0:
        df = data_mapping.get(sheet)[3:]
    else:
        df = data_mapping.get(sheet)
    row_counts.append(len(df))
    print(len(df))
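
The position-based slicing can be checked without an Excel file. This is a minimal sketch that uses small synthetic DataFrames as a stand-in for the dict returned by `pd.read_excel(..., sheet_name=None, header=None)`; the function name `count_rows` and the parameter `leading_rows_on_first` are illustrative assumptions, not names from the question or from pandas.

```python
import pandas as pd


def count_rows(sheets: dict, leading_rows_on_first: int = 3) -> dict:
    """Return {sheet name: row count}, dropping leading rows from the first sheet only."""
    counts = {}
    for position, (name, frame) in enumerate(sheets.items()):
        if position == 0:
            # Two preamble rows plus the header row, per the question's layout.
            frame = frame.iloc[leading_rows_on_first:]
        counts[name] = len(frame)
    return counts


# Synthetic stand-in data (assumption): the first sheet has 3 extra rows on top.
sheets = {
    "Sheet1": pd.DataFrame({0: range(10)}),
    "Sheet2": pd.DataFrame({0: range(7)}),
}
counts = count_rows(sheets)
print(counts)                # {'Sheet1': 7, 'Sheet2': 7}
print(sum(counts.values()))  # 14
```

Slicing by position rather than by name avoids hard-coding 'Sheet1', which matters if the first sheet is named differently across files.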
