Python: loop over Excel file sheets, change header row number
Question

I have a loop that counts the rows in each sheet of an xls file. When I open the file in Excel, the counts do not match what Python returns.

This is because the header of the first sheet is in row 3. How can I alter my code so that ONLY the first sheet is read starting at row 3, ignoring the first two rows? The rest of my sheets ALWAYS start at the top row and contain no header. I would like to count the length of my first sheet without the header included.

However, when I open the file in Excel and count the rows in each sheet, I get:

65522 (header starts in row 3; expecting a count of 65520)
65520
65520
65520
65520
65520
65520
65520
65520
65520
65520
25427

My full code:

from io import BytesIO
from pathlib import Path
from zipfile import ZipFile
import os
import pandas as pd


def process_files(files: list) -> pd.DataFrame:
    file_mapping = {}
    for file in files:
        archive = ZipFile(file)

        # find file names in the archive which end in `.xls`, `.xlsx`, `.xlsb`, ...
        files_in_archive = archive.namelist()
        excel_files_in_archive = [
            f for f in files_in_archive if Path(f).suffix[:4] == ".xls"
        ]
        # ensure we only have one file (otherwise, loop or choose one somehow)
        assert len(excel_files_in_archive) == 1

        # read in every sheet, without treating any row as a header
        data_mapping = pd.read_excel(
            BytesIO(archive.read(excel_files_in_archive[0])),
            sheet_name=None, header=None,
        )

        # count the rows of each sheet in this workbook
        row_counts = []
        for sheet in list(data_mapping.keys()):
            if sheet == 'Sheet1':
                df = data_mapping.get(sheet)[3:]
            else:
                df = data_mapping.get(sheet)
            row_counts.append(len(df))
            print(len(data_mapping.get(sheet)))

        file_mapping.update({file: sum(row_counts)})

    frame = pd.DataFrame([file_mapping]).transpose().reset_index()
    frame.columns = ["file_name", "row_counts"]

    return frame


dir_path = r'D:\test22 - 10'

# collect every zip file below dir_path
zip_files = []
for root, dirs, files in os.walk(dir_path):
    for file in files:
        if file.endswith('.zip'):
            zip_files.append(os.path.join(root, file))
df = process_files(zip_files)

Does anyone have an idea what I'm doing wrong?

Answer 1

Score: 4

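You just need to use the skiprows argument of pd.read_excel:
https://pandas.pydata.org/docs/reference/api/pandas.read_excel.html

# read in data
data_mapping = pd.read_excel(
    BytesIO(archive.read(excel_files_in_archive[0])),
    sheet_name=None, header=None, skiprows=2
)

Note that with sheet_name=None the same skiprows value is applied to every sheet, so this also drops the first two rows of the sheets that start at the top row.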
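
If only the first sheet should lose its leading rows, one option is to parse each sheet separately with its own skiprows value. The following is a minimal sketch, not the asker's final code: the function name count_rows_per_sheet and the parameter first_sheet_skiprows are illustrative and not part of pandas, and the default of 2 assumes the header row should still be counted (pass 3 to drop it as well).

```python
from io import BytesIO

import pandas as pd


def count_rows_per_sheet(source, first_sheet_skiprows=2):
    """Return {sheet_name: row_count}, skipping leading rows on the first sheet only."""
    counts = {}
    # ExcelFile opens the workbook once; each sheet is then parsed on its own.
    with pd.ExcelFile(source) as workbook:
        for position, sheet_name in enumerate(workbook.sheet_names):
            # Only the first sheet (position 0) has rows to skip.
            skip = first_sheet_skiprows if position == 0 else 0
            frame = workbook.parse(sheet_name, header=None, skiprows=skip)
            counts[sheet_name] = len(frame)
    return counts
```

Inside process_files this could be called as count_rows_per_sheet(BytesIO(archive.read(excel_files_in_archive[0]))), with sum(counts.values()) giving the per-file total.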

Alternatively, leave out skiprows and slice the first sheet's DataFrame directly:

row_counts = []
for sheet in list(data_mapping.keys()):
    if sheet == 'name of first sheet':
        # [3:] drops the two leading rows and the header row;
        # use [2:] if the header row should still be counted
        df = data_mapping.get(sheet)[3:]
    else:
        df = data_mapping.get(sheet)
    row_counts.append(len(df))
    print(len(df))  # length after slicing, not of the raw sheet

Or select the first sheet by its position instead of its name (there is no need to call list() on .keys()):

row_counts = []
# enumerate yields (index, key), so the index comes first
for i, sheet in enumerate(data_mapping.keys()):
    if i == 0:
        df = data_mapping.get(sheet)[3:]
    else:
        df = data_mapping.get(sheet)
    row_counts.append(len(df))
    print(len(df))
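
The position-based count can also be done without slicing, by subtracting the number of leading rows from the length of the first sheet. This is a minimal sketch under a few assumptions: the mapping keeps the sheets in workbook order (as the dict returned by pd.read_excel with sheet_name=None does), count_data_rows and first_sheet_skip are hypothetical names, and plain lists stand in for DataFrames because only len() is needed.

```python
def count_data_rows(sheets, first_sheet_skip=2):
    """Map each sheet name to its row count, ignoring leading rows on the first sheet."""
    counts = {}
    for position, (name, rows) in enumerate(sheets.items()):
        # Only the first sheet (position 0) carries extra leading rows.
        skip = first_sheet_skip if position == 0 else 0
        # max(..., 0) guards against sheets shorter than the skip.
        counts[name] = max(len(rows) - skip, 0)
    return counts


# Stand-in data: plain lists instead of DataFrames, with scaled-down sizes.
example = {
    "Sheet1": list(range(7)),
    "Sheet2": list(range(5)),
    "Sheet3": list(range(2)),
}
counts = count_data_rows(example)
print(counts)                # {'Sheet1': 5, 'Sheet2': 5, 'Sheet3': 2}
print(sum(counts.values()))  # 12
```

Passing data_mapping from the question in place of example would give the per-sheet counts, and the sum is the value stored per file.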

huangapple
  • Posted on January 9, 2023, 09:18:39
  • Please keep this link when reposting: https://go.coder-hub.com/75052425.html