Python: loop over the sheets of an Excel file and change the header row number
Question
I have a loop that counts the rows in each sheet of an xls file. When I open the xls file itself, the counts do not match what Python returns.
This is because the header of the first sheet is in row 3. How can I alter my code so that ONLY the first sheet is read starting at row 3, ignoring the first two lines? The rest of my sheets ALWAYS start at the top row and contain no header. I would like to count the length of my first sheet without the header.
However, when I open my Excel file and count the rows in each sheet, I get:
65522 (header starts in row 3; I expect a count of 65520)
65520
65520
65520
65520
65520
65520
65520
65520
65520
65520
25427
My full code:
from io import BytesIO
from pathlib import Path
from zipfile import ZipFile
import os
import pandas as pd
from os import walk

def process_files(files: list) -> pd.DataFrame:
    file_mapping = {}
    for file in files:
        #data_mapping = pd.read_excel(BytesIO(ZipFile(file).read(Path(file).stem)), sheet_name=None)
        archive = ZipFile(file)
        # find file names in the archive which end in `.xls`, `.xlsx`, `.xlsb`, ...
        files_in_archive = archive.namelist()
        excel_files_in_archive = [
            f for f in files_in_archive if Path(f).suffix[:4] == ".xls"
        ]
        # ensure we only have one file (otherwise, loop or choose one somehow)
        assert len(excel_files_in_archive) == 1
        # read in data
        data_mapping = pd.read_excel(
            BytesIO(archive.read(excel_files_in_archive[0])),
            sheet_name=None, header=None,
        )
        row_counts = []
        for sheet in list(data_mapping.keys()):
            if sheet == 'Sheet1':
                df = data_mapping.get(sheet)[3:]
            else:
                df = data_mapping.get(sheet)
            row_counts.append(len(df))
            print(len(data_mapping.get(sheet)))
        file_mapping.update({file: sum(row_counts)})
    frame = pd.DataFrame([file_mapping]).transpose().reset_index()
    frame.columns = ["file_name", "row_counts"]
    return frame

dir_path = r'D:\test22 - 10'
zip_files = []
for root, dirs, files in os.walk(dir_path):
    for file in files:
        if file.endswith('.zip'):
            zip_files.append(os.path.join(root, file))
df = process_files(zip_files)  # function
Does anyone have an idea what I'm doing wrong?
Answer 1
Score: 4
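You just need to use the skiprows argument of pd.read_excel (https://pandas.pydata.org/docs/reference/api/pandas.read_excel.html). Be aware that with sheet_name=None the same skiprows value is applied to every sheet, not only the first one: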
# read in data
data_mapping = pd.read_excel(
    BytesIO(archive.read(excel_files_in_archive[0])),
    sheet_name=None, header=None, skiprows=2
)
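A minimal sketch of skipping rows in the first sheet only, assuming the workbook is opened once as a pd.ExcelFile and each sheet is parsed separately. The helper name read_sheets_skipping_first and its first_sheet_skiprows parameter are made up for this illustration; pd.ExcelFile, sheet_names and parse are regular pandas API.

```python
import pandas as pd


def read_sheets_skipping_first(xls, first_sheet_skiprows=2):
    """Parse every sheet without a header row.

    Leading rows are skipped only in the first sheet.
    `xls` is an open pd.ExcelFile; the helper itself is hypothetical.
    """
    frames = {}
    for position, name in enumerate(xls.sheet_names):
        # Only the first sheet (position 0) has rows to skip.
        skip = first_sheet_skiprows if position == 0 else 0
        frames[name] = xls.parse(name, header=None, skiprows=skip)
    return frames


# Usage with the names from the question's code (not executed here):
# xls = pd.ExcelFile(BytesIO(archive.read(excel_files_in_archive[0])))
# data_mapping = read_sheets_skipping_first(xls)
# row_counts = [len(df) for df in data_mapping.values()]
```

With header=None the header row itself still counts as data, so first_sheet_skiprows should cover every row that must not be counted.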
Or don't use skiprows and slice the first sheet's DataFrame directly:
row_counts = []
for sheet in list(data_mapping.keys()):
    if sheet == 'name of first sheet':
        df = data_mapping.get(sheet)[3:]
    else:
        df = data_mapping.get(sheet)
    row_counts.append(len(df))
    # print the length of the sliced frame, not of the original sheet
    print(len(df))
# Or pick the first sheet by its position; you don't need to call list() on .keys().
# enumerate() yields (index, item), so the index comes first.
for i, sheet in enumerate(data_mapping.keys()):
    if i == 0:
        df = data_mapping.get(sheet)[3:]
    else:
        df = data_mapping.get(sheet)
    row_counts.append(len(df))
    # print the length of the sliced frame, not of the original sheet
    print(len(df))
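The position-based slicing can be checked on small in-memory frames, without an Excel file. This is only a sketch: count_rows and first_sheet_offset are hypothetical names, and the two tiny DataFrames stand in for the dict that pd.read_excel returns with sheet_name=None. The default offset of 3 follows the slice used above; adjust it to however many rows precede the data in your first sheet.

```python
import pandas as pd


def count_rows(sheets, first_sheet_offset=3):
    """Return one row count per sheet, dropping leading rows from the first sheet only."""
    counts = []
    for position, df in enumerate(sheets.values()):
        if position == 0:
            # Slice by position so the leading rows are not counted.
            df = df.iloc[first_sheet_offset:]
        counts.append(len(df))
    return counts


# Stand-in for the dict returned by pd.read_excel(..., sheet_name=None)
sheets = {
    "Sheet1": pd.DataFrame({0: range(10)}),
    "Sheet2": pd.DataFrame({0: range(6)}),
}
print(count_rows(sheets))       # [7, 6]
print(sum(count_rows(sheets)))  # 13
```

Because dicts keep insertion order, the first entry corresponds to the first sheet in the workbook.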