Python: loop over the sheets of an Excel file and change the header row number

Question

I have a loop that counts the rows in each sheet of an .xls file. When I open the file itself, the counts do not match what Python returns.

This is because the header of the first sheet is in row 3. How can I alter my code so that, for the first sheet ONLY, reading starts at row 3 and the first two lines are ignored? The rest of my sheets ALWAYS start at the top row and contain no header. I would like the length of my first sheet to be counted without the header.

However, when I open the Excel file and count the rows in each sheet, I get:

    65522, header starts in row 3, expecting a count of 65520
    65520
    65520
    65520
    65520
    65520
    65520
    65520
    65520
    65520
    65520
    25427

My full code:

    from io import BytesIO
    from pathlib import Path
    from zipfile import ZipFile
    import os
    import pandas as pd
    from os import walk

    def process_files(files: list) -> pd.DataFrame:
        file_mapping = {}
        for file in files:
            #data_mapping = pd.read_excel(BytesIO(ZipFile(file).read(Path(file).stem)), sheet_name=None)
            archive = ZipFile(file)
            # find file names in the archive which end in `.xls`, `.xlsx`, `.xlsb`, ...
            files_in_archive = archive.namelist()
            excel_files_in_archive = [
                f for f in files_in_archive if Path(f).suffix[:4] == ".xls"
            ]
            # ensure we only have one file (otherwise, loop or choose one somehow)
            assert len(excel_files_in_archive) == 1
            # read in data
            data_mapping = pd.read_excel(
                BytesIO(archive.read(excel_files_in_archive[0])),
                sheet_name=None, header=None,
            )
            row_counts = []
            for sheet in list(data_mapping.keys()):
                if sheet == 'Sheet1':
                    df = data_mapping.get(sheet)[3:]
                else:
                    df = data_mapping.get(sheet)
                row_counts.append(len(df))
                print(len(data_mapping.get(sheet)))
            file_mapping.update({file: sum(row_counts)})
        frame = pd.DataFrame([file_mapping]).transpose().reset_index()
        frame.columns = ["file_name", "row_counts"]
        return frame

    dir_path = r'D:\test22 - 10'
    zip_files = []
    for root, dirs, files in os.walk(dir_path):
        for file in files:
            if file.endswith('.zip'):
                zip_files.append(os.path.join(root, file))
    df = process_files(zip_files) #function

Does anyone have an idea what I'm doing wrong?

Answer 1

Score: 4

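You just need to use the skiprows argument:
https://pandas.pydata.org/docs/reference/api/pandas.read_excel.html

Note that with sheet_name=None the same skiprows value is applied to every sheet, so this fits only if every sheet should lose its first two rows.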

    # read in data
    data_mapping = pd.read_excel(
        BytesIO(archive.read(excel_files_in_archive[0])),
        sheet_name=None, header=None, skiprows=2
    )
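
If only the first sheet should lose its leading rows, one option is to open the workbook once and parse each sheet with its own arguments. The following is a minimal sketch, not the answerer's code: the names open_workbook and count_rows_per_sheet and the parameter first_sheet_skiprows are hypothetical, and it assumes (as in the question) that the first sheet has two junk rows followed by a header row, while all other sheets have no header.

```python
from io import BytesIO

import pandas as pd


def open_workbook(data: bytes) -> pd.ExcelFile:
    """Wrap raw Excel bytes (e.g. read from a zip archive) in a pandas.ExcelFile."""
    return pd.ExcelFile(BytesIO(data))


def count_rows_per_sheet(workbook, first_sheet_skiprows=2):
    """Return {sheet_name: data_row_count} for an opened workbook.

    `workbook` is expected to behave like a pandas.ExcelFile, i.e. to expose
    `sheet_names` and `parse(...)`. The layout (junk rows plus a header on
    the first sheet only) is an assumption taken from the question.
    """
    counts = {}
    for position, name in enumerate(workbook.sheet_names):
        if position == 0:
            # Skip the leading rows, then treat the next row as the header
            # so that it is not counted as data.
            frame = workbook.parse(name, skiprows=first_sheet_skiprows, header=0)
        else:
            # The remaining sheets have no header: every row is data.
            frame = workbook.parse(name, header=None)
        counts[name] = len(frame)
    return counts
```

With the names from the question, this could be called as count_rows_per_sheet(open_workbook(archive.read(excel_files_in_archive[0]))), and sum(counts.values()) would then give the total for the file.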

Alternatively, leave out skiprows and slice the first sheet's DataFrame directly:

    row_counts = []
    for sheet in list(data_mapping.keys()):
        if sheet == 'name of first sheet':
            df = data_mapping.get(sheet)[3:]
        else:
            df = data_mapping.get(sheet)
        row_counts.append(len(df))
        print(len(df))  # report the adjusted count, not the raw sheet length

    # Or slice based on the position of the sheet; you don't need to call list() on .keys().
    # Use this loop instead of the one above, not in addition to it.
    row_counts = []
    for i, sheet in enumerate(data_mapping.keys()):  # enumerate yields (index, key)
        if i == 0:
            df = data_mapping.get(sheet)[3:]
        else:
            df = data_mapping.get(sheet)
        row_counts.append(len(df))
        print(len(df))
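
The position-based slicing can also be wrapped in a small helper that works on the dict returned by pd.read_excel(..., sheet_name=None, header=None). This is only a sketch under stated assumptions: the name count_data_rows and the parameter first_sheet_offset are made up for illustration, and the in-memory frames stand in for real sheets. The offset of 3 mirrors the [3:] slice above; whether 2 or 3 is right depends on whether the header row itself should be counted.

```python
import pandas as pd


def count_data_rows(sheets, first_sheet_offset=3):
    """Sum the rows across all sheets, dropping the first
    `first_sheet_offset` rows of the first sheet only.

    `sheets` is a dict of DataFrames keyed by sheet name.
    """
    total = 0
    for position, frame in enumerate(sheets.values()):
        if position == 0:
            # Only the first sheet carries junk rows and a header.
            frame = frame.iloc[first_sheet_offset:]
        total += len(frame)
    return total


# Small in-memory stand-in for a workbook with two sheets (hypothetical data).
sheets = {
    "Sheet1": pd.DataFrame([["junk"], ["junk"], ["header"], [1], [2]]),
    "Sheet2": pd.DataFrame([[10], [20], [30]]),
}
print(count_data_rows(sheets))  # 2 rows from Sheet1 + 3 rows from Sheet2 = 5
```

Keying the special case on position rather than on the sheet name avoids hard-coding 'Sheet1', which matters when the workbooks come from different sources.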

huangapple
  • Posted on January 9, 2023 at 09:18:39
  • When reposting, please keep the link to this article: https://go.coder-hub.com/75052425.html