List of list of DataFrame overwrites previous values (pandas, python)
Question
I have a set of Excel files from which I want to summarize some data. The data in each Excel file is spread over 5 sheets. I now want to create a new Excel file with 5 sheets, where each sheet summarizes the data from the corresponding sheet of all input files.
My approach is to create a list of lists of DataFrames, where each row collects the data from one sheet across all files. I then concatenate each row, so I end up with 5 DataFrames that I can write to the 5 sheets of a new Excel file. The code I wrote for this looks like:
import glob
import pandas as pd
from tkinter import filedialog


def select_base_path():
    root = filedialog.askdirectory(
        title='Select base path',
        mustexist=True)
    return root


if __name__ == '__main__':
    base_path = select_base_path()
    files = []
    for file in glob.glob(str(base_path) + '/**/10x 0.45 SFR average .xlsx', recursive=True):
        files.append(file)
    sheets = ['Center', 'north west', 'south west', 'north east', 'south east']
    Frames = [[pd.DataFrame()] * len(files)] * len(sheets)
    data_frames = [[]] * len(sheets)
    ids = []
    for k in range(len(files)):
        ids.append('Adapter ' + files[k][files[k].find('#'):files[k].find('#')+3])
    for i, file in enumerate(files):
        if i == 0:
            for j, sheet in enumerate(sheets):
                if sheet == 'Center':
                    Frames[j][i] = pd.read_excel(io=file, sheet_name=sheet, header=2, usecols='A,E,G,K,M,Q,S,W')
                else:
                    Frames[j][i] = pd.read_excel(io=file, sheet_name=sheet, header=2, usecols='A,E,G,K')
        else:
            for j, sheet in enumerate(sheets):
                if sheet == 'Center':
                    Frames[j][i] = pd.read_excel(io=file, sheet_name=sheet, header=2, usecols='E,K,Q,W')
                else:
                    Frames[j][i] = pd.read_excel(io=file, sheet_name=sheet, header=2, usecols='E,K')
    for m in range(len(sheets)):
        data_frames[m] = pd.concat(Frames[m], axis=1, keys=ids)
The problem is that when the loop iterates over the sheets, it does not write to the single location Frames[j, i] in the list of lists of DataFrames. Instead, it writes the data to Frames[:, i], so the data is overwritten on every pass through the sheets. As a result, all the slices at index i are identical at the end.
In the debugger, after the first pass (i=0, j=0), I already see data in Frames[:, i]. I expected data only in Frames[j, i]. Where is my misconception here?
Answer 1
Score: 2
A list comprehension evaluates its expression anew for each element, so every slot gets its own unrelated object, which avoids the observed problem:
Frames2 = [[pd.DataFrame() for i in range(len(files))]
           for j in range(len(sheets))]
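To see why the original construction misbehaves, here is a minimal sketch that uses plain placeholders instead of DataFrames. The sizes `n_files` and `n_sheets` and the string values are made up for illustration. Multiplying a list copies references, not objects, so `[[...] * n_files] * n_sheets` contains the same inner list `n_sheets` times, and an assignment through any row shows up in all of them.

```python
# Hypothetical sizes standing in for len(files) and len(sheets).
n_files, n_sheets = 3, 5

# Outer multiplication repeats a reference to ONE inner list.
aliased = [[None] * n_files] * n_sheets
aliased[0][0] = "Center / file 0"
shared_rows = all(row is aliased[0] for row in aliased)
first_column = [row[0] for row in aliased]

# A comprehension builds a fresh inner list for every row.
independent = [[None] * n_files for _ in range(n_sheets)]
independent[0][0] = "Center / file 0"
fixed_column = [row[0] for row in independent]

print(shared_rows)   # every row is the same list object
print(first_column)  # the assigned value appears in all rows
print(fixed_column)  # the assigned value appears in row 0 only
```

Only the outer multiplication causes trouble in the question's code. The inner `[pd.DataFrame()] * len(files)` and `data_frames = [[]] * len(sheets)` also share references, but their elements are replaced by assignment rather than mutated in place, so the sharing has no visible effect there.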