Python: merge dataframes and keep all values in cells if not identical
Question
I'm trying to merge multiple Excel files. Each file can have different dimensions, and some files share column names whose values may be NULL, identical, or different. The script I wrote merges the files and removes the duplicated columns, but only one of the conflicting values survives in the final cell. Instead, I'd like to concatenate the values when they are not equal, so that users can manually review the duplicated data in Excel.
EXAMPLE:
User 1 has age = 24 in df table and age = 27 in df1. I'm trying to get both values in that cell in the final consolidated output.
INPUT:
df
user | age | team |
---|---|---|
1 | 24 | x |
2 | 56 | y |
3 | 32 | z |
    import pandas as pd

    df = pd.DataFrame({'user': ['1', '2', '3'],
                       'age': [24, 56, 32],
                       'team': ['x', 'y', 'z']})
df1
user | age | name |
---|---|---|
1 | 27 | Ronald |
2 | NaN | Eugene |
4 | 44 | Jeff |
5 | 61 | Britney |
    import numpy as np

    df1 = pd.DataFrame({'user': ['1', '2', '4', '5'],
                        'age': [27, np.nan, 44, 61],
                        'name': ['Ronald', 'Eugene', 'Jeff', 'Britney']})
EXPECTED OUTPUT:
CASES:
- two identical values: keep one
- one value is NaN: keep the non-NaN value
- two different values: concatenate with a delimiter so it can be reviewed later. I will highlight it.
user | age | team | name |
---|---|---|---|
1 | 24\|\|27 | x | Ronald |
2 | 56 | y | Eugene |
3 | 32 | z | NaN |
4 | 44 | NaN | Jeff |
5 | 61 | NaN | Britney |
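To make the three cases above concrete, here is a minimal per-cell sketch. The function name `combine_cells` and the `||` separator are assumptions chosen for illustration; they do not come from the script itself.

```python
import pandas as pd


def combine_cells(left, right, sep="||"):
    """Combine two cell values following the three cases listed above."""
    if pd.isna(left):
        return right                 # one side missing: keep the other side
    if pd.isna(right):
        return left
    if left == right:
        return left                  # identical values: keep one
    return f"{left}{sep}{right}"     # different values: join for later review


print(combine_cells("24", "27"))   # 24||27
print(combine_cells("56", None))   # 56
print(combine_cells("x", "x"))     # x
```

If both sides are missing, the function simply returns the missing value, which matches the NaN cells in the expected table.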
Here's what I have so far. Users drop files into a specified folder, and the script loops through all the Excel files. The first iteration loads the data into the df dataframe; every later iteration is a merge. The issue is that I'm only getting the values (if not null) from the last loop.
    import pandas as pd

    df = pd.DataFrame()
    for excel_files in FILELIST:
        if excel_files.endswith(".xlsx"):
            df1 = pd.read_excel(FILEPATH_INPUT + excel_files, dtype=str)
            print(excel_files)
            if df.empty:
                # First file: start from its contents (DataFrame.append was removed in pandas 2.0)
                df = df1
            else:
                df = pd.merge(df, df1, on=UNIQUE_KEY, how=JOIN_METHOD, suffixes=('', '_dupe'))
                # Drop the duplicated columns that came from the newly merged file
                df.drop([column for column in df.columns if '_dupe' in column], axis=1, inplace=True)
That's what the OUTPUT looks like
user | age | team | name |
---|---|---|---|
1 | 27 | x | Ronald |
2 | 56 | y | Eugene |
3 | 32 | z | NaN |
4 | 44 | NaN | Jeff |
5 | 61 | NaN | Britney |
I tried looping through the columns and concatenating them. I can see the combined values in df[new_col], but it fails to update the df dataframe, and the final output shows NaN.
    import pandas as pd

    df = pd.DataFrame()
    for excel_files in FILELIST:
        if excel_files.endswith(".xlsx"):
            df1 = pd.read_excel(FILEPATH_INPUT + excel_files, dtype=str)
            #df1.set_index('uid',inplace=True)
            print(excel_files)
            #print(df1)
            #print(df1.dtypes)
            if df.empty:
                # First file: start from its contents (DataFrame.append was removed in pandas 2.0)
                df = df1
            else:
                df = pd.merge(df, df1, on=UNIQUE_KEY, how=JOIN_METHOD, suffixes=('', '_dupe'))
                #df.drop([column for column in df.columns if '_dupe' in column],axis=1, inplace=True)
                cols_to_remove = df.columns
                for column in cols_to_remove:
                    if "_dupe" in column:
                        # Fold each '_dupe' column back into its original column
                        new_col = str(column).replace('_dupe', '')
                        df[new_col] = df[new_col].str.cat(df[column], sep='||')
                        print('New Values: ', df[new_col])
                        df.pop(column)
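One likely reason for the NaN results: by default, `Series.str.cat` returns NaN for any row where either side is missing. The sketch below reproduces that on two made-up series (`left` and `right` are illustrative names, not part of the script) and shows one possible workaround that only concatenates where both values are present and differ. It assumes string values, as produced by `dtype=str`.

```python
import numpy as np
import pandas as pd

left = pd.Series(["24", "56", np.nan])
right = pd.Series(["27", np.nan, "44"])

# Default behaviour: a NaN on either side makes the whole result NaN.
naive = left.str.cat(right, sep="||")
print(naive.tolist())      # ['24||27', nan, nan]

# Possible workaround (an assumption, not the only option):
# fill gaps from the other side, then overwrite only the real conflicts.
conflict = left.notna() & right.notna() & (left != right)
combined = left.fillna(right).mask(conflict, left + "||" + right)
print(combined.tolist())   # ['24||27', '56', '44']
```

Passing `na_rep=''` to `str.cat` would also avoid the NaN, but it leaves a dangling separator (for example `56||`) and still joins identical values, so the mask-based version is closer to the three cases described above.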
Any help will be appreciated. Thanks Raf
Answer 1
Score: 1
I would merge, then apply a groupby.agg on columns:

    merged = df.merge(df1, on='user', how='outer', suffixes=('', '_dupe'))
    out = (merged
           .groupby(merged.columns.str.replace('_dupe', ''), sort=False, axis=1)
           .agg('last')
           )
Output:
      user   age  team     name
    0    1  27.0     x   Ronald
    1    2  56.0     y   Eugene
    2    3  32.0     z     None
    3    4  44.0  None     Jeff
    4    5  61.0  None  Britney
Alternative output, joining the distinct non-NaN values of each group:

    out = (merged
           .groupby(merged.columns.str.replace('_dupe', ''), sort=False, axis=1)
           .agg(lambda g: g.agg(lambda s: '|'.join(s.dropna().unique().astype(str)), axis=1))
           )
Output:
      user        age team     name
    0    1  24.0|27.0    x   Ronald
    1    2       56.0    y   Eugene
    2    3       32.0    z
    3    4       44.0          Jeff
    4    5       61.0       Britney
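For the multi-file loop in the question, the same idea can be folded over any number of dataframes. The sketch below is only one way to do it, under a few assumptions: all values are strings (as with `dtype=str`), the key column is `user`, and the helper name `merge_keep_conflicts` is made up for illustration. It avoids `groupby(..., axis=1)`, which as far as I know is deprecated from pandas 2.1 onward.

```python
from functools import reduce

import numpy as np
import pandas as pd


def merge_keep_conflicts(left, right, key="user", sep="|"):
    """Outer-merge two frames, joining conflicting cells with `sep`."""
    merged = left.merge(right, on=key, how="outer", suffixes=("", "_dupe"))
    for dupe in [c for c in merged.columns if c.endswith("_dupe")]:
        base = dupe[: -len("_dupe")]
        a, b = merged[base], merged.pop(dupe)
        # A conflict is a row where both values exist and differ.
        conflict = a.notna() & b.notna() & (a != b)
        merged[base] = a.fillna(b).mask(conflict, a + sep + b)
    return merged


# In-memory stand-ins for the Excel files, with string values.
df = pd.DataFrame({"user": ["1", "2", "3"],
                   "age": ["24", "56", "32"],
                   "team": ["x", "y", "z"]})
df1 = pd.DataFrame({"user": ["1", "2", "4", "5"],
                    "age": ["27", np.nan, "44", "61"],
                    "name": ["Ronald", "Eugene", "Jeff", "Britney"]})

out = reduce(merge_keep_conflicts, [df, df1])
print(out)
```

Two limitations worth noting: a column whose real name ends in `_dupe` would be misread as a duplicate, and with three or more files a value that already appears inside a joined cell (such as `24|27`) is appended again rather than deduplicated.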