Python: merge dataframes and keep all values in cells if not identical

Question

I'm trying to merge multiple Excel files. Each file can have different dimensions. Some files share column names, and the data in those columns may be NULL, identical, or different. The script I wrote merges files with different dimensions and removes the duplicated columns, with the last value being dropped from the final cell. What I want instead is to concatenate the values when they are not equal, so that users can go through the conflicting data manually in Excel.

EXAMPLE:
User 1 has age = 24 in df table and age = 27 in df1. I'm trying to get both values in that cell in the final consolidated output.

INPUT:
df

user  age  team
1     24   x
2     56   y
3     32   z

import pandas as pd

df = pd.DataFrame({'user': ['1', '2', '3'],
                   'age': [24, 56, 32],
                   'team': ['x', 'y', 'z']})

df1

user  age  name
1     27   Ronald
2     NaN  Eugene
4     44   Jeff
5     61   Britney

import numpy as np

df1 = pd.DataFrame({'user': ['1', '2', '4', '5'],
                    'age': [27, np.nan, 44, 61],
                    'name': ['Ronald', 'Eugene', 'Jeff', 'Britney']})

EXPECTED OUTPUT:

CASES:

  1. Two identical values: keep one.

  2. One value is NaN: keep the non-NaN value.

  3. Two different values: concatenate them with a delimiter so the cell can be reviewed later. I will highlight it.

user  age     team  name
1     24||27  x     Ronald
2     56      y     Eugene
3     32      z     NaN
4     44      NaN   Jeff
5     61      NaN   Britney
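
The three cases above can be sketched as a small per-cell helper. This is only an illustration: the name `combine_values` and the `||` default separator are assumptions, and values are compared as strings on the assumption that the files are read with `dtype=str`.

```python
import pandas as pd


def combine_values(left, right, sep="||"):
    """Combine two cell values following the three cases (hypothetical helper)."""
    left_missing = pd.isna(left)
    right_missing = pd.isna(right)
    if left_missing and right_missing:
        return None           # nothing to keep
    if left_missing:
        return right          # case 2: keep the non-NaN value
    if right_missing:
        return left           # case 2: keep the non-NaN value
    if str(left) == str(right):
        return left           # case 1: identical values, keep one
    return f"{left}{sep}{right}"  # case 3: conflict, join for later review
```

For example, `combine_values(24, 27)` gives the joined string for user 1, while `combine_values(56, float("nan"))` simply returns 56.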

Here's what I have so far. Users drop files into a specified folder, and the script loops through all the Excel files. The first iteration appends the data to the df dataframe; every later iteration is a merge. The issue is that I'm getting values (if not null) from the last loop ONLY.

df = pd.DataFrame()

for excel_files in FILELIST:
    if excel_files.endswith(".xlsx"):
        df1 = pd.read_excel(FILEPATH_INPUT+excel_files, dtype=str)
        print(excel_files)

        if df.empty:
            df = df.append(df1)
        else:
            df = pd.merge(df,df1,on=UNIQUE_KEY,how=JOIN_METHOD,suffixes=('','_dupe'))
            df.drop([column for column in df.columns if '_dupe' in column],axis=1, inplace=True)

This is what the output looks like:

user  age  team  name
1     27   x     Ronald
2     56   y     Eugene
3     32   z     NaN
4     44   NaN   Jeff
5     61   NaN   Britney

I tried looping through the columns and concatenating them. I can see the combined values in df[new_col], but the df dataframe is not updated as expected and the final output shows NaN.

df = pd.DataFrame()

for excel_files in FILELIST:
    if excel_files.endswith(".xlsx"):
        df1 = pd.read_excel(FILEPATH_INPUT+excel_files, dtype=str)
        #df1.set_index('uid',inplace=True)
        print(excel_files)
        #print(df1)
        #print(df1.dtypes)

        if df.empty:
            df = df.append(df1)
        else:
            df = pd.merge(df,df1,on=UNIQUE_KEY,how=JOIN_METHOD,suffixes=('','_dupe'))
            #df.drop([column for column in df.columns if '_dupe' in column],axis=1, inplace=True)

            cols_to_remove = df.columns
            for column in cols_to_remove:
                if "_dupe" in column:
                    new_col = str(column).replace('_dupe','')

                    df[new_col] = df[new_col].str.cat(df[column],sep='||')
                    print('New Values: ',df[new_col])
                    df.pop(column)
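
One possible explanation for the NaN values, offered as a guess rather than a confirmed diagnosis: `Series.str.cat` returns NaN for any row where either side is missing unless `na_rep` is given. The small sketch below uses made-up series (`left`, `right`) to show the difference.

```python
import numpy as np
import pandas as pd

# Illustrative data only; not taken from the asker's files.
left = pd.Series(["24", "56", "32", np.nan])
right = pd.Series(["27", np.nan, np.nan, "44"])

# Without na_rep, a missing value on either side makes the whole result NaN.
plain = left.str.cat(right, sep="||")

# With na_rep="", missing values become empty strings; stripping the
# separator characters then removes the leftover "||" at either end.
filled = left.str.cat(right, sep="||", na_rep="").str.strip("|")

print(plain.tolist())
print(filled.tolist())
```

Note that this variant still joins two identical values (it does not deduplicate), and `str.strip("|")` would also trim a value that legitimately starts or ends with `|`.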

Any help will be appreciated. Thanks Raf

Answer 1

Score: 1

I would merge, then apply a groupby.agg on columns:

merged = df.merge(df1, on='user', how='outer', suffixes=('', '_dupe'))

out = (merged
 .groupby(merged.columns.str.replace('_dupe', ''), sort=False, axis=1)
 .agg('last')
)

Output:

  user   age  team     name
0    1  27.0     x   Ronald
1    2  56.0     y   Eugene
2    3  32.0     z     None
3    4  44.0  None     Jeff
4    5  61.0  None  Britney

Alternative output:

out = (merged
 .groupby(merged.columns.str.replace('_dupe', ''), sort=False, axis=1)
 .agg(lambda g: g.agg(lambda s: '|'.join(s.dropna().unique().astype(str)), axis=1))
)

Output:

  user        age team     name
0    1  24.0|27.0    x   Ronald
1    2       56.0    y   Eugene
2    3       32.0    z         
3    4       44.0          Jeff
4    5       61.0       Britney
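
`groupby(..., axis=1)` is deprecated since pandas 2.1, so here is a rough sketch of the same idea that avoids it by looping over the suffixed columns. The function names `join_distinct` and `merge_keep_conflicts` are hypothetical, and the ages are strings on the assumption that the files are read with `dtype=str` (which also avoids the `24.0|27.0` float formatting).

```python
import numpy as np
import pandas as pd


def join_distinct(values, sep="|"):
    """Join the distinct non-missing values of one row, in order of appearance."""
    distinct = list(dict.fromkeys(values.dropna().astype(str)))
    return sep.join(distinct) if distinct else np.nan


def merge_keep_conflicts(left, right, key, sep="|", suffix="_dupe"):
    """Outer-merge two frames and fold each suffixed column into its base column."""
    merged = left.merge(right, on=key, how="outer", suffixes=("", suffix))
    for dupe in [c for c in merged.columns if c.endswith(suffix)]:
        base = dupe[: -len(suffix)]
        merged[base] = merged[[base, dupe]].apply(
            lambda row: join_distinct(row, sep), axis=1
        )
        merged = merged.drop(columns=dupe)
    return merged


df = pd.DataFrame({"user": ["1", "2", "3"],
                   "age": ["24", "56", "32"],
                   "team": ["x", "y", "z"]})
df1 = pd.DataFrame({"user": ["1", "2", "4", "5"],
                    "age": ["27", np.nan, "44", "61"],
                    "name": ["Ronald", "Eugene", "Jeff", "Britney"]})

out = merge_keep_conflicts(df, df1, key="user")
print(out)
```

In the asker's loop, a call like `df = merge_keep_conflicts(df, df1, key=UNIQUE_KEY)` could stand in for the `pd.merge` and drop steps; unlike the `join` variant above, cells with no value at all stay NaN instead of becoming empty strings.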

huangapple
  • Posted on January 9, 2023, 18:47:04
  • When reposting, please keep the link to this article: https://go.coder-hub.com/75056155.html