January 9, 2023 18:47:04

Python: merge dataframes and keep all values in cells if not identical

Question

I'm trying to merge multiple Excel files. Each file may have different dimensions. Some files share column names, and the data in those columns may be NULL, identical, or different. The script I wrote merges files with different dimensions and removes the duplicated columns, with the last value being dropped from the final cell. Instead, I want to concatenate the values when they are not equal, so that users can manually review the duplicated data in Excel.

EXAMPLE:
User 1 has age = 24 in df and age = 27 in df1. I want both values to appear in that cell in the final consolidated output.

INPUT:
df

user	age	team
1	24	x
2	56	y
3	32	z

df = pd.DataFrame({'user': ['1', '2', '3'],
                    'age': [24, 56, 32],
                    'team': ['x', 'y', 'z']})

df1

user	age	name
1	27	Ronald
2	NaN	Eugene
4	44	Jeff
5	61	Britney

df1 = pd.DataFrame({'user': ['1', '2', '4', '5'],
                    'age': [27, float('nan'), 44, 61],
                    'name': ['Ronald', 'Eugene', 'Jeff', 'Britney']})

EXPECTED OUTPUT:

CASES:

two identical values: keep one
one value is NaN: keep the non-NaN value
two different values: concatenate them with a delimiter so they can be reviewed later. I will highlight them.

user	age	team	name
1	24|27	x	Ronald
2	56	y	Eugene
3	32	z	NaN
4	44	NaN	Jeff
5	61	NaN	Britney
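
The three cases listed above can be written down as a small cell-level rule. The following is only a sketch: the helper name `combine_cell` and the `|` delimiter are illustrative choices, not something from the original post.

```python
import pandas as pd

def combine_cell(a, b, sep="|"):
    """Combine two cell values following the three cases (hypothetical helper)."""
    if pd.isna(a):
        return b              # one value missing: keep the other one
    if pd.isna(b):
        return a
    if str(a) == str(b):
        return a              # identical values: keep one
    return f"{a}{sep}{b}"     # different values: keep both for later review

print(combine_cell("24", "27"))  # two different values
print(combine_cell("56", None))  # one value missing
print(combine_cell("32", "32"))  # two identical values
```

Values are compared as strings here because the files are read with `dtype=str`; a numeric comparison would need a different equality check.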

Here's what I have so far. Users drop files into a specified folder, and the script loops through all the Excel files. The first iteration loads the data into the df dataframe; every subsequent iteration is a merge. The issue is that I'm getting values (if not null) from the last iteration only.

import pandas as pd

# FILELIST, FILEPATH_INPUT, UNIQUE_KEY and JOIN_METHOD are assumed to be defined elsewhere.
df = pd.DataFrame()
for excel_files in FILELIST:
    if excel_files.endswith(".xlsx"):
        df1 = pd.read_excel(FILEPATH_INPUT + excel_files, dtype=str)
        print(excel_files)
        if df.empty:
            # The first file seeds the result (DataFrame.append was removed in pandas 2.0).
            df = df1
        else:
            df = pd.merge(df, df1, on=UNIQUE_KEY, how=JOIN_METHOD, suffixes=('', '_dupe'))
            # Drop every suffixed duplicate column.
            df.drop([column for column in df.columns if '_dupe' in column], axis=1, inplace=True)

This is what the output looks like:

user	age	team	name
1	27	x	Ronald
2	56	y	Eugene
3	32	z	NaN
4	44	NaN	Jeff
5	61	NaN	Britney

I tried looping through the columns and concatenating them. I can see the combined values in df[new_col], but the df dataframe is not updated and the final output shows NaN.

df = pd.DataFrame()
for excel_files in FILELIST:
    if excel_files.endswith(".xlsx"):
        df1 = pd.read_excel(FILEPATH_INPUT + excel_files, dtype=str)
        print(excel_files)
        if df.empty:
            # The first file seeds the result (DataFrame.append was removed in pandas 2.0).
            df = df1
        else:
            df = pd.merge(df, df1, on=UNIQUE_KEY, how=JOIN_METHOD, suffixes=('', '_dupe'))
            cols_to_remove = df.columns
            for column in cols_to_remove:
                if "_dupe" in column:
                    new_col = str(column).replace('_dupe', '')
                    # Concatenate the original and the duplicate values with a delimiter.
                    df[new_col] = df[new_col].str.cat(df[column], sep='||')
                    print('New Values: ', df[new_col])
                    df.pop(column)
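
One likely cause of the NaN values, inferred from the documented behaviour of `Series.str.cat` rather than confirmed in the thread: when no `na_rep` is given, any row where either side is missing comes back as NaN. The sketch below reproduces that with made-up series standing in for `age` and `age_dupe`, and shows one possible workaround. The helper name `cat_keep_all` is hypothetical.

```python
import pandas as pd

# Stand-ins for the 'age' and 'age_dupe' columns after an outer merge.
left = pd.Series(["24", "56", "32", None, None], dtype=object)
right = pd.Series(["27", None, None, "44", "61"], dtype=object)

# Without na_rep, str.cat yields NaN wherever either operand is missing.
naive = left.str.cat(right, sep="||")
print(naive)

def cat_keep_all(left, right, sep="||"):
    """Keep the non-null value, and join with sep only when both exist and differ."""
    combined = left.where(left.notna(), right)
    differ = left.notna() & right.notna() & (left != right)
    return combined.where(~differ, left.str.cat(right, sep=sep))

fixed = cat_keep_all(left, right)
print(fixed)
```

Only the first row has two differing non-null values, so it is the only one that gets the delimiter; every other row keeps whichever value is present.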

Any help will be appreciated. Thanks, Raf

Answer 1

Score: 1

I would merge, then apply a groupby.agg on columns:

# Outer-merge on the key; overlapping columns from df1 get the '_dupe' suffix.
merged = df.merge(df1, on='user', how='outer', suffixes=('', '_dupe'))
# Group each column with its '_dupe' twin and keep the last non-null value per row.
out = (merged
 .groupby(merged.columns.str.replace('_dupe', ''), sort=False, axis=1)
 .agg('last')
)

Output:

  user   age  team     name
0    1  27.0     x   Ronald
1    2  56.0     y   Eugene
2    3  32.0     z     None
3    4  44.0  None     Jeff
4    5  61.0  None  Britney

Alternative output:

# Join the distinct non-null values of each column group with '|'.
out = (merged
 .groupby(merged.columns.str.replace('_dupe', ''), sort=False, axis=1)
 .agg(lambda g: g.agg(lambda s: '|'.join(s.dropna().unique().astype(str)), axis=1))
)

Output:

  user        age team     name
0    1  24.0|27.0    x   Ronald
1    2       56.0    y   Eugene
2    3       32.0    z         
3    4       44.0          Jeff
4    5       61.0       Britney
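
Note that `groupby(..., axis=1)` has been deprecated since pandas 2.1. The following is a sketch of the same idea without it, written so that it can be folded over a list of frames as in the question's loop. The names `join_unique` and `merge_keep_all` are hypothetical, and the sample frames use string values to mirror `dtype=str`.

```python
from functools import reduce

import pandas as pd

def join_unique(row, sep="|"):
    """Join the distinct non-null values of one row, preserving their order."""
    values = row.dropna().astype(str)
    return sep.join(dict.fromkeys(values))  # dict.fromkeys de-duplicates in order

def merge_keep_all(left, right, key="user", sep="|"):
    """Outer-merge two frames and fold each '_dupe' column into its original."""
    merged = left.merge(right, on=key, how="outer", suffixes=("", "_dupe"))
    for dupe in [c for c in merged.columns if c.endswith("_dupe")]:
        base = dupe[: -len("_dupe")]
        merged[base] = merged[[base, dupe]].apply(
            lambda row: join_unique(row, sep), axis=1
        )
        merged = merged.drop(columns=dupe)
    return merged

df = pd.DataFrame({"user": ["1", "2", "3"],
                   "age": ["24", "56", "32"],
                   "team": ["x", "y", "z"]})
df1 = pd.DataFrame({"user": ["1", "2", "4", "5"],
                    "age": ["27", None, "44", "61"],
                    "name": ["Ronald", "Eugene", "Jeff", "Britney"]})

# reduce applies merge_keep_all pairwise, so the list can hold any number of frames.
out = reduce(merge_keep_all, [df, df1])
print(out)
```

One limitation of this sketch: once a cell holds a joined value such as `24|27`, a later file that contains `24` again would be appended as a third entry, because the comparison is made on whole cell strings. Splitting on the delimiter before de-duplicating would be one way around that.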
