寻找两个数据框中的变更条目

huangapple go评论63阅读模式
英文:

Find changed entries in 2 Data frames

问题

上个月的数据表:

| 用户名     | 电子邮件               | 类型   | 许可证数量 |
| ---      | ---                | --- | ---       |
| Joe Blue | joe@codeschool.co.uk | 管理员 | 1         |
| Rachel Green | racehel@codeschool.co.uk | 用户 | 2         |
| Shirly Brown | shirley@codeschool.co.uk | 管理员 | 1         |
| Jack Black | Jack@codeschool.co.uk | 管理员 | 1         |
| Cheryl Stone | Cherylt@codeschool.co.uk | 管理员 | 1         |

本月的数据表:

| 用户名     | 电子邮件               | 类型   | 许可证数量 |
| ---      | ---                | --- | ---       |
| Joe Blue | joe@codeschool.co.uk | 管理员 | 1         |
| Rachel Green | racehel@codeschool.co.uk | 用户 | 2         |
| Richard Red | Richard@codeschool.co.uk | 管理员 | 1         |
| Jack Black | Jack@codeschool.co.uk | 管理员 | 1         |

期望的输出:
DF_Added(已添加的用户):

| 用户名         | 电子邮件               | 类型   | 许可证数量 |
| ---          | ---                | --- | ---       |
| Richard Red | Richard@codeschool.co.uk | 管理员 | 1         |

DF_Removed(已移除的用户):

| 用户名            | 电子邮件               | 类型   | 许可证数量 |
| ---             | ---                | --- | ---       |
| Cheryl Stone | Cherylt@codeschool.co.uk | 管理员 | 1         |

您提供的代码中,您尝试使用Python的Pandas库来执行这些任务,但似乎有一些问题。如果您需要关于代码的帮助,可以提出具体问题,我将尽力协助您。

英文:

I have 2 Excel sheets generated monthly containing user data from an app we pay licenses for. I need to find a list of users who were added this month and a list of users removed from last month's spreadsheet. I need to compare the Email column as this is the only unique column in my data

Last months DF

| UserName | Email | Type | Number Of Licenses |
| --- | --- | --- | --- |
| Joe Blue | joe@codeschool.co.uk | Admin | 1 |
| Rachel Green | racehel@codeschool.co.uk | User | 2 |
| Shirly Brown | shirley@codeschool.co.uk | Adimin | 1 |
| Jack Black | Jack@codeschool.co.uk | Adimin | 1 |
| Cheryl Stone | Cherylt@codeschool.co.uk | Adimin | 1 |

This month's DF

| UserName | Email | Type | Number Of Licenses |
| --- | --- | --- | --- |
| Joe Blue | joe@codeschool.co.uk | Admin | 1 |
| Rachel Green | racehel@codeschool.co.uk | User | 2 |
| Richard Red | Richard@codeschool.co.uk | Adimin | 1 |
| Jack Black | Jack@codeschool.co.uk | Adimin | 1 |

Desired Outputs
DF_Added

| UserName | Email | Type | Number Of Licenses |
| --- | --- | --- | --- |
| Richard Red | Richard@codeschool.co.uk | Adimin | 1 |

DF_Removed

| UserName | Email | Type | Number Of Licenses |
| --- | --- | --- | --- |
| Cheryl Stone | Cherylt@codeschool.co.uk | Adimin | 1 |

I tried several ways but can't get the desired output.

My current not working code

```
LastMonth = Path.cwd() / "./tempwork/UserReport_June.xlsx"
ThisMonth = Path.cwd() / "./tempwork/UserReport_July.xlsx"
df_LastMonth = pd.read_excel(LastMonth)
df_ThisMonth = pd.read_excel(ThisMonth)
df_changed  = pd.merge(df_LastMonth, df_ThisMonth, on="Email", 
how='outer', indicator='Exist')
df_changed  = df_changed.loc[df_changed['Exist'] != 'both']
df_changed.to_excel("./tempwork/changed.xlsx")
```

答案1

得分: 1

可以使用.merge并带有indicator=参数。然后使用left_only/right_only指示器来过滤已添加/已删除的行。例如:

x = pd.merge(df_last, df_current, on='Email', how='outer', indicator=True)

# 已添加
added = x.loc[x['_merge'] == 'right_only', ['Email', 'UserName_y', 'Type_y', 'Number Of Licenses_y']]
added.columns = [c.split('_')[0] for c in added.columns]

# 已删除
removed = x.loc[x['_merge'] == 'left_only', ['Email', 'UserName_x', 'Type_x', 'Number Of Licenses_x']]
removed.columns = [c.split('_')[0] for c in removed.columns]

print(added)
print(removed)

打印结果:

                      Email     UserName    Type  Number Of Licenses
5  Richard@codeschool.co.uk  Richard Red  Adimin                 1.0

                      Email      UserName    Type  Number Of Licenses
2  shirley@codeschool.co.uk  Shirly Brown  Adimin                 1.0
4  Cherylt@codeschool.co.uk  Cheryl Stone  Adimin                 1.0
英文:

You can use .merge with indicator= parameter. Then use left_only/right_only indicator to filter rows that were added/removed. E.g.:

x = pd.merge(df_last, df_current, on='Email', how='outer', indicator=True)

# added
added = x.loc[x['_merge'] == 'right_only', ['Email', 'UserName_y', 'Type_y', 'Number Of Licenses_y']]
added.columns = [c.split('_')[0] for c in added.columns]

# removed
removed = x.loc[x['_merge'] == 'left_only', ['Email', 'UserName_x', 'Type_x', 'Number Of Licenses_x']]
removed.columns = [c.split('_')[0] for c in removed.columns]

print(added)
print(removed)

Prints:

                      Email     UserName    Type  Number Of Licenses
5  Richard@codeschool.co.uk  Richard Red  Adimin                 1.0

                      Email      UserName    Type  Number Of Licenses
2  shirley@codeschool.co.uk  Shirly Brown  Adimin                 1.0
4  Cherylt@codeschool.co.uk  Cheryl Stone  Adimin                 1.0

答案2

得分: 0

如果我正确理解了你的问题,你的问题意味着你需要第一个数据框存在但第二个数据框中不存在的行(这些人已取消订阅),以及第二个数据框存在但第一个数据框中不存在的行(这些人是新订阅的),为了执行这个任务,我们可以这样做:

# 查找第一个数据框中不存在于第二个数据框中的行(已取消订阅的用户)
un_subscribed_users = df1.merge(df2, on=list(df1.columns), how='left', indicator=True)
un_subscribed_users = (un_subscribed_users[un_subscribed_users['_merge'] == 'left_only']).drop(columns=["_merge"])

print(un_subscribed_users)

# 查找第二个数据框中不存在于第一个数据框中的行(新订阅的用户)
subscribed_users = df1.merge(df2, on=list(df1.columns), how='right', indicator=True)
subscribed_users = (subscribed_users[subscribed_users['_merge'] == 'right_only']).drop(columns=["_merge"])

print(subscribed_users)

只需微调/更改 on 参数以仅比较你希望用来比较记录的字段。

英文:

If I understood your question correctly, your question means you need the rows that exist in first dataframe but not in the second (those are whom unsubscribed), and also the rows that exist in the second dataframe but not in the first (those are whom newly subscribed), so to do this task we can do it as:

# Find rows in df1 that do not exist in df2
un_subscribed_users = df1.merge(df2, on=list(df1.columns), how='left', indicator=True)
un_subscribed_users = (un_subscribed_users[un_subscribed_users['_merge'] == 'left_only']).drop(columns=["_merge"])

print(un_subscribed_users)

# Find rows in df2 that do not exist in df1
subscribed_users = df1.merge(df2, on=list(df1.columns), how='right', indicator=True)
subscribed_users = (subscribed_users[subscribed_users['_merge'] == 'right_only']).drop(columns=["_merge"])

print(subscribed_users)

just tweak / change on parameter to only the field you wish to compare records with.

答案3

得分: 0

根据您的需求,以下是已翻译的代码部分:

one_df = pd.read_csv("one.csv", sep="|")
two_df = pd.read_csv("two.csv", sep="|")

total = pd.merge(one_df, two_df, how="outer", on="Email")
added_df = total[~total["Email"].isin(one_df["Email"])]
removed_df = total[~total["Email"].isin(two_df["Email"])]
print(added_df)
print(removed_df)

根据添加和删除的列表,您可以找出是否有任何更改。

英文:

As I understood you need added or removed susbscribers from the list.

So here I fist create the list with total subscribers from last month and this month and then kind of subtracted from the total to get the new added of removed. Filtered with isin.

one_df = pd.read_csv("one.csv", sep="|")
two_df = pd.read_csv("two.csv", sep="|")

total = pd.merge(one_df, two_df, how="outer", on="Email")
added_df = total[~total["Email"].isin(one_df["Email"])]
removed_df = total[~total["Email"].isin(two_df["Email"])]
print(added_df)
print(removed_df)

Based on added and removed list you can find out if anything is changed or not.

huangapple
  • 本文由 发表于 2023年7月17日 19:56:02
  • 转载请务必保留本文链接:https://go.coder-hub.com/76704213.html
匿名

发表评论

匿名网友

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen:

确定