Get Duplicated Rows in a DataFrame and Overwrite Them in Python

Question

I have the following DataFrame:

index  errorId  start                end                  timestamp   uniqueId
0      1404     2022-04-25 02:10:41  2022-04-25 02:10:46  2022-04-25  1404_2022-04-25
1      1302     2022-04-25 02:10:41  2022-04-25 02:10:46  2022-04-25  1302_2022-04-25
2      1404     2022-04-27 12:54:46  2022-04-27 12:54:51  2022-04-25  1404_2022-04-25
3      1302     2022-04-27 13:34:43  2022-04-27 13:34:50  2022-04-25  1302_2022-04-25
4      1404     2022-04-29 04:30:22  2022-04-29 04:30:29  2022-04-25  1404_2022-04-25
5      1302     2022-04-29 08:26:25  2022-04-29 08:26:32  2022-04-25  1302_2022-04-25

The uniqueId column is a combination of the errorId and timestamp columns.
I need to check whether the uniqueId column contains duplicate values. If it does, the row where the value first appears should be kept; in the example, for errorId 1404, that is the row at index 0. The value in that row's 'end' column should then be overwritten with the value from the row where the uniqueId last appears, here the row at index 4.
The same applies to errorId 1302.

In the end, it should look like this:

index  errorId  start                end                  timestamp   uniqueId
0      1404     2022-04-25 02:10:41  2022-04-29 04:30:29  2022-04-25  1404_2022-04-25
1      1302     2022-04-25 02:10:41  2022-04-29 08:26:32  2022-04-25  1302_2022-04-25

Answer 1

Score: 2

I think you need to aggregate min and max per group over the 3 key columns using named aggregation, then add DataFrame.reindex at the end to restore the original column order:

# Collapse each (errorId, timestamp, uniqueId) group to one row:
# earliest start, latest end, then restore the original column order.
df1 = (df.groupby(['errorId','timestamp','uniqueId'], as_index=False, sort=False)
         .agg(start=('start','min'), end=('end','max'))
         .reindex(df.columns, axis=1))
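
To try this on the data from the question, here is a minimal self-contained sketch. It assumes that the "index" column shown in the question is just the DataFrame's default index, that start and end are parsed as datetimes, and that timestamp is stored as a plain string; none of this is stated explicitly in the question.

```python
import pandas as pd

# Sample rows copied from the question. Assumption: "index" in the printed
# table is the default RangeIndex, not a real column.
df = pd.DataFrame({
    "errorId": [1404, 1302, 1404, 1302, 1404, 1302],
    "start": pd.to_datetime([
        "2022-04-25 02:10:41", "2022-04-25 02:10:41",
        "2022-04-27 12:54:46", "2022-04-27 13:34:43",
        "2022-04-29 04:30:22", "2022-04-29 08:26:25",
    ]),
    "end": pd.to_datetime([
        "2022-04-25 02:10:46", "2022-04-25 02:10:46",
        "2022-04-27 12:54:51", "2022-04-27 13:34:50",
        "2022-04-29 04:30:29", "2022-04-29 08:26:32",
    ]),
    "timestamp": ["2022-04-25"] * 6,  # assumption: kept as a string
})
df["uniqueId"] = df["errorId"].astype(str) + "_" + df["timestamp"]

# One row per group: earliest start, latest end.
# sort=False keeps groups in order of first appearance (1404, then 1302).
df1 = (df.groupby(["errorId", "timestamp", "uniqueId"], as_index=False, sort=False)
         .agg(start=("start", "min"), end=("end", "max"))
         .reindex(df.columns, axis=1))

print(df1)
```

With this input, df1 should contain two rows, one for 1404 and one for 1302, each with the start from its first row and the end from its last row, which matches the desired output in the question.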

Or aggregate by first and last; if the datetimes are sorted within each group, this gives the same output:

# Positional variant: first start and last end within each group.
df2 = (df.groupby(['errorId','timestamp','uniqueId'], as_index=False, sort=False)
         .agg(start=('start','first'), end=('end','last'))
         .reindex(df.columns, axis=1))
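
The sorting condition matters because first and last are positional, while min and max compare values. The sketch below illustrates the difference with hypothetical rows that are deliberately out of chronological order; the helper name collapse and the sample values are made up for illustration and are not part of the original answer.

```python
import pandas as pd

# Hypothetical single-group data, intentionally NOT sorted by time.
df = pd.DataFrame({
    "errorId": [1404, 1404, 1404],
    "start": pd.to_datetime([
        "2022-04-27 12:54:46", "2022-04-29 04:30:22", "2022-04-25 02:10:41",
    ]),
    "end": pd.to_datetime([
        "2022-04-27 12:54:51", "2022-04-29 04:30:29", "2022-04-25 02:10:46",
    ]),
    "timestamp": ["2022-04-25"] * 3,
    "uniqueId": ["1404_2022-04-25"] * 3,
})

keys = ["errorId", "timestamp", "uniqueId"]

def collapse(frame, how_start, how_end):
    """Hypothetical helper: one row per group with the given aggregations."""
    return (frame.groupby(keys, as_index=False, sort=False)
                 .agg(start=("start", how_start), end=("end", how_end))
                 .reindex(frame.columns, axis=1))

by_value = collapse(df, "min", "max")          # compares the datetimes
by_position = collapse(df, "first", "last")    # depends on row order
by_position_sorted = collapse(df.sort_values("start"), "first", "last")

print(by_value)
print(by_position)
print(by_position_sorted)
```

On unsorted input, first and last simply pick the first and last rows of each group, so the result can differ from min and max. Sorting by start before grouping is one way to make the two variants agree.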

huangapple
  • Posted on January 9, 2023, 17:51:08
  • When reposting, please keep the link to this article: https://go.coder-hub.com/75055530.html