Get Duplicated Rows in Dataframe and Overwrite them Python

Question
I have the following Dataframe:
index | errorId | start | end | timestamp | uniqueId |
---|---|---|---|---|---|
0 | 1404 | 2022-04-25 02:10:41 | 2022-04-25 02:10:46 | 2022-04-25 | 1404_2022-04-25 |
1 | 1302 | 2022-04-25 02:10:41 | 2022-04-25 02:10:46 | 2022-04-25 | 1302_2022-04-25 |
2 | 1404 | 2022-04-27 12:54:46 | 2022-04-27 12:54:51 | 2022-04-25 | 1404_2022-04-25 |
3 | 1302 | 2022-04-27 13:34:43 | 2022-04-27 13:34:50 | 2022-04-25 | 1302_2022-04-25 |
4 | 1404 | 2022-04-29 04:30:22 | 2022-04-29 04:30:29 | 2022-04-25 | 1404_2022-04-25 |
5 | 1302 | 2022-04-29 08:26:25 | 2022-04-29 08:26:32 | 2022-04-25 | 1302_2022-04-25 |
The uniqueId is a combination of the columns errorId and timestamp.
The column 'uniqueId' should be checked for duplicate values. If duplicates exist, the row where the value appears for the first time should be kept; for errorId 1404 in the example, that is the row at index 0. Afterwards, the value in the column 'end' should be overwritten with the value from the row where the uniqueId appears for the last time; here, that is index 4.
The same applies to errorId 1302.
In the end, the result should look like this:
index | errorId | start | end | timestamp | uniqueId |
---|---|---|---|---|---|
0 | 1404 | 2022-04-25 02:10:41 | 2022-04-29 04:30:29 | 2022-04-25 | 1404_2022-04-25 |
1 | 1302 | 2022-04-25 02:10:41 | 2022-04-29 08:26:32 | 2022-04-25 | 1302_2022-04-25 |
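For reference, the frame above can be rebuilt with the following minimal sketch; the dtypes (parsed datetimes for `start`/`end`, `timestamp` kept as a string) are assumptions, since the question only shows rendered values:

```python
import pandas as pd

# Minimal sketch rebuilding the example frame (dtypes are assumptions:
# start/end parsed as datetimes, timestamp left as a string)
df = pd.DataFrame({
    'errorId': [1404, 1302, 1404, 1302, 1404, 1302],
    'start': pd.to_datetime(['2022-04-25 02:10:41', '2022-04-25 02:10:41',
                             '2022-04-27 12:54:46', '2022-04-27 13:34:43',
                             '2022-04-29 04:30:22', '2022-04-29 08:26:25']),
    'end': pd.to_datetime(['2022-04-25 02:10:46', '2022-04-25 02:10:46',
                           '2022-04-27 12:54:51', '2022-04-27 13:34:50',
                           '2022-04-29 04:30:29', '2022-04-29 08:26:32']),
    'timestamp': ['2022-04-25'] * 6,
})
# uniqueId is derived from errorId and timestamp, as shown in the table
df['uniqueId'] = df['errorId'].astype(str) + '_' + df['timestamp']
```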
Answer 1

Score: 2
I think you need to aggregate `min` and `max` per the 3 grouping columns with named aggregation; lastly, to restore the same column order as the original, add `DataFrame.reindex`:
```python
df1 = (df.groupby(['errorId', 'timestamp', 'uniqueId'], as_index=False, sort=False)
         .agg(start=('start', 'min'), end=('end', 'max'))
         .reindex(df.columns, axis=1))
```
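Applied to the example data, `df1` should match the expected result from the question (values below are taken from the question's desired output):

```python
print(df1)
#    errorId               start                 end   timestamp         uniqueId
# 0     1404 2022-04-25 02:10:41 2022-04-29 04:30:29  2022-04-25  1404_2022-04-25
# 1     1302 2022-04-25 02:10:41 2022-04-29 08:26:32  2022-04-25  1302_2022-04-25
```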
Or aggregate with `first` and `last`; if the datetimes are sorted within each group, this gives the same output:
```python
df2 = (df.groupby(['errorId', 'timestamp', 'uniqueId'], as_index=False, sort=False)
         .agg(start=('start', 'first'), end=('end', 'last'))
         .reindex(df.columns, axis=1))
```
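A sketch that follows the question's wording more literally (keep the first occurrence of each uniqueId, but overwrite its 'end' with the group's last value) could look like this; like `df2`, it assumes the rows are already ordered within each group:

```python
# Overwrite 'end' with the last value per uniqueId group, then keep the
# first occurrence of each uniqueId (assumes rows are ordered within groups)
df3 = df.copy()
df3['end'] = df3.groupby('uniqueId')['end'].transform('last')
df3 = df3.drop_duplicates('uniqueId', keep='first').reset_index(drop=True)
```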