If a column value matches between 2 rows, keep the newer row with a value in a third column based on a condition
Question
I have the following pandas DataFrame:
id time type
1 t1 create
1 t2 modify
2 t3 modify
2 t4 deploy
3 t5 delete
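For anyone who wants to reproduce this, the table above can be built roughly as follows. This is only a sketch: the values t1 to t5 are placeholder strings copied from the table, and real data would presumably hold timestamps instead.

```python
import pandas as pd

# Sample data copied from the table above; the time values are
# placeholder strings, listed in increasing order.
df = pd.DataFrame(
    {
        'id': [1, 1, 2, 2, 3],
        'time': ['t1', 't2', 't3', 't4', 't5'],
        'type': ['create', 'modify', 'modify', 'deploy', 'delete'],
    }
)
print(df)
```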
The time is increasing, with t2 newer than t1, t3 newer than t2, etc. If an id has a row with type modify, then there will be another row for that id with a type that is not modify. Otherwise, there will be only one row, with a type other than modify. For the 2 rows of an id that has a modify, I want to keep whichever row is newer (larger tn), but with the type value that is not modify. So, I want the table above to be changed to the following table:
id time type
1 t2 create
2 t4 deploy
3 t5 delete
Where the ids match, I want to keep the row with the newest time, but with the type that is not modify. In other words, if there is an id without a modify, then there will be one row and I want to keep it. If there is a modify, then there will be another row with the same id. I want to keep whichever row is the newer of the pair, and if the modify is newer, I want to overwrite the type of modify with the value that is in the type column of the older row.
I know how to keep the newer row for each id, but I don't know how to ensure that, when the non-modify row is the older one, its type value is copied into the newer row in place of modify, the newer row is kept, and the older row is dropped. Can someone show me how to do this?
This is all I have been able to come up with:
df.loc[df.groupby('id')['time'].idxmax(), type != modify]
It looks like I can simplify my question down to:
id time type
1 t1 create
1 t2 modify
Based on the previous df, I want to produce the following:
id time type
1 t2 create
Answer 1
Score: 1
IIUC, for each group of id you want to find:
1. The latest time value. I suspect in reality these are something like timestamps, but in this example they're simple strings.
2. The type value that is not modify.
We can use .groupby to accomplish both of these, but the most efficient solution will be to use .groupby to solve (1), and then use a simple filter for (2).
EDIT: .drop_duplicates('id', keep='last') will also work to answer (1) and will probably be faster than .groupby(...).last().
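out = (
    # (1) last time per id; rows are already in time order
    df.groupby('id')[['time']].last()
    .assign(
        # (2) the non-modify type per id, aligned on the id index
        type=df.set_index('id').loc[lambda d: d['type'] != 'modify', 'type']
    )
)
print(out)
   time    type
id
1    t2  create
2    t4  deploy
3    t5  delete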
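The .drop_duplicates('id', keep='last') variant mentioned in the EDIT could look roughly like this. It is a sketch, not a benchmarked replacement: the intermediate names latest and kind are made up for illustration, and it assumes, as in the question, that rows are already in time order and that each id has exactly one non-modify row.

```python
import pandas as pd

df = pd.DataFrame(
    {
        'id': [1, 1, 2, 2, 3],
        'time': ['t1', 't2', 't3', 't4', 't5'],
        'type': ['create', 'modify', 'modify', 'deploy', 'delete'],
    }
)

# (1) Last row per id (rows are assumed to be in time order already).
latest = df.drop_duplicates('id', keep='last').set_index('id')[['time']]

# (2) The single non-modify type per id, indexed by id for alignment.
kind = df.loc[df['type'] != 'modify'].set_index('id')['type']

out = latest.assign(type=kind)
print(out)
```

Because both pieces are indexed by id, .assign aligns them without an explicit merge.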
Alternatively, if you're not concerned with the efficiency of the computation, you can use a .groupby with a lambda to solve (2) instead.
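out = (
    df.groupby('id')
    .agg(
        time=('time', 'last'),
        # The indexer was lost in the source; taking the first
        # non-modify value per group is an assumed reconstruction.
        type=('type', lambda s: s.loc[s != 'modify'].iloc[0]),
    )
)
print(out)
   time    type
id
1    t2  create
2    t4  deploy
3    t5  delete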
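Both versions rely on 'last' meaning "newest", which only holds if the rows are in time order. If the real data holds timestamps and may arrive unsorted, sorting first is one way to keep that guarantee. The sketch below uses made-up dates and deliberately shuffled rows; none of these values come from the question.

```python
import pandas as pd

# Hypothetical data: same shape as the example, but with real timestamps
# and rows deliberately out of order.
df = pd.DataFrame(
    {
        'id': [2, 1, 3, 1, 2],
        'time': pd.to_datetime(
            ['2023-01-04', '2023-01-02', '2023-01-05', '2023-01-01', '2023-01-03']
        ),
        'type': ['deploy', 'modify', 'delete', 'create', 'modify'],
    }
)

# Sort so that 'last' within each id really is the newest row.
ordered = df.sort_values(['id', 'time'])

out = (
    ordered.groupby('id')[['time']].last()
    .assign(
        type=ordered.set_index('id').loc[lambda d: d['type'] != 'modify', 'type']
    )
)
print(out)
```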
Answer 2
Score: 0
Looks like you could first remove the "modify" rows, then keep the last row per group:
df.loc[df['type'].ne('modify')].drop_duplicates('id', keep='last')
Output:
id time type
0 1 t1 create
3 2 t4 deploy
4 3 t5 delete
That keeps the older time (t1) for id 1, so to also carry over the newer time from the "modify" row:
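m = df['type'].eq('modify')
out = (
    df.assign(
        time=df['time'].where(m)      # keep time only on "modify" rows
        .groupby(df['id']).bfill()    # pull a later "modify" time back within each id
        .fillna(df['time'])           # otherwise keep the row's own time
    )
    .loc[~m]                          # drop the "modify" rows
    .drop_duplicates(subset=['id'], keep='last')
)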
Output:
id time type
0 1 t2 create
3 2 t4 deploy
4 3 t5 delete
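To see why the masked back-fill works, it can help to look at the intermediate columns one at a time. The sketch below rebuilds the sample frame and splits the chain into named steps; the names only_modify, backfilled, new_time and steps are illustrative and not part of the answer.

```python
import pandas as pd

df = pd.DataFrame(
    {
        'id': [1, 1, 2, 2, 3],
        'time': ['t1', 't2', 't3', 't4', 't5'],
        'type': ['create', 'modify', 'modify', 'deploy', 'delete'],
    }
)

m = df['type'].eq('modify')

# Time survives only on "modify" rows; every other row becomes missing.
only_modify = df['time'].where(m)

# Within each id, a missing value is filled from a later "modify" row, if any.
backfilled = only_modify.groupby(df['id']).bfill()

# Rows with no later "modify" fall back to their own time.
new_time = backfilled.fillna(df['time'])

steps = pd.DataFrame(
    {
        'id': df['id'],
        'type': df['type'],
        'time': df['time'],
        'only_modify': only_modify,
        'backfilled': backfilled,
        'new_time': new_time,
    }
)
print(steps)
```

For id 2 the "modify" row is the older one, so the deploy row has nothing to back-fill from and keeps its own time, t4.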