If a column value matches between 2 rows, keep the newer row with a value in a third column based on a condition

Question

I have the following pandas DataFrame:

id    time     type
 1      t1     create
 1      t2     modify
 2      t3     modify
 2      t4     deploy
 3      t5     delete

The time is increasing, with t2 newer than t1, t3 newer than t2, etc. If an id has a row with type modify, then there will be another row for that id with a type that is not modify. Otherwise, there will be only one row, with a type other than modify. For the 2 rows of a pair that contains a modify, I want to keep whichever row is newer (larger tn), but I want the value in type that is not modify. So, I want the table above to be changed to the following table:

id    time     type
 1      t2     create
 2      t4     deploy
 3      t5     delete

Where the ids match, I want to keep the row with the newest time, but with the type that is not modify. In other words, if an id has no modify row, it has exactly one row and I want to keep it. If an id has a modify row, there is one other row with the same id. I want to keep whichever row of the pair is newer, and if the modify row is the newer one, I want to overwrite its type with the value from the type column of the older row.

I know how to keep the newer row for each id, but I don't know how to handle the case where the non-modify row is the older one: its type value should be copied into the type field of the newer row, replacing modify, and then the older row should be dropped. Can someone show me how to do this?

This is all I have been able to come up with:

    df.loc[df.groupby('id')['time'].idxmax(), type != 'modify']


It looks like I can simplify my question down to:

id    time     type
 1      t1     create
 1      t2     modify

Based on the previous df, I want to produce the following:

id    time     type
 1      t2     create

Answer 1

Score: 1

IIUC, for each id group you want to find:

  1. The latest time value. I suspect in reality these are something like timestamps, but in this example they're simple strings.
  2. The type value that is NOT modify.

We can use .groupby to accomplish both of these, but the most effective solution will be to use .groupby to solve ①, and then use a simple filter for ②.

EDIT: .drop_duplicates('id', keep='last') will also work to answer ① and will probably be faster than .groupby(...).last().

out = (
    # ① latest time per id (relies on the rows being in time order)
    df.groupby('id')[['time']].last()
    .assign(
        # ② the non-modify type per id, aligned on the id index
        type=df.set_index('id').loc[lambda d: d['type'] != 'modify', 'type']
    )
)

print(out)
   time    type
id             
1    t2  create
2    t4  deploy
3    t5  delete
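
The .drop_duplicates variant mentioned in the EDIT could look roughly like the sketch below. It assumes the rows are already in time order within each id, as in the question; the names `latest` and `kind` are just illustrative.

```python
import pandas as pd

# Sample data copied from the question
df = pd.DataFrame({
    "id": [1, 1, 2, 2, 3],
    "time": ["t1", "t2", "t3", "t4", "t5"],
    "type": ["create", "modify", "modify", "deploy", "delete"],
})

# ① newest row per id: keep the last occurrence of each id
#    (assumption: rows are sorted by time within each id)
latest = df.drop_duplicates("id", keep="last").set_index("id")[["time"]]

# ② the single non-modify type per id
kind = df.loc[df["type"] != "modify"].set_index("id")["type"]

# assign aligns `kind` on the id index
out = latest.assign(type=kind)
print(out)
```

Because both pieces are indexed by id, the assignment lines up the type values without an explicit merge.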

Alternatively, if you're not concerned with the efficiency of the computation, you can use a .groupby with a lambda to solve ② instead.

out = (
    df.groupby('id')
    .agg(
        time=('time', 'last'),
        type=('type', lambda s: s.loc…)
    )
)

print(out)
   time    type
id             
1    t2  create
2    t4  deploy
3    t5  delete
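
The lambda in the snippet above is cut off. One way it might be completed is sketched below; the selection `s.loc[s != 'modify'].iloc[0]` is an assumption on my part, not necessarily what the answer originally used, and it relies on every id having at least one non-modify row.

```python
import pandas as pd

# Sample data copied from the question
df = pd.DataFrame({
    "id": [1, 1, 2, 2, 3],
    "time": ["t1", "t2", "t3", "t4", "t5"],
    "type": ["create", "modify", "modify", "deploy", "delete"],
})

out = (
    df.groupby("id")
    .agg(
        # ① last time per id (assumes rows are in time order)
        time=("time", "last"),
        # ② first value in the group that is not 'modify'
        #    (hypothetical completion of the truncated lambda)
        type=("type", lambda s: s.loc[s != "modify"].iloc[0]),
    )
)
print(out)
```

The lambda runs once per group in Python, which is why this form tends to be slower than the filter-based version on large frames.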

Answer 2

Score: 0

Looks like you could first remove the "modify", then keep the last row per group:

df.loc[df['type'].ne('modify')].drop_duplicates('id', keep='last')

Output:

   id time    type
0   1   t1  create
3   2   t4  deploy
4   3   t5  delete

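That keeps the older time (t1) for id 1, so update the time based on "modify" first:

# True for the rows whose type is "modify"
m = df['type'].eq('modify')

out = (df.assign(time=df['time'].where(m)            # keep only the "modify" times
                      .groupby(df['id']).bfill()     # pull a later "modify" time back onto earlier rows of the same id
                      .fillna(df['time'])            # rows with no later "modify" keep their own time
                )
         .loc[~m]                                    # drop the "modify" rows
         .drop_duplicates(subset=['id'], keep='last')
      )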

Output:

   id time    type
0   1   t2  create
3   2   t4  deploy
4   3   t5  delete
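
Both answers lean on the rows already being in time order within each id. If that isn't guaranteed, and the times are real timestamps as Answer 1 suspects, a rough sketch along these lines might be safer. The sample dates and the names `latest` and `kind` are made up for illustration.

```python
import pandas as pd

# Hypothetical data: same shape as the question, but with real timestamps
# and the rows deliberately out of order
df = pd.DataFrame({
    "id": [2, 1, 3, 1, 2],
    "time": pd.to_datetime(
        ["2023-01-04", "2023-01-02", "2023-01-05", "2023-01-01", "2023-01-03"]
    ),
    "type": ["deploy", "modify", "delete", "create", "modify"],
})

# Newest time per id, independent of row order
latest = df.groupby("id")["time"].max()

# The single non-modify type per id
kind = df.loc[df["type"].ne("modify")].set_index("id")["type"]

# Combine on the id index, then turn id back into a column
out = latest.to_frame().assign(type=kind).reset_index()
print(out)
```

Using max on the time column instead of last removes the dependence on row order, at the cost of requiring a column that actually compares chronologically.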

huangapple
  • Posted on May 22, 2023, 21:44:41
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76306838.html