Posted: 2023-05-22 21:44:41

If a column value matches between 2 rows, keep the newer row with a value in a third column based on a condition

Question

I have the following pandas DataFrame:

id    time     type
 1      t1     create
 1      t2     modify
 2      t3     modify
 2      t4     deploy
 3      t5     delete


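The times are increasing: t2 is newer than t1, t3 is newer than t2, and so on. If an id has a row with type modify, it also has another row whose type is not modify. Otherwise the id has a single row, with a type other than modify. For each pair of rows that contains a modify, I want to keep whichever row is newer (larger tn), but with the type value that is not modify. So I want the table above to become: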

id    time     type
 1      t2     create
 2      t4     deploy
 3      t5     delete


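Where the ids match, I want to keep the row with the newest time, with the type that is not modify. In other words, if an id has no modify, there is only one row and I want to keep it. If there is a modify, there is another row with the same id. I want to keep whichever row of the pair is newer, and if the modify row is the newer one, I want to overwrite its type with the value from the type column of the older row.

I know how to keep the newer row for each id, but I don't know how to make sure that, when the non-modify row is the older one, its type value is copied into the newer row in place of modify, the newer row is kept, and the older row is dropped. Can someone show me how to do this?

This is all I have been able to come up with: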

```python
df.loc[df.groupby('id')['time'].idxmax(), type != 'modify']
```

It looks like I can simplify my question to:

    id    time    type
     1     t1    create
     1     t2    modify

Based on that DataFrame, I want to produce the following:

    id    time    type
     1     t2    create
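
For anyone who wants to try the answers, here is a minimal sketch that builds the example frame and the desired result. It assumes the times are plain strings, as they appear in the question. The variable name `expected` is only for illustration.

```python
import pandas as pd

# Times are kept as plain strings here, exactly as written in the question;
# real data would more likely hold timestamps.
df = pd.DataFrame(
    {
        "id": [1, 1, 2, 2, 3],
        "time": ["t1", "t2", "t3", "t4", "t5"],
        "type": ["create", "modify", "modify", "deploy", "delete"],
    }
)

# The desired result, written out by hand for later comparison.
expected = pd.DataFrame(
    {
        "id": [1, 2, 3],
        "time": ["t2", "t4", "t5"],
        "type": ["create", "deploy", "delete"],
    }
)

print(df)
```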



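Answer 1

Score: 1

IIUC, for each group of rows sharing an id, you want to find:

① the latest time value (I suspect these are really something like timestamps, but in this example they are simple strings);
② the type value that is not modify.

We can use .groupby to accomplish both, but the most efficient solution is to use .groupby for ① and a simple filter for ②.

EDIT: .drop_duplicates('id', keep='last') also answers ① and will probably be faster than .groupby(...).last().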

out = (
    df.groupby('id')[['time']].last()
    .assign(
        type=df.set_index('id').loc[lambda d: d['type'] != 'modify', 'type']
    )
)

print(out)
   time    type
id             
1    t2  create
2    t4  deploy
3    t5  delete
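
The `.drop_duplicates` variant mentioned in the edit could look roughly like the sketch below. It assumes, as the question states, that rows are already ordered by time within each id and that each id has one non-modify row. The names `latest` and `kept_type` are only for illustration.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "id": [1, 1, 2, 2, 3],
        "time": ["t1", "t2", "t3", "t4", "t5"],
        "type": ["create", "modify", "modify", "deploy", "delete"],
    }
)

# ① newest row per id; relies on the rows already being in time order
latest = df.drop_duplicates("id", keep="last").set_index("id")[["time"]]

# ② the non-"modify" type of each id, indexed by id so it aligns with `latest`
kept_type = df.loc[df["type"] != "modify"].set_index("id")["type"]

# assign aligns the Series on the shared id index
out = latest.assign(type=kept_type)
print(out)
```

Because `assign` aligns on the index, an id with more than one non-modify row would make the alignment fail, so this only fits data shaped like the example.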

Alternatively, if efficiency is not a concern, you can solve ② with a lambda inside the .groupby aggregation instead.

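out = (
    df.groupby('id')
    .agg(
        time=('time', 'last'),
        # The selector was cut off in the source after `s.loc`; taking the
        # first non-'modify' value is an assumed reconstruction.
        type=('type', lambda s: s.loc[s != 'modify'].iloc[0])
    )
)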

print(out)
   time    type
id             
1    t2  create
2    t4  deploy
3    t5  delete
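
Both snippets take "last" to mean the last row in the frame, so they rely on the rows being ordered by time. If the times are real timestamps and the rows may arrive out of order, sorting first is one way to keep that assumption true. The sketch below uses made-up dates, and its lambda is just one possible way to pick the non-modify value.

```python
import pandas as pd

# Hypothetical timestamps, deliberately out of order.
df = pd.DataFrame(
    {
        "id": [2, 1, 3, 1, 2],
        "time": pd.to_datetime(
            ["2023-05-04", "2023-05-02", "2023-05-05", "2023-05-01", "2023-05-03"]
        ),
        "type": ["deploy", "modify", "delete", "create", "modify"],
    }
)

out = (
    df.sort_values("time")  # make "last" mean "newest"
    .groupby("id")
    .agg(
        time=("time", "last"),
        # first value in the group that is not "modify"
        type=("type", lambda s: s[s != "modify"].iloc[0]),
    )
)
print(out)
```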


Answer 2

Score: 0

It looks like you could first remove the "modify" rows, then keep the last row per group:

df.loc[df['type'] != 'modify'].drop_duplicates('id', keep='last')

Output:

   id time    type
0   1   t1  create
3   2   t4  deploy
4   3   t5  delete

That keeps the older time (t1) for id 1. To also take the time from a newer "modify" row:

m = df['type'] == 'modify'

out = (df.assign(time=df['time'].where(m)
                      .groupby(df['id']).bfill()
                      .fillna(df['time'])
                )
         .loc[~m]
         .drop_duplicates(subset=['id'], keep='last')
      )

Output:

   id time    type
0   1   t2  create
3   2   t4  deploy
4   3   t5  delete
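
To see why the chained `where` / `bfill` / `fillna` works, it may help to look at each intermediate Series. The sketch below rebuilds the example frame and splits the chain into three steps. The names `step1` to `step3` are only for illustration.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "id": [1, 1, 2, 2, 3],
        "time": ["t1", "t2", "t3", "t4", "t5"],
        "type": ["create", "modify", "modify", "deploy", "delete"],
    }
)
m = df["type"] == "modify"

# Keep the time only on "modify" rows; every other row becomes missing.
step1 = df["time"].where(m)

# Within each id, pull a later "modify" time back onto the earlier rows.
step2 = step1.groupby(df["id"]).bfill()

# Rows that had no later "modify" row keep their own time.
step3 = step2.fillna(df["time"])

print(
    pd.DataFrame(
        {"id": df["id"], "type": df["type"],
         "where": step1, "bfill": step2, "fillna": step3}
    )
)
```

The back-fill only moves times from later rows to earlier ones, so a "modify" row that comes before its partner (id 2 here) never overwrites the newer time.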

