February 6, 2023 09:39:54

Python delete rows for each group after first occurrence in a column

Question

I have a dataframe as follows:

import pandas as pd

df = pd.DataFrame({'Key':[1,1,1,1,2,2,2,4,4,4,5,5],
                   'Activity':['A','A','H','B','B','H','H','A','C','H','H','B'],
                   'Date':['2022-12-03','2022-12-04','2022-12-06','2022-12-08','2022-12-03','2022-12-06','2022-12-10','2022-12-03','2022-12-04','2022-12-07','2022-12-03','2022-12-13']})

I need to count the activities for each 'Key' that occur before 'Activity' == 'H' as follows:

Required Output

My Approach

Sort df by Key & Date (sample input is already sorted)
Drop the rows that occur after the 'H' Activity in each group as follows:
Group by: df.groupby(['Key', 'Activity']).count()

Is there a better approach? If not, please help me with code for dropping the rows that occur after the 'H' Activity in each group.

Thanks in advance!

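Answer 1

Score: 1

You can bring the H dates "back" into each previous row to use in a comparison.

First mark each H date in a new column:

# Assumes Date is datetime, as the NaT values in the output below suggest
df["Date"] = pd.to_datetime(df["Date"])
df.loc[df["Activity"] == "H", "End"] = df["Date"]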

    Key Activity        Date         End
0     1        A  2022-12-03         NaT
1     1        A  2022-12-04         NaT
2     1        H  2022-12-06  2022-12-06
3     1        B  2022-12-08         NaT
4     2        B  2022-12-03         NaT
5     2        H  2022-12-06  2022-12-06
6     2        H  2022-12-10  2022-12-10
7     4        A  2022-12-03         NaT
8     4        C  2022-12-04         NaT
9     4        H  2022-12-07  2022-12-07
10    5        H  2022-12-03  2022-12-03
11    5        B  2022-12-13         NaT

Backward fill the new column for each group:

df["End"] = df.groupby("Key")["End"].bfill()

    Key Activity        Date         End
0     1        A  2022-12-03  2022-12-06
1     1        A  2022-12-04  2022-12-06
2     1        H  2022-12-06  2022-12-06
3     1        B  2022-12-08         NaT
4     2        B  2022-12-03  2022-12-06
5     2        H  2022-12-06  2022-12-06
6     2        H  2022-12-10  2022-12-10
7     4        A  2022-12-03  2022-12-07
8     4        C  2022-12-04  2022-12-07
9     4        H  2022-12-07  2022-12-07
10    5        H  2022-12-03  2022-12-03
11    5        B  2022-12-13         NaT

You can then select rows with Date before End:

df.loc[df["Date"] < df["End"]]

   Key Activity        Date         End
0    1        A  2022-12-03  2022-12-06
1    1        A  2022-12-04  2022-12-06
4    2        B  2022-12-03  2022-12-06
7    4        A  2022-12-03  2022-12-07
8    4        C  2022-12-04  2022-12-07

To generate the final form, you can use .pivot_table():

(df.loc[df["Date"] < df["End"]]
   .pivot_table(index="Key", columns="Activity", values="Date", aggfunc="count")
   .reindex(df["Key"].unique())  # add in keys with no match, e.g. `5`
   .fillna(0)
   .astype(int))

Activity  A  B  C
Key              
1         2  0  0
2         0  1  0
4         1  0  1
5         0  0  0
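
The steps above can be run end to end as one script. This is a minimal sketch, assuming Date is converted with pd.to_datetime so that missing End values are NaT; the names before_h and counts are illustrative and not part of the original answer, and Series.where is used here in place of the .loc assignment to build End.

```python
import pandas as pd

df = pd.DataFrame({'Key': [1, 1, 1, 1, 2, 2, 2, 4, 4, 4, 5, 5],
                   'Activity': ['A', 'A', 'H', 'B', 'B', 'H', 'H', 'A', 'C', 'H', 'H', 'B'],
                   'Date': ['2022-12-03', '2022-12-04', '2022-12-06', '2022-12-08',
                            '2022-12-03', '2022-12-06', '2022-12-10', '2022-12-03',
                            '2022-12-04', '2022-12-07', '2022-12-03', '2022-12-13']})

# Assumption: work with real datetimes rather than strings
df["Date"] = pd.to_datetime(df["Date"])

# Keep the date only on H rows (NaT elsewhere), then back-fill within each Key
df["End"] = df["Date"].where(df["Activity"] == "H")
df["End"] = df.groupby("Key")["End"].bfill()

# Rows strictly before the next H of their Key; comparisons with NaT are False
before_h = df.loc[df["Date"] < df["End"]]

counts = (before_h
          .pivot_table(index="Key", columns="Activity", values="Date", aggfunc="count")
          .reindex(df["Key"].unique())  # restore keys with no remaining rows
          .fillna(0)
          .astype(int))
print(counts)
```

Note that back-filling uses the next H date, so a row sitting between two H rows of the same Key would still be selected. The sample data has no such row.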

Answer 2

Score: 1

Try this:

(df.loc[df['Activity'].eq('H').groupby(df['Key']).cumsum().eq(0)]
.set_index('Key')['Activity']
.str.get_dummies()
.groupby(level=0).sum()
.reindex(df['Key'].unique(), fill_value=0)
.reset_index())

Or:

# astype(bool): where() expects a boolean mask, and cumprod() may return integers
(df['Activity'].where(df['Activity'].ne('H').groupby(df['Key']).cumprod().astype(bool))
.str.get_dummies()
.groupby(df['Key']).sum())

Output:

   Key  A  B  C
0    1  2  0  0
1    2  0  1  0
2    4  1  0  1
3    5  0  0  0
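
To see why the first snippet keeps only the rows before the first H, it can help to look at the intermediate mask on its own. This is a small sketch using the question's sample data; the names h_seen, keep and kept_index are illustrative and not part of the original answer.

```python
import pandas as pd

df = pd.DataFrame({'Key': [1, 1, 1, 1, 2, 2, 2, 4, 4, 4, 5, 5],
                   'Activity': ['A', 'A', 'H', 'B', 'B', 'H', 'H', 'A', 'C', 'H', 'H', 'B']})

# Running count of H rows seen so far within each Key (the H row itself counts)
h_seen = df["Activity"].eq("H").groupby(df["Key"]).cumsum()

# A row is kept only while no H has been seen yet in its Key
keep = h_seen.eq(0)

print(df.assign(h_seen=h_seen, keep=keep))

kept_index = df.index[keep].tolist()
print(kept_index)  # [0, 1, 4, 7, 8]
```

Because the count becomes 1 on the first H row, that row and everything after it in the same Key are dropped, which is why Key 5 (whose first row is H) contributes no rows and has to be restored with reindex.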

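Answer 3

Score: 1

You can try:

# sort by Key and Date
df.sort_values(['Key', 'Date'], inplace=True)
# this is to keep Key in the result when no values are kept after the filter
df.Key = df.Key.astype('category')
# drop the first H and every row after it for each Key, then pivot
# (observed=False is passed explicitly so categories with no rows left, e.g. Key 5, are kept)
df[~df.Activity.eq('H').groupby(df.Key).cummax()].pivot_table(
  index='Key', columns='Activity', aggfunc='size', observed=False
).reset_index()
# Activity Key  A  B  C
# 0          1  2  0  0
# 1          2  0  1  0
# 2          4  1  0  1
# 3          5  0  0  0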
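
On the sample data, the cummax() filter used here selects the same rows as the cumsum().eq(0) filter from Answer 2. The check below is a sketch under that assumption; the variable names are illustrative, and the .astype(bool) call is only a defensive cast in case the cumulative result is not already boolean.

```python
import pandas as pd

df = pd.DataFrame({'Key': [1, 1, 1, 1, 2, 2, 2, 4, 4, 4, 5, 5],
                   'Activity': ['A', 'A', 'H', 'B', 'B', 'H', 'H', 'A', 'C', 'H', 'H', 'B']})

is_h = df["Activity"].eq("H")

# True from the first H onward within each Key
at_or_after_h = is_h.groupby(df["Key"]).cummax().astype(bool)

# True while no H has been seen yet within each Key
before_h = is_h.groupby(df["Key"]).cumsum().eq(0)

# Both masks should agree row by row
same = bool(((~at_or_after_h) == before_h).all())
print(same)
print(df.loc[before_h, ["Key", "Activity"]])
```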
