Python: delete rows for each group after the first occurrence in a column

Question

I have a DataFrame as follows:

    import pandas as pd

    df = pd.DataFrame({'Key':[1,1,1,1,2,2,2,4,4,4,5,5],
                       'Activity':['A','A','H','B','B','H','H','A','C','H','H','B'],
                       'Date':['2022-12-03','2022-12-04','2022-12-06','2022-12-08','2022-12-03','2022-12-06','2022-12-10','2022-12-03','2022-12-04','2022-12-07','2022-12-03','2022-12-13']})

I need to count the activities for each 'Key' that occur before 'Activity' == 'H' as follows:

Required Output

My Approach

  1. Sort df by Key and Date (the sample input is already sorted).
  2. Drop the rows that occur after the 'H' Activity in each group.
  3. Group by Key and Activity: df.groupby(['Key', 'Activity']).count()

Is there a better approach? If not, please help me with the code for dropping the rows that occur after the 'H' Activity in each group.

Thanks in advance!

Answer 1

Score: 1

You can bring the H dates "back" into each previous row to use in a comparison.

First mark each H date in a new column:

    df.loc[df["Activity"] == "H", "End"] = df["Date"]

        Key  Activity        Date         End
    0     1         A  2022-12-03         NaT
    1     1         A  2022-12-04         NaT
    2     1         H  2022-12-06  2022-12-06
    3     1         B  2022-12-08         NaT
    4     2         B  2022-12-03         NaT
    5     2         H  2022-12-06  2022-12-06
    6     2         H  2022-12-10  2022-12-10
    7     4         A  2022-12-03         NaT
    8     4         C  2022-12-04         NaT
    9     4         H  2022-12-07  2022-12-07
    10    5         H  2022-12-03  2022-12-03
    11    5         B  2022-12-13         NaT

Backward fill the new column for each group:

    df["End"] = df.groupby("Key")["End"].bfill()

        Key  Activity        Date         End
    0     1         A  2022-12-03  2022-12-06
    1     1         A  2022-12-04  2022-12-06
    2     1         H  2022-12-06  2022-12-06
    3     1         B  2022-12-08         NaT
    4     2         B  2022-12-03  2022-12-06
    5     2         H  2022-12-06  2022-12-06
    6     2         H  2022-12-10  2022-12-10
    7     4         A  2022-12-03  2022-12-07
    8     4         C  2022-12-04  2022-12-07
    9     4         H  2022-12-07  2022-12-07
    10    5         H  2022-12-03  2022-12-03
    11    5         B  2022-12-13         NaT

You can then select the rows where Date is before End:

    df.loc[df["Date"] < df["End"]]

       Key  Activity        Date         End
    0    1         A  2022-12-03  2022-12-06
    1    1         A  2022-12-04  2022-12-06
    4    2         B  2022-12-03  2022-12-06
    7    4         A  2022-12-03  2022-12-07
    8    4         C  2022-12-04  2022-12-07

To generate the final form, you can use .pivot_table():

    (df.loc[df["Date"] < df["End"]]
       .pivot_table(index="Key", columns="Activity", values="Date", aggfunc="count")
       .reindex(df["Key"].unique())  # add in keys with no match, e.g. `5`
       .fillna(0)
       .astype(int))

    Activity  A  B  C
    Key
    1         2  0  0
    2         0  1  0
    4         1  0  1
    5         0  0  0
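
Put together, the steps above might look like the following self-contained sketch. It assumes the Date strings are first parsed with pd.to_datetime (the NaT values in the printed tables suggest the answer did this, but that is an assumption), and it builds the End column with Series.where rather than the .loc assignment shown above. The names h_dates, before_h and counts are illustrative only.

```python
import pandas as pd

df = pd.DataFrame({
    "Key": [1, 1, 1, 1, 2, 2, 2, 4, 4, 4, 5, 5],
    "Activity": ["A", "A", "H", "B", "B", "H", "H", "A", "C", "H", "H", "B"],
    "Date": ["2022-12-03", "2022-12-04", "2022-12-06", "2022-12-08",
             "2022-12-03", "2022-12-06", "2022-12-10", "2022-12-03",
             "2022-12-04", "2022-12-07", "2022-12-03", "2022-12-13"],
})
# Assumption: parse the strings so that comparisons use real dates.
df["Date"] = pd.to_datetime(df["Date"])

# Date of each H row, NaT everywhere else.
h_dates = df["Date"].where(df["Activity"].eq("H"))

# Carry each H date back to the rows that precede it within the same Key.
df["End"] = h_dates.groupby(df["Key"]).bfill()

# Rows strictly before an H date; comparisons against NaT are False.
before_h = df.loc[df["Date"] < df["End"]]

counts = (
    before_h
    .pivot_table(index="Key", columns="Activity", values="Date", aggfunc="count")
    .reindex(df["Key"].unique())  # restore keys with no surviving rows
    .fillna(0)
    .astype(int)
)
print(counts)
```

One caveat with this technique: bfill carries back the nearest following H, so if a Key had non-H rows between two H rows, those rows would also pass the Date < End test. The sample data has no such case.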

Answer 2

Score: 1

Try this:

    (df.loc[df['Activity'].eq('H').groupby(df['Key']).cumsum().eq(0)]
       .set_index('Key')['Activity']
       .str.get_dummies()
       .groupby(level=0).sum()
       .reindex(df['Key'].unique(), fill_value=0)
       .reset_index())

or

    # cumprod may return integers, so cast back to bool: .where() needs a boolean condition
    (df['Activity'].where(df['Activity'].ne('H').groupby(df['Key']).cumprod().astype(bool))
       .str.get_dummies()
       .groupby(df['Key']).sum())

Output (from the first snippet):

       Key  A  B  C
    0    1  2  0  0
    1    2  0  1  0
    2    4  1  0  1
    3    5  0  0  0
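
The answer gives no explanation, so here is a small sketch of why the first mask works, assuming the same sample data as in the question. A running per-Key count of H rows stays at 0 only for rows that come before the first H, so comparing it to 0 keeps exactly those rows. The names is_h, h_seen and keep are illustrative, and the explicit astype(int) is a precaution rather than part of the original answer.

```python
import pandas as pd

df = pd.DataFrame({
    "Key": [1, 1, 1, 1, 2, 2, 2, 4, 4, 4, 5, 5],
    "Activity": ["A", "A", "H", "B", "B", "H", "H", "A", "C", "H", "H", "B"],
})

is_h = df["Activity"].eq("H")

# Number of H rows seen so far within each Key (the H row itself counts).
h_seen = is_h.astype(int).groupby(df["Key"]).cumsum()

# Zero H rows seen so far means the row is before the first H.
keep = h_seen.eq(0)

print(df.assign(h_seen=h_seen, keep=keep))
```

Because the H row itself already has a count of 1, it is dropped together with everything after it, which is why no H column appears in the result.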

Answer 3

Score: 1

You can try:

    # sort by Key and Date
    df.sort_values(['Key', 'Date'], inplace=True)
    # make Key categorical so every Key stays in the result even when no rows survive the filter
    df.Key = df.Key.astype('category')
    # drop the first H and all rows after it for each Key, then pivot
    # (observed=False is spelled out because newer pandas versions change the default)
    df[~df.Activity.eq('H').groupby(df.Key).cummax()].pivot_table(
        index='Key', columns='Activity', aggfunc='size', observed=False
    ).reset_index()
    # Activity Key  A  B  C
    # 0          1  2  0  0
    # 1          2  0  1  0
    # 2          4  1  0  1
    # 3          5  0  0  0
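
For comparison, the same sort, filter and count steps can be sketched without modifying df in place. This is a variant, not the answer's own code: it swaps the categorical Key and pivot_table for pd.crosstab followed by reindex, and the names ordered, from_first_h, kept and counts are illustrative. It also assumes the dates are ISO-formatted strings, so sorting them as text matches chronological order.

```python
import pandas as pd

df = pd.DataFrame({
    "Key": [1, 1, 1, 1, 2, 2, 2, 4, 4, 4, 5, 5],
    "Activity": ["A", "A", "H", "B", "B", "H", "H", "A", "C", "H", "H", "B"],
    "Date": ["2022-12-03", "2022-12-04", "2022-12-06", "2022-12-08",
             "2022-12-03", "2022-12-06", "2022-12-10", "2022-12-03",
             "2022-12-04", "2022-12-07", "2022-12-03", "2022-12-13"],
})

# Step 1: sort by Key and Date (returns a new frame, df is left untouched).
ordered = df.sort_values(["Key", "Date"])

# Step 2: flag the first H and everything after it within each Key.
# cummax on the 0/1 flag stays at 1 once an H has been seen.
from_first_h = (
    ordered["Activity"].eq("H").astype(int)
    .groupby(ordered["Key"]).cummax()
    .astype(bool)
)
kept = ordered[~from_first_h]

# Step 3: count activities per Key, then restore keys that lost all their rows.
counts = (
    pd.crosstab(kept["Key"], kept["Activity"])
    .reindex(df["Key"].unique(), fill_value=0)
)
print(counts)
```

Reindexing on the original keys plays the same role as the categorical conversion in the answer: it keeps Key 5 in the result even though all of its rows are dropped.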

huangapple
  • Posted on February 6, 2023, 09:39:54
  • When reposting, please keep the link to this article: https://go.coder-hub.com/75356685.html