How do I conditionally group rows of a dataframe?

Question

Column '2' of df takes three possible values: X, Y, and Z. I want to group each X row together with any Y rows that directly follow it. I don't need to keep the Z rows in the groups.

I have tried using groupby() like this: df.groupby(df['2'] == 'X'), however this only separates the X rows from everything else.
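To make the problem with that attempt concrete, here is a minimal sketch. It rebuilds the sample frame from this question and assumes the column is addressed by the string name '2'. A boolean key has only two distinct values, so groupby() can only ever produce two groups, regardless of how many X rows there are.

```python
import pandas as pd

df = pd.DataFrame({'1': list('abcdefghijklmnop'),
                   '2': list('ZXYZZXXZXYYZXZXY')})

# Grouping by a boolean mask yields one group for False and one for True.
by_mask = df.groupby(df['2'] == 'X')
sizes = by_mask.size()
print(sizes)
```

What is needed instead is a key that changes value at every X row.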

How could I go about creating the groupings that I am after?

import pandas as pd

df = pd.DataFrame({'1':['a','b','c','d','e','f','g','h','i','j','k','l','m','n','o','p'],
                   '2':['Z','X','Y','Z','Z','X','X','Z','X','Y','Y','Z','X','Z','X','Y']})

Desired groupings:

1	b	X
2	c	Y
---------
5	f	X
---------
6	g	X
---------
8	i	X
9	j	Y
10	k	Y
---------
12	m	X
---------
14	o	X
15	p	Y

Answer 1

Score: 2

You can use the running count of X rows as the group key, restricted to the non-Z rows:

>>> list(df.groupby(df['2'].eq('X').cumsum().loc[df['2'] != 'Z']))
[(1.0,
     1  2
  1  b  X
  2  c  Y),
 (2.0,
     1  2
  5  f  X),
 (3.0,
     1  2
  6  g  X),
 (4.0,
      1  2
  8   i  X
  9   j  Y
  10  k  Y),
 (5.0,
      1  2
  12  m  X),
 (6.0,
      1  2
  14  o  X
  15  p  Y)]

Details:

# Same as grp = df['2'].eq('X').cumsum().where(df['2'] != 'Z')
grp = df['2'].eq('X').cumsum().loc[df['2'] != 'Z']
pd.concat([df, grp.rename('G')], axis=1)

# Output
    1  2    G
0   a  Z  NaN
1   b  X  1.0
2   c  Y  1.0
3   d  Z  NaN
4   e  Z  NaN
5   f  X  2.0
6   g  X  3.0
7   h  Z  NaN
8   i  X  4.0
9   j  Y  4.0
10  k  Y  4.0
11  l  Z  NaN
12  m  X  5.0
13  n  Z  NaN
14  o  X  6.0
15  p  Y  6.0
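
One caveat worth checking against your own data: the cumulative sum keeps its value across Z rows, so a Y that appears after an intervening Z (for example X, Z, Y) would still be attached to the preceding X. The sample data contains no such case. If it can occur, one possible sketch is to also drop every row that comes after a non-X/Y row within the same run. The helper name group_x_with_trailing_y and its col parameter are illustrative assumptions, not part of the answer above.

```python
import pandas as pd

def group_x_with_trailing_y(df, col='2'):
    """Return a list of sub-frames: each X row plus the Y rows directly after it."""
    is_x = df[col].eq('X')
    run = is_x.cumsum().rename('run')        # 0 before the first X, then 1, 2, ...
    # Count interruptions (anything that is not X or Y) seen so far in each run.
    interrupted = (~df[col].isin(['X', 'Y'])).astype(int).groupby(run).cumsum()
    # Keep rows that belong to a run and come before any interruption.
    keep = run.gt(0) & interrupted.eq(0)
    return [g for _, g in df[keep].groupby(run[keep])]

df = pd.DataFrame({'1': list('abcdefghijklmnop'),
                   '2': list('ZXYZZXXZXYYZXZXY')})
groups = group_x_with_trailing_y(df)
for g in groups:
    print(g)
    print('---------')
```

On the sample frame this gives the same six groups as the one-liner; the two only differ when a Y follows a Z.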

Answer 2

Score: 1

I assume you just want a column that labels each group starting with X. If so, you can do the following:

[1] Create a column that is 1 where the value is X and 0 otherwise.

[2] Take the cumulative sum of that column to get the group labels.

import numpy as np

df['check_x'] = np.where(df['2'] == 'X', 1, 0)
df['group'] = df['check_x'].cumsum()
df
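
The snippet above stops at the label column. Z rows still carry the label of the preceding X, and rows before the first X get label 0, so a filtering step is needed to reach the groupings asked for in the question. A rough sketch of that last step, where the names wanted and groups are illustrative:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'1': list('abcdefghijklmnop'),
                   '2': list('ZXYZZXXZXYYZXZXY')})

df['check_x'] = np.where(df['2'] == 'X', 1, 0)
df['group'] = df['check_x'].cumsum()

# Drop Z rows and anything before the first X (label 0).
wanted = df[(df['2'] != 'Z') & (df['group'] > 0)]

# Map each group label to its rows, keeping only the original columns.
groups = {key: g[['1', '2']] for key, g in wanted.groupby('group')}

for key, g in groups.items():
    print(key)
    print(g)
```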

Answer 3

Score: -1

Example:

import pandas as pd

df = pd.DataFrame({'1': ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p'],
                   '2': ['Z', 'X', 'Y', 'Z', 'Z', 'X', 'X', 'Z', 'X', 'Y', 'Y', 'Z', 'X', 'Z', 'X', 'Y']})

# Find the indices where 'X' occurs in column 2
x_indices = df.index[df['2'] == 'X']

# Initialize an empty list to store the groupings
grouped_indices = []

# Iterate over the X indices and create groups
for i in range(len(x_indices)):
    start_index = x_indices[i]  # Starting index of the group
    end_index = len(df) if i == len(x_indices) - 1 else x_indices[i + 1]  # Ending index of the group
    group = df[start_index:end_index]  # Slice the dataframe based on the start and end indices
    group = group[group['2'] != 'Z']  # Exclude rows with 'Z' in column 2
    grouped_indices.append(group)  # Add the group to the list

# Print the desired groupings
for group in grouped_indices:
    print(group)
    print('---------')
#    1  2
# 1  b  X
# 2  c  Y
# ---------
#    1  2
# 5  f  X
# ---------
#    1  2
# 6  g  X
# ---------
#     1  2
# 8   i  X
# 9   j  Y
# 10  k  Y
# ---------
#     1  2
# 12  m  X
# ---------
#     1  2
# 14  o  X
# 15  p  Y
# ---------
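
Note that the loop above takes labels from df.index and uses them as slice bounds, which works here because the default RangeIndex labels coincide with row positions. If the frame has a different index, a position-based variant may be safer. The following is a sketch under that assumption; the letter index is made up purely to show the difference.

```python
import numpy as np
import pandas as pd

# Same data as above, but with a non-default (string) index for illustration.
df = pd.DataFrame({'1': list('abcdefghijklmnop'),
                   '2': list('ZXYZZXXZXYYZXZXY')},
                  index=list('ABCDEFGHIJKLMNOP'))

# Integer positions of the X rows, plus len(df) as the final boundary.
x_pos = np.flatnonzero(df['2'].eq('X').to_numpy())
bounds = np.append(x_pos, len(df))

groups = []
for start, end in zip(bounds[:-1], bounds[1:]):
    chunk = df.iloc[start:end]                 # positional slice, end exclusive
    groups.append(chunk[chunk['2'] != 'Z'])    # exclude Z rows

for group in groups:
    print(group)
    print('---------')
```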

huangapple
  • Posted on June 8, 2023, 19:04:27
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76431180.html