pandas: add a grouping variable to clusters of rows that meet a criterion

Question

I have this dataframe:

import pandas as pd

df = pd.DataFrame({'forms_a_cluster': [False, False, True, True, True, False, False, False,
                                       True, True, False, True, True, True, False],
                   'cluster_number': [False, False, 1, 1, 1, False, False, False,
                                      2, 2, False, 3, 3, 3, False]})

The idea is that rows meeting some criterion are flagged as True, and consecutive True rows form a cluster. I want to label each cluster as cluster_1, cluster_2, cluster_3, etc. The cluster_number column shows the output I'm hoping for. I have no idea how to produce it, because in the real data I have to do this many times on different datasets with different numbers of rows, and the cluster sizes will be different every time. Do you have any idea how to go about this?

Answer 1

Score: 2

You can use groupby.ngroup on the runs of successive values, pre-filtered to keep only the True rows:

# group by successive values
m = df['forms_a_cluster'].ne(df['forms_a_cluster'].shift()).cumsum()

# filter groups of True, add group number
# fill values with False
df['cluster_number'] = (m[df['forms_a_cluster']]
                        .groupby(m).ngroup().add(1)
                        .reindex(df.index, fill_value=False)
                        )
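
To make the steps above easier to follow, here is a minimal sketch that prints the intermediate values. The names flag, true_runs, numbers and steps are only illustrative, and the groupby is done on the filtered run ids themselves, which is a small simplification of the snippet above rather than the exact same call.

```python
import pandas as pd

df = pd.DataFrame({'forms_a_cluster': [False, False, True, True, True, False, False, False,
                                       True, True, False, True, True, True, False]})
flag = df['forms_a_cluster']

# a new run starts wherever the value differs from the previous row
m = flag.ne(flag.shift()).cumsum()

# keep only the run ids of the True rows, then number those runs 1, 2, 3, ...
true_runs = m[flag]
numbers = true_runs.groupby(true_runs).ngroup().add(1)

# side-by-side view; rows outside a cluster have no number yet (NaN)
steps = pd.DataFrame({'flag': flag, 'run_id': m, 'number': numbers})
print(steps)
```

The final reindex(df.index, fill_value=False) in the answer is what turns the missing rows into False.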

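Or with arithmetic. The run ids in m alternate between False runs and True runs, so m//2 numbers the True runs from 1 when the data starts with False; adding the first value (counted as 0 or 1) covers the case where the data starts with True: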

m = df['forms_a_cluster'].ne(df['forms_a_cluster'].shift()).cumsum()
df['cluster_number'] = (df['forms_a_cluster']
                        .mask(df['forms_a_cluster'],
                              m//2 + df['forms_a_cluster'].iloc[0])
                       )

Output:

    forms_a_cluster cluster_number
0             False          False
1             False          False
2              True              1
3              True              1
4              True              1
5             False          False
6             False          False
7             False          False
8              True              2
9              True              2
10            False          False
11             True              3
12             True              3
13             True              3
14            False          False

Other example:

    forms_a_cluster cluster_number
0              True              1
1              True              1
2             False          False
3              True              2
4              True              2
5             False          False
6             False          False
7             False          False
8              True              3
9              True              3
10            False          False
11             True              4
12             True              4
13             True              4
14            False          False
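
As a rough check of the arithmetic variant, the sketch below applies m//2 plus the first value to two short inputs, one starting with False and one starting with True. The helper run_ids and the two sample lists are made up for illustration and are not part of the answer.

```python
import pandas as pd

def run_ids(s):
    """Number the runs of equal values 1, 2, 3, ... (hypothetical helper)."""
    return s.ne(s.shift()).cumsum()

starts_false = [False, True, True, False, True]
starts_true = [True, True, False, True, False]

results = []
for flags in (starts_false, starts_true):
    s = pd.Series(flags)
    m = run_ids(s)
    # True runs get even ids when the data starts with False and odd ids otherwise;
    # adding the first value (0 or 1) compensates for the odd case
    number = m // 2 + int(s.iloc[0])
    # only the values on the True rows are meaningful
    results.append(number[s].tolist())

print(results)  # [[1, 1, 2], [1, 1, 2]]
```

Both inputs contain two True runs, and in both cases they come out numbered 1 and 2.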

Answer 2

Score: 0

Build the consecutive groups with Series.cumsum on the inverted mask, then apply factorize to the True rows so that the group numbers start at 1:

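# boolean mask of the rows that belong to a cluster
m = df['forms_a_cluster']
# start from a copy of the flags; object dtype lets the column hold both False and integers
df['cluster_number'] = df['forms_a_cluster'].astype(object)
# number the True runs 1, 2, 3, ... in order of appearance
df.loc[m, 'cluster_number'] = pd.factorize((~m).cumsum()[m])[0] + 1

Output:

    forms_a_cluster cluster_number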
0             False          False
1             False          False
2              True              1
3              True              1
4              True              1
5             False          False
6             False          False
7             False          False
8              True              2
9              True              2
10            False          False
11             True              3
12             True              3
13             True              3
14            False          False
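
To see why the inverted cumsum works, here is a small sketch of the intermediate values on the same flags. The names counter and per_true_row are only illustrative.

```python
import pandas as pd

m = pd.Series([False, False, True, True, True, False, False, False,
               True, True, False, True, True, True, False])

# the counter only advances on False rows, so it stays constant within a True run
counter = (~m).cumsum()

# keep the counter for the True rows; each distinct value is one cluster
per_true_row = counter[m]

# factorize maps the distinct values to 0, 1, 2, ... in order of appearance
codes, uniques = pd.factorize(per_true_row)

print(per_true_row.tolist())  # [2, 2, 2, 5, 5, 6, 6, 6]
print((codes + 1).tolist())   # [1, 1, 1, 2, 2, 3, 3, 3]
```

The raw counter values (2, 5, 6) depend on how many False rows precede each cluster, which is why factorize is needed to renumber them consecutively.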

Other example, where the first row is True:

    forms_a_cluster cluster_number
0              True              1
1             False          False
2              True              2
3              True              2
4              True              2
5             False          False
6             False          False
7             False          False
8              True              3
9              True              3
10            False          False
11             True              4
12             True              4
13             True              4
14            False          False
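
Since the question mentions repeating this on many datasets and wanting labels such as cluster_1, one possible way to package the factorize approach is a small function. This is only a sketch: the function name label_clusters, the prefix parameter, the cluster_label column and the string labels are assumptions, not something taken from either answer.

```python
import pandas as pd

def label_clusters(flags, prefix='cluster_'):
    """Return labels like 'cluster_1' for runs of True, and False elsewhere.

    Hypothetical helper; the name, the prefix and the string labels are assumptions.
    """
    flags = pd.Series(flags).astype(bool)
    # same idea as the factorize approach: the counter is constant within a True run
    codes, _ = pd.factorize((~flags).cumsum()[flags])
    # object dtype so the result can hold both False and strings
    labels = pd.Series(False, index=flags.index, dtype=object)
    labels[flags] = [f'{prefix}{code + 1}' for code in codes]
    return labels

# two datasets of different lengths, one of them starting with True
for data in ([False, True, True, False, True],
             [True, False, False, True, True, True]):
    frame = pd.DataFrame({'forms_a_cluster': data})
    frame['cluster_label'] = label_clusters(frame['forms_a_cluster'])
    print(frame)
```

Keeping False for rows outside a cluster mirrors the expected output in the question; a missing value could be used instead if a cleaner column type is preferred.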

huangapple
  • Posted on February 27, 2023, 17:12:35
  • When reposting, please keep the link to this article: https://go.coder-hub.com/75578556.html