pandas: add a grouping variable to clusters of rows that meet a criterion
Question
I have this dataframe:

import pandas as pd

df = pd.DataFrame({
    'forms_a_cluster': [False, False, True, True, True, False, False, False,
                        True, True, False, True, True, True, False],
    'cluster_number': [False, False, 1, 1, 1, False, False, False,
                       2, 2, False, 3, 3, 3, False],
})

The idea is that I have a criterion that marks certain rows as True, and consecutive True rows form a cluster. I want to label each cluster as cluster_1, cluster_2, cluster_3, etc. The cluster_number column shows the hoped-for output. I have no idea how to do this, given that with the real data I have to do it many times on different datasets, which have different numbers of rows and different cluster sizes each time. Do you have any idea how to go about this?
Answer 1
Score: 2
You can use groupby.ngroup on the groups of successive values, pre-filtered to the True rows:

# group by successive values: m increases by 1 each time the value changes
m = df['forms_a_cluster'].ne(df['forms_a_cluster'].shift()).cumsum()

# keep only the groups of True and number them from 1,
# then fill the remaining rows with False
df['cluster_number'] = (m[df['forms_a_cluster']]
                        .groupby(m).ngroup().add(1)
                        .reindex(df.index, fill_value=False)
                       )

Or with arithmetic:

m = df['forms_a_cluster'].ne(df['forms_a_cluster'].shift()).cumsum()

# True runs are every second run, so m // 2 numbers them;
# adding iloc[0] (True counts as 1) corrects the offset when the data starts with True
df['cluster_number'] = (df['forms_a_cluster']
                        .mask(df['forms_a_cluster'],
                              m // 2 + df['forms_a_cluster'].iloc[0])
                       )
Output:
    forms_a_cluster cluster_number
0             False          False
1             False          False
2              True              1
3              True              1
4              True              1
5             False          False
6             False          False
7             False          False
8              True              2
9              True              2
10            False          False
11             True              3
12             True              3
13             True              3
14            False          False
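
To see where these numbers come from, it may help to print the intermediate run counter. The snippet below is only an illustrative trace of the question's example column; the names flags, starts and true_run_ids are mine and do not appear in the answer.

```python
import pandas as pd

# The 'forms_a_cluster' column from the question, as a standalone Series.
flags = pd.Series([False, False, True, True, True, False, False, False,
                   True, True, False, True, True, True, False])

# A new run starts wherever the value differs from the previous row.
# The first row is compared with a missing value, so it always starts a run.
starts = flags.ne(flags.shift())

# The cumulative sum of the run starts gives every run (True or False) its own id.
m = starts.cumsum()
print(m.tolist())
# [1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 5, 6, 6, 6, 7]

# Restricting to the True rows leaves one id per cluster.
true_run_ids = m[flags]
print(true_run_ids.unique().tolist())
# [2, 4, 6]

# Integer division by 2 turns those ids into consecutive cluster numbers.
print((m // 2)[flags].tolist())
# [1, 1, 1, 2, 2, 3, 3, 3]
```

ngroup renumbers the ids 2, 4, 6 as 0, 1, 2, which is why the first variant adds 1. The arithmetic variant relies on True and False runs alternating: when the column starts with True, the True runs get the odd ids 1, 3, 5 instead, and adding iloc[0] makes up the difference.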
Another example, where the input starts with True:
    forms_a_cluster cluster_number
0              True              1
1              True              1
2             False          False
3              True              2
4              True              2
5             False          False
6             False          False
7             False          False
8              True              3
9              True              3
10            False          False
11             True              4
12             True              4
13             True              4
14            False          False
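
Since the question mentions repeating this on many datasets, the ngroup variant could be wrapped in a small helper. This is a sketch, not part of the original answer: the function name label_clusters is hypothetical, and it assumes the input is a boolean Series without missing values.

```python
import pandas as pd


def label_clusters(flags: pd.Series) -> pd.Series:
    """Number each run of consecutive True values 1, 2, 3, ...

    Rows where `flags` is False are filled with False, as in the question.
    Assumes `flags` has boolean dtype and no missing values.
    """
    # Run id: increases by one every time the value changes.
    m = flags.ne(flags.shift()).cumsum()
    # Keep the run ids of the True rows only.
    true_ids = m[flags]
    # ngroup numbers the distinct run ids 0, 1, 2, ... in increasing order.
    numbers = true_ids.groupby(true_ids).ngroup().add(1)
    # Put the False rows back, keeping the original index.
    return numbers.reindex(flags.index, fill_value=False)


first = pd.Series([False, False, True, True, True, False, False, False,
                   True, True, False, True, True, True, False])
second = pd.Series([True, True, False, True, True, False, False, False,
                    True, True, False, True, True, True, False])

print(label_clusters(first).tolist())
print(label_clusters(second).tolist())
```

With a DataFrame this would be used as df['cluster_number'] = label_clusters(df['forms_a_cluster']). The number of rows and the cluster sizes do not matter, because the run ids are derived from the data itself.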
Answer 2
Score: 0
Build a key for the consecutive groups with Series.cumsum on the inverted mask, then use factorize to renumber the groups starting from 1:

# mask
m = df['forms_a_cluster']

# create the column, filled with the original data
df['cluster_number'] = df['forms_a_cluster']

# (~m).cumsum() only increases on False rows, so it is constant within each run of True
df.loc[m, 'cluster_number'] = pd.factorize((~m).cumsum()[m])[0] + 1
Output:

    forms_a_cluster cluster_number
0             False          False
1             False          False
2              True              1
3              True              1
4              True              1
5             False          False
6             False          False
7             False          False
8              True              2
9              True              2
10            False          False
11             True              3
12             True              3
13             True              3
14            False          False
The same code applied to a different input, in which the first row forms a one-row cluster:

m = df['forms_a_cluster']
df['cluster_number'] = df['forms_a_cluster']
df.loc[m, 'cluster_number'] = pd.factorize((~m).cumsum()[m])[0] + 1
print(df)
    forms_a_cluster cluster_number
0              True              1
1             False          False
2              True              2
3              True              2
4              True              2
5             False          False
6             False          False
7             False          False
8              True              3
9              True              3
10            False          False
11             True              4
12             True              4
13             True              4
14            False          False
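
The question asks for labels such as cluster_1 and cluster_2 rather than bare numbers. One possible way to get them from the factorize approach is sketched below. The function name cluster_labels and the prefix parameter are hypothetical, and using None for rows outside any cluster (instead of False) is my assumption, not something the answer specifies.

```python
import pandas as pd


def cluster_labels(flags: pd.Series, prefix: str = "cluster_") -> pd.Series:
    """Return 'cluster_1', 'cluster_2', ... for runs of True, None elsewhere.

    Assumes `flags` has boolean dtype and no missing values.
    """
    # (~flags).cumsum() only increases on False rows, so it is constant
    # within a run of True values; the True rows keep one key per cluster.
    run_key = (~flags).cumsum()[flags]
    # factorize maps the distinct keys to 0, 1, 2, ... in order of appearance.
    codes = pd.factorize(run_key)[0] + 1
    # Start with None everywhere, then fill in the cluster rows.
    labels = pd.Series([None] * len(flags), index=flags.index, dtype="object")
    labels.loc[flags] = [f"{prefix}{code}" for code in codes]
    return labels


flags = pd.Series([True, False, True, True, True, False, False, False,
                   True, True, False, True, True, True, False])
labels = cluster_labels(flags)
print(labels.tolist())
```

Keeping the non-cluster rows as None avoids mixing booleans and strings in one column, which tends to make later filtering with notna() or groupby simpler.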