pandas: group dataframe rows into different clusters
Question
I have this dataframe:
import pandas as pd

df = pd.DataFrame({
    'forms_a_cluster': [False, False, True, True, True, False, False, False,
                        True, True, False, True, True, True, False],
    'cluster_number': [False, False, 1, 1, 1, False, False, False,
                       2, 2, False, 3, 3, 3, False]})
I have a criterion that marks each row that meets it as True in forms_a_cluster; consecutive True rows then form a cluster. I want to label each cluster as cluster_1, cluster_2, cluster_3, etc. The cluster_number column shows the output I am hoping for. I don't know how to do this in general, because in the real data I have to repeat it on many datasets with different numbers of rows, and the cluster sizes differ every time. Do you have any idea how to go about this? Thanks in advance!
Answer 1
Score: 2
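You can use a filtered groupby.ngroup, then reindex to add the False values:
# Assumes df has the 'id' column shown in the output below (one id per run of rows).
df['cluster_number'] = (df[df['forms_a_cluster']]                # keep only the True rows
                        .groupby('id').ngroup().add(1)           # number the groups from 1
                        .reindex(df.index, fill_value=False)     # restore dropped rows as False
                       )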
Output:
id forms_a_cluster cluster_number
0 1 False False
1 1 False False
2 2 True 1
3 2 True 1
4 2 True 1
5 3 False False
6 3 False False
7 3 False False
8 4 True 2
9 4 True 2
10 5 False False
11 6 True 3
12 6 True 3
13 6 True 3
14 7 False False
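The id column in the output above is not part of the DataFrame shown in the question. Its values match a counter that increases every time forms_a_cluster changes, so here is a minimal sketch that rebuilds it that way before applying the same chain. Treating id as such a run counter is an assumption, not something the answer states.

```python
import pandas as pd

df = pd.DataFrame({
    "forms_a_cluster": [False, False, True, True, True, False, False, False,
                        True, True, False, True, True, True, False],
})

# Assumption: 'id' is a run counter that increases whenever the flag changes.
flags = df["forms_a_cluster"]
df["id"] = flags.ne(flags.shift()).cumsum()

# Same filtered groupby / ngroup / reindex chain as in the answer.
df["cluster_number"] = (
    df[df["forms_a_cluster"]]
    .groupby("id").ngroup().add(1)
    .reindex(df.index, fill_value=False)
)
print(df[["id", "forms_a_cluster", "cluster_number"]])
```

Because id is derived from the flag column itself, this should carry over to datasets with any number of rows and any cluster sizes.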
Answer 2
Score: 0
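You can use ngroup to do this:
# Start from an object column so it can hold both False and integer labels
# (recent pandas versions warn when integers are assigned into a bool column).
df["cluster_number"] = pd.Series(False, index=df.index, dtype=object)
df.loc[df["forms_a_cluster"], "cluster_number"] = (
    df.loc[df["forms_a_cluster"]].groupby(["id", "forms_a_cluster"]).ngroup() + 1
)
print(df)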
id forms_a_cluster cluster_number
0 1 False False
1 1 False False
2 2 True 1
3 2 True 1
4 2 True 1
5 3 False False
6 3 False False
7 3 False False
8 4 True 2
9 4 True 2
10 5 False False
11 6 True 3
12 6 True 3
13 6 True 3
14 7 False False
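If no id column is available, the run numbers can also be computed from forms_a_cluster alone. The sketch below is one possible way, not taken from either answer: it marks the first row of each run of True values and takes a running count of those starts. The helper name label_runs and the cluster_label column are hypothetical names chosen for illustration.

```python
import pandas as pd


def label_runs(flags: pd.Series) -> pd.Series:
    """Number runs of consecutive True values 1, 2, 3, ...; other rows get False."""
    flags = flags.astype(bool)
    # A run starts where the flag is True and the previous row is not.
    starts = flags & ~flags.shift(fill_value=False)
    # The running count of starts is the run number; blank it outside runs.
    return starts.cumsum().where(flags, False)


df = pd.DataFrame({
    "forms_a_cluster": [False, False, True, True, True, False, False, False,
                        True, True, False, True, True, True, False],
})
df["cluster_number"] = label_runs(df["forms_a_cluster"])

# Optional string labels such as "cluster_1"; rows outside any run stay missing.
df["cluster_label"] = (
    "cluster_" + df["cluster_number"].astype(str)
).where(df["forms_a_cluster"])
print(df)
```

Since the helper only looks at the boolean column, it can be reused unchanged on each dataset.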