pandas: group dataframe rows into different clusters

Question

I have this dataframe:

import pandas as pd

df = pd.DataFrame({
    'forms_a_cluster': [False, False, True, True, True, False, False, False,
                        True, True, False, True, True, True, False],
    'cluster_number': [False, False, 1, 1, 1, False, False, False,
                       2, 2, False, 3, 3, 3, False]})

The idea is that I have a criterion that marks rows as True when they meet it, and consecutive True rows then form a cluster. I want to label each cluster as cluster_1, cluster_2, cluster_3, etc. The cluster_number column shows the hoped-for output. I have no idea how to do this, though: in the real data I have to do it many times on different datasets, each with a different number of rows, and the cluster sizes differ every time. Do you have any idea how to go about this? Thanks in advance!

Answer 1

Score: 2

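You can use a filtered groupby.ngroup, then reindex to add the False values back. This assumes an id column whose value changes at every run boundary, as in the output below: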

df['cluster_number'] = (df[df['forms_a_cluster']]
                        .groupby('id').ngroup().add(1)
                        .reindex(df.index, fill_value=False)
                        )

Output:

    id  forms_a_cluster cluster_number
0    1            False          False
1    1            False          False
2    2             True              1
3    2             True              1
4    2             True              1
5    3            False          False
6    3            False          False
7    3            False          False
8    4             True              2
9    4             True              2
10   5            False          False
11   6             True              3
12   6             True              3
13   6             True              3
14   7            False          False
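
The answer groups by an id column that the dataframe in the question, as shown here, does not define. A minimal sketch of one way to derive such a column, assuming a new id should start whenever forms_a_cluster changes value from one row to the next (the derivation is an assumption, not something stated in the answer); the answer's expression is then applied unchanged:

```python
import pandas as pd

df = pd.DataFrame({
    'forms_a_cluster': [False, False, True, True, True, False, False, False,
                        True, True, False, True, True, True, False]})

flag = df['forms_a_cluster']

# Assumption: start a new id each time the flag differs from the previous row.
# The first row compares against a missing value, so it always starts id 1.
df['id'] = flag.ne(flag.shift()).cumsum()

# Same expression as in the answer: number the groups among the True rows only,
# then reindex to the full index and fill the remaining rows with False.
df['cluster_number'] = (df[flag]
                        .groupby('id').ngroup().add(1)
                        .reindex(df.index, fill_value=False)
                        )

print(df)
```

Because the id is recomputed from the flag column itself, the same lines should work on datasets with any number of rows and any cluster sizes.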

Answer 2

Score: 0

You can use ngroup to do this:

df["cluster_number"] = False
df.loc[df["forms_a_cluster"], "cluster_number"] = (
    df.loc[df["forms_a_cluster"]].groupby(["id", "forms_a_cluster"]).ngroup() + 1
)

print(df)

    id  forms_a_cluster cluster_number
0    1            False          False
1    1            False          False
2    2             True              1
3    2             True              1
4    2             True              1
5    3            False          False
6    3            False          False
7    3            False          False
8    4             True              2
9    4             True              2
10   5            False          False
11   6             True              3
12   6             True              3
13   6             True              3
14   7            False          False
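
The question also asks for labels of the form cluster_1, cluster_2, and for something reusable across datasets of different sizes. A possible sketch that works from the boolean column alone, without an id column: a cluster starts wherever a True row follows a non-True row, and a running count of those starts gives the cluster number. The helper name `label_clusters` and its `prefix` parameter are hypothetical, chosen here for illustration.

```python
import pandas as pd

def label_clusters(flag: pd.Series, prefix: str = "cluster_") -> pd.Series:
    """Label each run of consecutive True values; rows outside a run get False."""
    flag = flag.astype(bool)
    # Previous row's flag; the first row has no predecessor, so treat it as False.
    previous = flag.shift(fill_value=False).astype(bool)
    # A run starts where the current row is True and the previous one is not.
    starts = flag & ~previous
    # Counting the starts seen so far gives the run number of every row.
    run_number = starts.cumsum()
    return pd.Series(
        [f"{prefix}{n}" if is_true else False
         for is_true, n in zip(flag, run_number)],
        index=flag.index, dtype=object)

df = pd.DataFrame({
    'forms_a_cluster': [False, False, True, True, True, False, False, False,
                        True, True, False, True, True, True, False]})
df['cluster_label'] = label_clusters(df['forms_a_cluster'])

# The same helper applied to a second, smaller dataset.
other = pd.DataFrame({'forms_a_cluster': [True, False, True]})
other['cluster_label'] = label_clusters(other['forms_a_cluster'])

print(df)
print(other)
```

The result is an object column mixing strings and False, mirroring the mixed column in the question; a missing value such as None could be used instead of False if a uniform string column is preferred.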

huangapple
  • Posted on February 27, 2023, 04:47:29
  • When reposting, please keep the link to this article: https://go.coder-hub.com/75574916.html