How to label each group with df.groupby() in Python pandas?
Question
Note: this question is related to an existing question here. However, it provides a more concrete example and has broader impact.
Consider a pandas DataFrame like the following:
   Questions  cnt  similarity
0        ABC    1   [1, 2, 3]
1        abc    2   [1, 2, 3]
2        cba    3   [2, 3, 1]
3       abcd    4   [4, 5, 6]
4       dcsa    5   [2, 3, 1]
5       adcd    6   [4, 5, 6]
6       abcd    7   [1, 2, 3]
7        cba    8   [7, 8, 9]
I need to add another column called cat based on the similarity column. If two rows have the same similarity value, they should be placed in the same group. Below is the expected output. Any input is valuable. It is worth mentioning that the original dataset has 1M rows. Thank you.
   Questions  cnt  similarity  cat
0        ABC    1   [1, 2, 3]    1
1        abc    2   [1, 2, 3]    1
2        cba    3   [2, 3, 1]    2
3       abcd    4   [4, 5, 6]    3
4       dcsa    5   [2, 3, 1]    2
5       adcd    6   [4, 5, 6]    3
6       abcd    7   [1, 2, 3]    1
7        cba    8   [7, 8, 9]    4
Answer 1
Score: 3
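IIUC, you can use pd.factorize, which numbers the distinct values in order of first appearance (casting to str makes list values hashable):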
df["cat"] = pd.factorize(df["similarity"].astype(str))[0] + 1
Output:
print(df)
   Questions  cnt  similarity  cat
0        ABC    1   [1, 2, 3]    1
1        abc    2   [1, 2, 3]    1
2        cba    3   [2, 3, 1]    2
3       abcd    4   [4, 5, 6]    3
4       dcsa    5   [2, 3, 1]    2
5       adcd    6   [4, 5, 6]    3
6       abcd    7   [1, 2, 3]    1
7        cba    8   [7, 8, 9]    4
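To reproduce this end to end, here is a minimal, self-contained sketch. The DataFrame construction is an assumption: the question only shows printed output, so the similarity column is taken here to hold Python lists. The factorize call itself is the one from the answer.

```python
import pandas as pd

# Assumed reconstruction of the sample data; "similarity" holds Python lists.
df = pd.DataFrame({
    "Questions": ["ABC", "abc", "cba", "abcd", "dcsa", "adcd", "abcd", "cba"],
    "cnt": [1, 2, 3, 4, 5, 6, 7, 8],
    "similarity": [[1, 2, 3], [1, 2, 3], [2, 3, 1], [4, 5, 6],
                   [2, 3, 1], [4, 5, 6], [1, 2, 3], [7, 8, 9]],
})

# Lists are unhashable, so cast each one to its string form first.
keys = df["similarity"].astype(str)

# factorize returns (codes, uniques); codes start at 0 and follow
# the order in which each distinct value first appears.
codes, uniques = pd.factorize(keys)
df["cat"] = codes + 1  # shift to 1-based labels, as in the expected output

print(df["cat"].tolist())  # [1, 1, 2, 3, 2, 3, 1, 4]
```

Because the labels follow first appearance, no sort of the keys is involved in assigning them.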
Answer 2
Score: 2
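One way is to use groupby.ngroup() (this assumes the similarity values are hashable, e.g. strings, since grouping directly on Python lists raises a TypeError):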
df['cat'] = df.groupby('similarity').ngroup() + 1
   Questions  cnt  similarity  cat
0        ABC    1   [1, 2, 3]    1
1        abc    2   [1, 2, 3]    1
2        cba    3   [2, 3, 1]    2
3       abcd    4   [4, 5, 6]    3
4       dcsa    5   [2, 3, 1]    2
5       adcd    6   [4, 5, 6]    3
6       abcd    7   [1, 2, 3]    1
7        cba    8   [7, 8, 9]    4
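If the similarity column holds actual Python lists rather than strings, a hedged sketch of one workaround is to group on a string version of the column. Passing sort=False numbers the groups by first appearance instead of by sorted key; for this sample both orders happen to coincide. The sample data below is an assumed reconstruction of the question's frame.

```python
import pandas as pd

# Assumed reconstruction of the sample data; "similarity" holds Python lists.
df = pd.DataFrame({
    "Questions": ["ABC", "abc", "cba", "abcd", "dcsa", "adcd", "abcd", "cba"],
    "cnt": [1, 2, 3, 4, 5, 6, 7, 8],
    "similarity": [[1, 2, 3], [1, 2, 3], [2, 3, 1], [4, 5, 6],
                   [2, 3, 1], [4, 5, 6], [1, 2, 3], [7, 8, 9]],
})

# Build a hashable grouping key from the list-valued column.
key = df["similarity"].astype(str)

# ngroup() gives each group a 0-based number; add 1 for 1-based labels.
df["cat"] = df.groupby(key, sort=False).ngroup() + 1

print(df["cat"].tolist())  # [1, 1, 2, 3, 2, 3, 1, 4]
```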