英文:
How to add the sequential numbers if certain condition met in Pandas
问题
需要根据特定条件增加顺序号,否则需要保持上一个数字。
原始数据集:
ID | Name | Status | Cluster | Gap |
---|---|---|---|---|
1 | A | 0 | 1 | 15 |
1 | B | 1 | 1 | 35 |
1 | C | 1 | 1 | 03 |
2 | B | 0 | 1 | 26 |
2 | C | 0 | 1 | 16 |
3 | A | 1 | 1 | 65 |
3 | C | 0 | 1 | 89 |
3 | F | 0 | 1 | 19 |
所需数据集:
ID | Name | Status | Cluster | Gap |
---|---|---|---|---|
1 | A | 0 | 1 | 15 |
1 | B | 1 | 2 | 35 |
1 | C | 1 | 3 | 03 |
2 | B | 0 | 1 | 26 |
2 | C | 0 | 1 | 16 |
3 | A | 1 | 1 | 65 |
3 | C | 0 | 2 | 89 |
3 | F | 0 | 2 | 19 |
英文:
Need to increase the sequential numbers if certain condition is met or else need to hold the previous number.
Original_dataset:
ID | Name | Status | Cluster | Gap |
---|---|---|---|---|
1 | A | 0 | 1 | 15 |
1 | B | 1 | 1 | 35 |
1 | C | 1 | 1 | 03 |
2 | B | 0 | 1 | 26 |
2 | C | 0 | 1 | 16 |
3 | A | 1 | 1 | 65 |
3 | C | 0 | 1 | 89 |
3 | F | 0 | 1 | 19 |
Required_Dataset:
ID | Name | Status | Cluster | Gap |
---|---|---|---|---|
1 | A | 0 | 1 | 15 |
1 | B | 1 | 2 | 35 |
1 | C | 1 | 3 | 03 |
2 | B | 0 | 1 | 26 |
2 | C | 0 | 1 | 16 |
3 | A | 1 | 1 | 65 |
3 | C | 0 | 2 | 89 |
3 | F | 0 | 2 | 19 |
Conditions:
- For first occurrence of ID, the cluster should be 1.
- If status = 1 or Gap > 28, then cluster needs to increase by 1 based on patient ID (see row 1-C and 2-B --- as ID changes, the Cluster remains 1 as it is the first occurrence of the particular ID).
- If the condition is not satisfied, it needs to hold the previous cluster number. (Can refer the final row).
The code which I have tried is:
Original_dataset.loc[((new_df4['gap'] > 28) | (Original_dataset['status'] == 1)),'Cluster'] = Original_dataset['Cluster'] + 1
答案1
得分: 2
首先,如果Gap
大于28
,或者Status
为1
,或者在DataFrame.loc
中的ID
的第一个重复值之前,将Cluster1
设置为1
。然后使用lambda函数结合GroupBy.cumsum
和GroupBy.ffill
。
m = df['Gap'].gt(28)
m1 = df['Status'].eq(1)
m2 = ~df['ID'].duplicated()
df.loc[m | m1 | m2, 'Cluster1'] = 1
f = lambda x: x.cumsum().ffill(downcast='int')
df['Cluster1'] = df.groupby('ID')['Cluster1'].transform(f)
print(df)
ID Name Status Cluster Gap Cluster1
0 1 A 0 1 15 1
1 1 B 1 1 35 2
2 1 C 1 1 3 3
3 2 B 0 1 26 1
4 2 C 0 1 16 1
5 3 A 1 1 65 1
6 3 C 0 1 89 2
7 3 F 0 1 19 2
英文:
First set 1
if greater GAP like 28
or if Status
is 1
or first duplicated value of ID
in DataFrame.loc
, then use lambda function with GroupBy.cumsum
and GroupBy.ffill
m = df['Gap'].gt(28)
m1 = df['Status'].eq(1)
m2 = ~df['ID'].duplicated()
df.loc[m | m1 | m2, 'Cluster1'] = 1
f = lambda x: x.cumsum().ffill(downcast='int')
df['Cluster1'] = df.groupby('ID')['Cluster1'].transform(f)
print (df)
ID Name Status Cluster Gap Cluster1
0 1 A 0 1 15 1
1 1 B 1 1 35 2
2 1 C 1 1 3 3
3 2 B 0 1 26 1
4 2 C 0 1 16 1
5 3 A 1 1 65 1
6 3 C 0 1 89 2
7 3 F 0 1 19 2
答案2
得分: 0
您可以根据3个条件使用掩码,并使用简单的[`groupby.cumsum`](http://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.core.groupby.GroupBy.cumsum.html):
是否 Gap≥28 或者 Status==1 或者第一个 ID?
m = df['Gap'].ge(28) | df['Status'].eq(1) | ~df['ID'].duplicated()
然后按组递增计数
df['Cluster'] = m.groupby(df['ID']).cumsum()
输出:
ID 名称 状态 集群 差距
0 1 A 0 1 15
1 1 B 1 2 35
2 1 C 1 3 3
3 2 B 0 1 26
4 2 C 0 1 16
5 3 A 1 1 65
6 3 C 0 2 89
7 3 F 0 2 19
<details>
<summary>英文:</summary>
You can use a mask based on 3 conditions, and a simple [`groupby.cumsum`](http://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.core.groupby.GroupBy.cumsum.html):
is Gap≥28 OR Status==1 OR first ID ?
m = df['Gap'].ge(28) | df['Status'].eq(1) | ~df['ID'].duplicated()
then increment count per group
df['Cluster'] = m.groupby(df['ID']).cumsum()
Output:
ID Name Status Cluster Gap
0 1 A 0 1 15
1 1 B 1 2 35
2 1 C 1 3 3
3 2 B 0 1 26
4 2 C 0 1 16
5 3 A 1 1 65
6 3 C 0 2 89
7 3 F 0 2 19
</details>
通过集体智慧和协作来改善编程学习和解决问题的方式。致力于成为全球开发者共同参与的知识库,让每个人都能够通过互相帮助和分享经验来进步。
评论