Extracting values after a split to create a new column with a yes or no in Python
Question
sampleID | comorbidities |
---|---|
P01 | hypertension, diabetes |
P02 | hypertension, diabetes |
P03 | diabetes |
P04 | CHD, asthma |
P05 | asthma, hypertension |
Hello, I am new to coding and am currently doing some data cleaning in Python. I am trying to break apart my data so that I can analyze it more easily. A few of my columns hold several values in a single string. For example, one column lists a patient's comorbidities, and some patients have more than one. I want to split those strings so that each comorbidity gets its own column with a simple yes/no or 1/0 for each patient. I am unable to post pictures, so I recreated the tables.
Currently I have one column that has multiple strings contained within it. I split the column using:
df1 = pd.concat((df, df['comorbidities'].str.split(',', expand = True)), axis = 1, ignore_index = True)
The resulting dataframe looks like this:
0 | 1 | 2 | 3 |
---|---|---|---|
P01 | hypertension, diabetes | hypertension | diabetes |
P02 | hypertension, diabetes | hypertension | diabetes |
P03 | diabetes | diabetes | None |
P04 | CHD, asthma | CHD | asthma |
P05 | asthma, hypertension | asthma | hypertension |
After this, I want to turn the split strings into new columns containing either yes/no or 1/0, so that each sample shows whether it has a given comorbidity. Any suggestions on how to do this? I have tried groupby on just one column and on all the columns, and it does not work. I can't share the actual data, but I created a dummy dataset with an example; the output I want is below.
sampleID | comorbidities | hypertension | diabetes | CHD | asthma |
---|---|---|---|---|---|
P01 | hypertension, diabetes | yes | yes | no | no |
P02 | hypertension, diabetes | yes | yes | no | no |
P03 | diabetes | no | yes | no | no |
P04 | CHD, asthma | no | no | yes | yes |
P05 | asthma, hypertension | yes | no | no | yes |
For example, what I am trying to do is take hypertension and create a new column with the name hypertension, and a simple yes/no or 1/0 for each sampleID. Any suggestions would be greatly appreciated!
Answer 1
Score: 0
Use str.get_dummies combined with replace and join:
out = df.join(df['comorbidities'].str.get_dummies(', ')
.replace({0: 'no', 1: 'yes'}))
Output:
sampleID comorbidities CHD asthma diabetes hypertension
0 P01 hypertension, diabetes no no yes yes
1 P02 hypertension, diabetes no no yes yes
2 P03 diabetes no no yes no
3 P04 CHD, asthma yes yes no no
4 P05 asthma, hypertension no yes no yes
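For anyone who wants to reproduce this end to end, here is a minimal sketch that rebuilds the dummy table from the question and applies the same approach. The variable names `flags`, `out_numeric`, and `out_labels` are only illustrative and are not from the original post.

```python
import pandas as pd

# Dummy data recreated from the table in the question.
df = pd.DataFrame({
    "sampleID": ["P01", "P02", "P03", "P04", "P05"],
    "comorbidities": [
        "hypertension, diabetes",
        "hypertension, diabetes",
        "diabetes",
        "CHD, asthma",
        "asthma, hypertension",
    ],
})

# One 0/1 indicator column per distinct value; the separator is ", "
# (comma followed by a space), matching how the strings are written.
flags = df["comorbidities"].str.get_dummies(sep=", ")

# Keep the 0/1 columns for analysis ...
out_numeric = df.join(flags)

# ... or map them to yes/no labels, as in the answer above.
out_labels = df.join(flags.replace({0: "no", 1: "yes"}))

print(out_labels)
```

Note that the separator is matched literally, so splitting on "," alone would leave a leading space on the second value and produce separate columns such as " diabetes" and "diabetes". If the real data has inconsistent spacing, it is probably worth normalizing the strings before calling get_dummies.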