Extracting values after a split to create a new column with a yes or no in Python

Question

sampleID  comorbidities
P01       hypertension, diabetes
P02       hypertension, diabetes
P03       diabetes
P04       CHD, asthma
P05       asthma, hypertension

Hello, I am new to coding and am doing some data cleaning in Python. I am trying to break my data apart so that I can analyze it more easily. A few of my columns hold multiple values in a single string. For example, one column lists a patient's comorbidities, and some patients have several comorbidities in that one cell. I want to split these strings so that each comorbidity gets its own column with a simple yes/no or 1/0 for each patient. I am unable to post pictures, so I recreated the tables.

Currently I have one column that contains multiple strings. I split the column using:

df1 = pd.concat((df, df['comorbidities'].str.split(',', expand=True)), axis=1, ignore_index=True)

The resulting dataframe looks like this:

0    1                       2             3
P01  hypertension, diabetes  hypertension  diabetes
P02  hypertension, diabetes  hypertension  diabetes
P03  diabetes                diabetes      None
P04  CHD, asthma             CHD           asthma
P05  asthma, hypertension    asthma        hypertension
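
For reference, the integer column labels 0 to 3 above come from ignore_index=True, which makes concat discard the original column names, and splitting on ',' alone leaves a leading space on every piece after the first. Below is a minimal sketch of the same split that keeps the names, assuming the dummy data from this question; the "comorbidity_" prefix is just an illustrative name, not something from the original post.

```python
import pandas as pd

# Dummy data recreated from the question.
df = pd.DataFrame({
    "sampleID": ["P01", "P02", "P03", "P04", "P05"],
    "comorbidities": [
        "hypertension, diabetes",
        "hypertension, diabetes",
        "diabetes",
        "CHD, asthma",
        "asthma, hypertension",
    ],
})

# Split on ", " (comma plus space) so the pieces carry no leading space.
parts = df["comorbidities"].str.split(", ", expand=True)

# Give the split columns readable names (hypothetical prefix) and
# concatenate without ignore_index, so the original labels survive.
df1 = pd.concat([df, parts.add_prefix("comorbidity_")], axis=1)

print(df1.columns.tolist())
```

Rows with fewer comorbidities than the maximum (P03 here) still get a missing value in the extra column, so this layout alone does not give one yes/no column per condition.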

After this, I want to turn the split strings into new columns containing either yes/no or 1/0, so that each sample tells me whether the patient has that condition. Any suggestions on how to do this? I have tried groupby on just one column and on all the columns, and neither works. I can't share the actual data, but I created a dummy dataset; the output I want is below.

sampleID  comorbidities           hypertension  diabetes  CHD  asthma
P01       hypertension, diabetes  yes           yes       no   no
P02       hypertension, diabetes  yes           yes       no   no
P03       diabetes                no            yes       no   no
P04       CHD, asthma             no            no        yes  yes
P05       asthma, hypertension    yes           no        no   yes

For example, I want to take hypertension and create a new column named hypertension, with a simple yes/no or 1/0 for each sampleID. Any suggestions would be greatly appreciated!

Answer 1

Score: 0

Use str.get_dummies combined with replace and join:

out = df.join(df['comorbidities'].str.get_dummies(', ')
                                 .replace({0: 'no', 1: 'yes'}))

Output:

  sampleID           comorbidities  CHD asthma diabetes hypertension
0      P01  hypertension, diabetes   no     no      yes          yes
1      P02  hypertension, diabetes   no     no      yes          yes
2      P03                diabetes   no     no      yes           no
3      P04             CHD, asthma  yes    yes       no           no
4      P05    asthma, hypertension   no    yes       no          yes
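
A self-contained version of the same idea may help if you want to run it end to end. This is only a sketch using the dummy data from the question. It also shows two optional variations that are not part of the answer above: keeping the raw 1/0 indicators, and putting the new columns in order of first appearance (str.get_dummies returns them sorted alphabetically, which is why CHD comes first in the output above).

```python
import pandas as pd

# Dummy data recreated from the question.
df = pd.DataFrame({
    "sampleID": ["P01", "P02", "P03", "P04", "P05"],
    "comorbidities": [
        "hypertension, diabetes",
        "hypertension, diabetes",
        "diabetes",
        "CHD, asthma",
        "asthma, hypertension",
    ],
})

# One 0/1 indicator column per distinct comorbidity.
dummies = df["comorbidities"].str.get_dummies(sep=", ")

# Optional: reorder the indicator columns by first appearance in the data
# instead of the alphabetical order that get_dummies produces.
order = df["comorbidities"].str.split(", ").explode().unique().tolist()
dummies = dummies[order]

# Variant 1: keep the numeric 1/0 flags.
out_numeric = df.join(dummies)

# Variant 2: map the flags to 'yes'/'no', as in the answer.
out_labels = df.join(dummies.replace({0: "no", 1: "yes"}))

print(out_labels)
```

The 1/0 form is usually easier to work with afterwards (sums, means, filtering), while the yes/no form matches the table requested in the question.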

Posted on March 31, 2023 at 22:56:10
Source: https://go.coder-hub.com/75899958.html