How do I get columns that are generated by pandas.get_dummies()?

Question

I have the following dataframe:

>>> df
   n1  n2   dense c1   c2     c3
0   1   4  [1, 4]  a   h1     tt
1   2   5  [2, 5]  b  bbw   ebay
2   3   6  [3, 6]  c   we  yahoo

If I want to create one-hot encoded columns for the c1, c2, and c3 columns:

>>> df_updated = pd.get_dummies(df, prefix_sep='_', dummy_na=True, columns=['c1', 'c2', 'c3'])
>>> df_updated
   n1  n2   dense  c1_a  c1_b  c1_c  c1_nan  c2_bbw  c2_h1  c2_we  c2_nan  c3_ebay  c3_tt  c3_yahoo  c3_nan
0   1   4  [1, 4]     1     0     0       0       0      1      0       0        0      1         0       0
1   2   5  [2, 5]     0     1     0       0       1      0      0       0        1      0         0       0
2   3   6  [3, 6]     0     0     1       0       0      0      1       0        0      0         1       0

But how can I get a list of the columns that were generated by get_dummies()?

For example: ['c1_a', 'c1_b', 'c1_c', 'c1_nan', 'c2_bbw', 'c2_h1', 'c2_we', 'c2_nan', 'c3_ebay', 'c3_tt', 'c3_yahoo', 'c3_nan']

I know one way of doing that is list(set(df_updated.columns) - set(df.columns)), but is there a better way?

Answer 1

Score: 0

One way is to store the names of the columns to be one-hot encoded in a variable and then use filter:

cols, sep = ['c1', 'c2', 'c3'], '_'

df_updated = pd.get_dummies(df, prefix_sep=sep,
                            dummy_na=True, columns=cols)

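# Raw f-string avoids an invalid-escape warning; the non-capturing group
# makes the '^' anchor and the separator apply to every prefix in cols.
df_dum = df_updated.filter(regex=rf'^(?:{"|".join(cols)}){sep}', axis=1)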
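
One thing to watch when a pattern like this is built from cols: an ungrouped alternation such as ^c1|c2_\w+ is parsed as "(^c1) or (c2_\w+)", so the anchor and the separator no longer apply to every prefix. The sketch below illustrates this under assumptions: the column c10 is hypothetical (it is not in the question's data) and stands for an unrelated column that merely shares a prefix, and the names loose and strict are made up for the comparison.

```python
import re

import pandas as pd

# Hypothetical frame: 'c10' is an unrelated column that is NOT one-hot encoded.
df = pd.DataFrame({"c10": [0, 1], "c1": ["a", "b"], "c2": ["h1", "bbw"]})
cols, sep = ["c1", "c2"], "_"

df_updated = pd.get_dummies(df, prefix_sep=sep, columns=cols)

# Ungrouped alternation: '^c1|c2_\w+' means (^c1) OR (c2_\w+).
loose = rf'^{"|".join(cols)}{sep}\w+'

# Grouped and escaped: '^(?:c1|c2)_' requires prefix + separator at the start.
strict = "^(?:" + "|".join(map(re.escape, cols)) + ")" + re.escape(sep)

loose_cols = list(df_updated.filter(regex=loose).columns)
strict_cols = list(df_updated.filter(regex=strict).columns)

print(loose_cols)   # also picks up 'c10', because it starts with 'c1'
print(strict_cols)  # only the generated dummy columns
```

Even the grouped pattern would still match a pre-existing column that happens to start with one of the prefixes followed by the separator, so comparing against the original column names is the more robust check.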

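Or, more simply, use difference. Pass sort=False to keep the column order of df_updated, because by default the result is sorted:

cols_dum = list(df_updated.columns.difference(df.columns, sort=False))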

Output:

print(list(df_dum.columns))  # or print(cols_dum)

['c1_a', 'c1_b', 'c1_c', 'c1_nan', 'c2_bbw', 'c2_h1',
 'c2_we', 'c2_nan', 'c3_ebay', 'c3_tt', 'c3_yahoo', 'c3_nan']
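
The filter and difference results are not necessarily ordered the same way, since Index.difference sorts its result unless sort=False is passed. Here is a minimal, self-contained sketch of that behavior, assuming the question's data with the dense column left out for brevity. The names sorted_cols, ordered_cols, and listcomp_cols are illustrative only.

```python
import pandas as pd

# The question's frame, without the list-valued 'dense' column.
df = pd.DataFrame({
    "n1": [1, 2, 3],
    "n2": [4, 5, 6],
    "c1": ["a", "b", "c"],
    "c2": ["h1", "bbw", "we"],
    "c3": ["tt", "ebay", "yahoo"],
})
cols = ["c1", "c2", "c3"]

df_updated = pd.get_dummies(df, prefix_sep="_", dummy_na=True, columns=cols)

# Default behavior: the difference is returned in sorted order.
sorted_cols = list(df_updated.columns.difference(df.columns))

# sort=False keeps the order in which the columns appear in df_updated.
ordered_cols = list(df_updated.columns.difference(df.columns, sort=False))

# Plain-Python equivalent that also preserves order.
original = set(df.columns)
listcomp_cols = [c for c in df_updated.columns if c not in original]

print(ordered_cols)
print(sorted_cols)
```

The order-preserving variants are usually what you want if the list is later used to select or reorder columns, whereas list(set(...) - set(...)) gives no ordering guarantee at all.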

huangapple
  • Published on February 6, 2023 at 06:07:32
  • When reposting, please keep the link to this article: https://go.coder-hub.com/75355852.html