How do I get columns that are generated by pandas.get_dummies()?
Question
I have the following dataframe:
>>> df
   n1  n2   dense c1   c2     c3
0   1   4  [1, 4]  a   h1     tt
1   2   5  [2, 5]  b  bbw   ebay
2   3   6  [3, 6]  c   we  yahoo
If I want to create one-hot encoded columns for c1, c2, and c3:
>>> df_updated = pd.get_dummies(df, prefix_sep='_', dummy_na=True, columns=['c1', 'c2', 'c3'])
>>> df_updated
   n1  n2   dense  c1_a  c1_b  c1_c  c1_nan  c2_bbw  c2_h1  c2_we  c2_nan  c3_ebay  c3_tt  c3_yahoo  c3_nan
0   1   4  [1, 4]     1     0     0       0       0      1      0       0        0      1         0       0
1   2   5  [2, 5]     0     1     0       0       1      0      0       0        1      0         0       0
2   3   6  [3, 6]     0     0     1       0       0      0      1       0        0      0         1       0
But how can I get a list of the columns generated by get_dummies()?
For example: ['c1_a', 'c1_b', 'c1_c', 'c1_nan', 'c2_bbw', 'c2_h1', 'c2_we', 'c2_nan', 'c3_ebay', 'c3_tt', 'c3_yahoo', 'c3_nan']
I know one way of doing that is list(set(df_updated.columns) - set(df.columns)), but is there a better way?
Answer 1
Score: 0
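One way is to store the names of the columns to encode in a variable, then select the generated columns with filter:
cols, sep = ['c1', 'c2', 'c3'], '_'
df_updated = pd.get_dummies(df, prefix_sep=sep,
                            dummy_na=True, columns=cols)
# Raw f-string; the prefixes are grouped so that ^ and the separator apply to every alternative
df_dum = df_updated.filter(regex=rf'^(?:{"|".join(cols)}){sep}\w+', axis=1)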
Or, more simply, use difference:
# sort=False keeps the column order of df_updated (difference sorts its result by default)
cols_dum = list(df_updated.columns.difference(df.columns, sort=False))
Output:
print(list(df_dum.columns))  # or print(cols_dum)
['c1_a', 'c1_b', 'c1_c', 'c1_nan', 'c2_bbw', 'c2_h1',
 'c2_we', 'c2_nan', 'c3_ebay', 'c3_tt', 'c3_yahoo', 'c3_nan']
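For anyone who wants to try this end to end, here is a minimal, self-contained sketch. It rebuilds the frame from the question and compares three ways of recovering the generated column names. The variable names (new_cols, new_cols_diff, new_cols_regex, prefix_alt) are only illustrative, and the re.escape calls are an added precaution that the post itself does not use, in case a column name or separator contains regex metacharacters.

```python
import re

import pandas as pd

# Rebuild the frame from the question (values copied from the post).
df = pd.DataFrame({
    "n1": [1, 2, 3],
    "n2": [4, 5, 6],
    "dense": [[1, 4], [2, 5], [3, 6]],
    "c1": ["a", "b", "c"],
    "c2": ["h1", "bbw", "we"],
    "c3": ["tt", "ebay", "yahoo"],
})

cols, sep = ["c1", "c2", "c3"], "_"
df_updated = pd.get_dummies(df, prefix_sep=sep, dummy_na=True, columns=cols)

# Option 1: keep every column that was not in the original frame.
# Iterating over df_updated.columns preserves the output order,
# which a plain set difference does not guarantee.
original = set(df.columns)
new_cols = [c for c in df_updated.columns if c not in original]

# Option 2: Index.difference with sort=False also keeps the original order.
new_cols_diff = list(df_updated.columns.difference(df.columns, sort=False))

# Option 3: match on the known prefixes. The alternation is wrapped in a
# non-capturing group so the ^ anchor and the separator apply to each prefix.
prefix_alt = "|".join(re.escape(c) for c in cols)
pattern = rf"^(?:{prefix_alt}){re.escape(sep)}"
new_cols_regex = list(df_updated.filter(regex=pattern).columns)

print(new_cols)
```

All three lists should come out identical here. The difference-based options assume none of the generated names collides with an existing column, and the regex option assumes no untouched column happens to start with one of the prefixes followed by the separator.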