How do I get columns that are generated by pandas.get_dummies()?

Question

I have the following dataframe:

    >>> df
       n1  n2   dense c1   c2     c3
    0   1   4  [1, 4]  a   h1     tt
    1   2   5  [2, 5]  b  bbw   ebay
    2   3   6  [3, 6]  c   we  yahoo

Suppose I want to create one-hot encoded columns for the c1, c2, and c3 columns:

    >>> df_updated = pd.get_dummies(df, prefix_sep='_', dummy_na=True, columns=['c1', 'c2', 'c3'])
    >>> df_updated
       n1  n2   dense  c1_a  c1_b  c1_c  c1_nan  c2_bbw  c2_h1  c2_we  c2_nan  c3_ebay  c3_tt  c3_yahoo  c3_nan
    0   1   4  [1, 4]     1     0     0       0       0      1      0       0        0      1         0       0
    1   2   5  [2, 5]     0     1     0       0       1      0      0       0        1      0         0       0
    2   3   6  [3, 6]     0     0     1       0       0      0      1       0        0      0         1       0

But how can I get a list of the columns that were generated by get_dummies()?

For example: ['c1_a', 'c1_b', 'c1_c', 'c1_nan', 'c2_bbw', 'c2_h1', 'c2_we', 'c2_nan', 'c3_ebay', 'c3_tt', 'c3_yahoo', 'c3_nan']

I know one way is list(set(df_updated.columns) - set(df.columns)), but is there a better way?
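For anyone who wants to reproduce this, here is a minimal sketch. The frame construction is an assumption reconstructed from the printed output above, and the names `unordered`, `original`, and `new_cols` are introduced here only for illustration. It also hints at why the set difference is awkward: a set carries no guaranteed order, whereas filtering df_updated.columns keeps the order that get_dummies() produced.

```python
import pandas as pd

# Assumed reconstruction of the frame printed in the question.
df = pd.DataFrame({
    "n1": [1, 2, 3],
    "n2": [4, 5, 6],
    "dense": [[1, 4], [2, 5], [3, 6]],
    "c1": ["a", "b", "c"],
    "c2": ["h1", "bbw", "we"],
    "c3": ["tt", "ebay", "yahoo"],
})

df_updated = pd.get_dummies(
    df, prefix_sep="_", dummy_na=True, columns=["c1", "c2", "c3"]
)

# Set difference: the right members, but in no guaranteed order.
unordered = list(set(df_updated.columns) - set(df.columns))

# Order-preserving variant: keep the columns absent from the original frame.
original = set(df.columns)
new_cols = [c for c in df_updated.columns if c not in original]
print(new_cols)
```

Both variants rely on the generated names not colliding with names already present in df.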

Answer 1

Score: 0

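One way is to store the names of the columns to be encoded in a variable and then use filter:

    cols, sep = ['c1', 'c2', 'c3'], '_'
    df_updated = pd.get_dummies(df, prefix_sep=sep,
                                dummy_na=True, columns=cols)
    # Raw f-string avoids an invalid-escape warning for \w; the (?:...) group
    # makes ^ and the separator apply to every prefix, not just the first/last.
    df_dum = df_updated.filter(regex=rf'^(?:{"|".join(cols)}){sep}\w+', axis=1)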

Or, more simply, use difference. Pass sort=False to keep the original column order, since by default the result is sorted:

    cols_dum = list(df_updated.columns.difference(df.columns, sort=False))

Output:

    print(list(df_dum.columns))  # or print(cols_dum)
    ['c1_a', 'c1_b', 'c1_c', 'c1_nan', 'c2_bbw', 'c2_h1',
     'c2_we', 'c2_nan', 'c3_ebay', 'c3_tt', 'c3_yahoo', 'c3_nan']
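Another option, not part of the answer above and offered only as a sketch, is to encode just the selected columns on their own, so the generated names are simply the columns of that intermediate frame, and then join the result back. The frame construction and the names `dummies` and `via_difference` are assumptions made for illustration.

```python
import pandas as pd

# Assumed reconstruction of the frame from the question.
df = pd.DataFrame({
    "n1": [1, 2, 3],
    "n2": [4, 5, 6],
    "dense": [[1, 4], [2, 5], [3, 6]],
    "c1": ["a", "b", "c"],
    "c2": ["h1", "bbw", "we"],
    "c3": ["tt", "ebay", "yahoo"],
})

cols, sep = ["c1", "c2", "c3"], "_"

# Encode only the selected columns; every column of `dummies` is generated.
dummies = pd.get_dummies(df[cols], prefix_sep=sep, dummy_na=True, columns=cols)
cols_dum = list(dummies.columns)

# Rebuild the full frame: untouched columns first, then the dummy columns.
df_updated = pd.concat([df.drop(columns=cols), dummies], axis=1)

# For comparison: the difference-based approach with order preserved.
via_difference = list(df_updated.columns.difference(df.columns, sort=False))

print(cols_dum)
```

This avoids comparing column names after the fact, at the cost of one extra concat.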

huangapple
  • Posted on February 6, 2023, 06:07:32
  • When reposting, please keep the link to this post: https://go.coder-hub.com/75355852.html