How can I one-hot-encode multiple columns in R that share categories?

Question

Say I have a dataframe with two columns like this:

Label 1  Label 2
A        B
A        C
B        C
C        A

The values A, B, and C in the first column are the same categories as A, B, and C in the second column. I want the encoding to look like this:

Label 1  Label 2  is_A  is_B  is_C
A        B        1     1     0
A        C        1     0     1
B        C        0     1     1
C        A        1     0     1

Basically, I just want it to check whether a value shows up in either column. If so, code a 1; if not, code a 0.

Now, I know I could write this using an if_else, like this:

library(dplyr)
df <- df %>%
  mutate(is_A = if_else(label1 == 'A' | label2 == 'A', 1, 0),
         is_B = if_else(label1 == 'B' | label2 == 'B', 1, 0),
         is_C = if_else(label1 == 'C' | label2 == 'C', 1, 0))

but I have many different categories and don't want to write out 50+ if_else statements. I've also tried this:

encoded_labels <- model.matrix(~ label1 + label2 - 1, data = df)

but this creates separate encodings for label1A vs. label2A, etc. Is there a simpler way to do this?

Answer 1

Score: 4

In base R, you could try:

cbind(df, unclass(table(row(df), unlist(df))))

  Label_1 Label_2 A B C
1       A       B 1 1 0
2       A       C 1 0 1
3       B       C 0 1 1
4       C       A 1 0 1
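
To see why this works, here is a minimal self-contained sketch. The construction of df is an assumption (only its printed form appears above), and the is_ prefix is added to match the column names asked for in the question:

```r
# Assumed reconstruction of the example data; columns are plain character vectors.
df <- data.frame(Label_1 = c("A", "A", "B", "C"),
                 Label_2 = c("B", "C", "C", "A"),
                 stringsAsFactors = FALSE)

# row(df) gives the row index of every cell; unlist(df) gives every cell's value
# in the same column-by-column order, so table() counts each value per row.
counts <- unclass(table(row(df), unlist(df)))

# Prefix the new columns to match the names used in the question.
colnames(counts) <- paste0("is_", colnames(counts))

result <- cbind(df, counts)
print(result)
```

Because the categories are discovered from the data, this scales to 50+ categories without writing one expression per category.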

Another way:

cbind(df, +sapply(unique(unlist(df)), grepl, do.call(paste, df)))

  Label_1 Label_2 A B C
1       A       B 1 1 0
2       A       C 1 0 1
3       B       C 0 1 1
4       C       A 1 0 1
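
One caveat with the grepl version: grepl treats each category as a regular expression and matches substrings, so a category such as "A" would also match a row that contains "AB". If labels can overlap like that, an exact comparison may be safer. A minimal sketch, assuming character columns; the data and the names cats, exact, and fuzzy are invented for illustration:

```r
# Hypothetical data where one label ("A") is a substring of another ("AB").
df <- data.frame(Label_1 = c("A", "AB", "B"),
                 Label_2 = c("B", "C", "C"),
                 stringsAsFactors = FALSE)

cats <- sort(unique(unlist(df)))

# For each category, compare every cell exactly and flag rows with any match.
exact <- +sapply(cats, function(v) rowSums(df == v) > 0)

# The grepl approach, for comparison: "A" also matches the row holding "AB".
fuzzy <- +sapply(cats, grepl, do.call(paste, df))

print(cbind(df, exact))
```

With the single-letter labels from the question, both versions give the same result.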

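Note that with the table approach you should really use:

+unclass(table(row(df), unlist(df)) > 0)

This accounts for rows in which the same value appears in more than one column: the raw table counts would be 2 there, and the comparison turns them back into 0/1 indicators.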
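
To illustrate why the > 0 comparison matters, consider a row where both columns hold the same value. The data below is invented for the example:

```r
# Hypothetical data: the first row has "A" in both columns.
df <- data.frame(Label_1 = c("A", "B"),
                 Label_2 = c("A", "C"),
                 stringsAsFactors = FALSE)

counts <- unclass(table(row(df), unlist(df)))  # raw counts per row and value
flags  <- +(counts > 0)                        # collapse counts to 0/1 indicators

print(counts)
print(flags)
```

Here counts holds 2 for A in the first row, while flags holds 1, which is what a one-hot style encoding expects.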

If you want to use model.matrix:

+Reduce("|", split(data.frame(model.matrix(~values+0, stack(df))), col(df)))
  valuesA valuesB valuesC
1       1       1       0
2       1       0       1
3       0       1       1
4       1       0       1
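
The column names here carry the values prefix that comes from stack(). If the is_A style from the question is wanted, they can be renamed before binding the result to df. A sketch, assuming character columns and the example data reconstructed as below; the intermediate names long, mm, and enc are illustrative:

```r
# Assumed reconstruction of the example data.
df <- data.frame(Label_1 = c("A", "A", "B", "C"),
                 Label_2 = c("B", "C", "C", "A"),
                 stringsAsFactors = FALSE)

long <- stack(df)                                    # long format: values, ind
mm   <- data.frame(model.matrix(~ values + 0, long)) # one indicator per category
enc  <- +Reduce("|", split(mm, col(df)))             # OR the per-column blocks

# Replace the "values" prefix with "is_" and attach to the original data.
colnames(enc) <- sub("^values", "is_", colnames(enc))
result <- cbind(df, enc)
print(result)
```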

huangapple
  • Posted on May 25, 2023, 23:44:53
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76334116.html