How can I one-hot-encode multiple columns in R that share categories?
Question
Say I have a dataframe with two columns like this:
Label 1 | Label 2 |
---|---|
A | B |
A | C |
B | C |
C | A |
Both columns draw from the same set of values (A, B, and C). I want the encoding to look like this:
Label 1 | Label 2 | is_A | is_B | is_C |
---|---|---|---|---|
A | B | 1 | 1 | 0 |
A | C | 1 | 0 | 1 |
B | C | 0 | 1 | 1 |
C | A | 1 | 0 | 1 |
Basically, I just want it to check whether a value shows up in either column. If so, code a 1; if not, code a 0.
Now, I know I could write this using an if_else, like this:
df <- df %>% mutate(is_A = if_else(label1 == 'A' | label2 == 'A', 1, 0),
is_B = if_else(label1 == 'B' | label2 == 'B', 1, 0),
is_C = if_else(label1 == 'C' | label2 == 'C', 1, 0))
but I have many different categories and don't want to write out 50+ if_else statements. I've also tried this:
encoded_labels <- model.matrix(~ label1 + label2 - 1, data = df)
but this creates separate encodings for label1A vs. label2A, etc. Is there a simpler way to do this?
Answer 1
Score: 4
In base R, you could try:
cbind(df, unclass(table(row(df), unlist(df))))
Label_1 Label_2 A B C
1 A B 1 1 0
2 A C 1 0 1
3 B C 0 1 1
4 C A 1 0 1
Another way:
cbind(df, +sapply(unique(unlist(df)), grepl, do.call(paste, df)))
Label_1 Label_2 A B C
1 A B 1 1 0
2 A C 1 0 1
3 B C 0 1 1
4 C A 1 0 1
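One caveat with the grepl version: grepl does pattern (substring) matching, so with many categories a label such as "A" would also match a row containing "AB". A minimal sketch of an exact-match alternative, assuming the columns are plain character vectors with no NA values (the names lv, ind, and result are only illustrative):

```r
# Example data from the question; column names follow the output shown above.
df <- data.frame(Label_1 = c("A", "A", "B", "C"),
                 Label_2 = c("B", "C", "C", "A"),
                 stringsAsFactors = FALSE)

# All distinct categories across both columns, sorted.
lv <- sort(unique(unlist(df, use.names = FALSE)))

# For each category, flag the rows where any column equals it exactly.
# df == v is a logical matrix; rowSums counts the matching columns per row.
ind <- sapply(lv, function(v) as.integer(rowSums(df == v) > 0))
colnames(ind) <- paste0("is_", lv)

result <- cbind(df, ind)
result
```

Because the comparison uses == rather than a regular expression, labels containing regex metacharacters (for example "." or "+") are also treated literally.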
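Note that the table approach returns counts, so a value that appears more than once in a row yields a number greater than 1. To get 0/1 indicators, use:
+unclass(table(row(df), unlist(df))>0)
This handles rows in which the same value appears in more than one column.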
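To illustrate the difference, here is a small sketch on hypothetical data (df2 is not from the question) in which the first row holds "A" in both columns:

```r
# Hypothetical data: row 1 has "A" in both columns.
df2 <- data.frame(Label_1 = c("A", "B"),
                  Label_2 = c("A", "C"),
                  stringsAsFactors = FALSE)

# Plain table(): each cell is the number of times the value occurs in that row.
counts <- unclass(table(row(df2), unlist(df2)))

# Comparing with > 0 first turns counts into TRUE/FALSE; unary + makes them 0/1.
flags <- +unclass(table(row(df2), unlist(df2)) > 0)

counts[1, "A"]  # 2, because "A" is counted once per column
flags[1, "A"]   # 1
```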
If you want to use model.matrix:
+Reduce("|", split(data.frame(model.matrix(~values+0, stack(df))), col(df)))
valuesA valuesB valuesC
1 1 1 0
2 1 0 1
3 0 1 1
4 1 0 1
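The one-liner above can be easier to follow when split into steps. This is only an unpacked sketch of the same expression, assuming character columns; the intermediate names long, mm, parts, and out are illustrative:

```r
df <- data.frame(Label_1 = c("A", "A", "B", "C"),
                 Label_2 = c("B", "C", "C", "A"),
                 stringsAsFactors = FALSE)

# 1. Stack both columns into one long column called `values`
#    (rows 1-4 come from Label_1, rows 5-8 from Label_2).
long <- stack(df)

# 2. Dummy-code the single stacked column; `+ 0` drops the intercept so every
#    category gets its own column (valuesA, valuesB, valuesC).
mm <- model.matrix(~ values + 0, long)

# 3. Split the dummy rows back into one block per original column.
parts <- split(data.frame(mm), col(df))

# 4. OR the blocks element-wise, then use unary + to turn TRUE/FALSE into 1/0.
out <- +Reduce("|", parts)
out
```

Stacking before calling model.matrix is what makes the two columns share one set of dummy columns, which avoids the separate label1A and label2A encodings described in the question.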