May 25, 2023, 23:44:53


How can I one-hot-encode multiple columns in R that share categories?

Question


Say I have a dataframe with two columns like this:

Label 1	Label 2
A	B
A	C
B	C
C	A

The values of A, B, and C in the first column are the same values of A, B, and C in the 2nd column. I want the encoding to look like this:

Label 1	Label 2	is_A	is_B	is_C
A	B	1	1	0
A	C	1	0	1
B	C	0	1	1
C	A	1	0	1

Basically, I just want it to check if a value shows up in either column. If so, then code a 1, if not then code a 0.

Now, I know I could write this using an if_else, like this:

df <- df %>%
  mutate(is_A = if_else(label1 == 'A' | label2 == 'A', 1, 0),
         is_B = if_else(label1 == 'B' | label2 == 'B', 1, 0),
         is_C = if_else(label1 == 'C' | label2 == 'C', 1, 0))

but I have many different categories and don't want to write out 50+ if_else statements. I've also tried this:

encoded_labels <- model.matrix(~ label1 + label2 - 1, data = df)

but this creates separate encodings for label1A vs. label2A, etc. Is there a simpler way to do this?

Answer 1

Score: 4


In base R, you could try:

cbind(df, unclass(table(row(df), unlist(df))))
  Label_1 Label_2 A B C
1       A       B 1 1 0
2       A       C 1 0 1
3       B       C 0 1 1
4       C       A 1 0 1

Another way:

cbind(df, +sapply(unique(unlist(df)), grepl, do.call(paste, df)))
  Label_1 Label_2 A B C
1       A       B 1 1 0
2       A       C 1 0 1
3       B       C 0 1 1
4       C       A 1 0 1
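
One caveat with the grepl version: it matches each value as a pattern against the pasted row string, so a category whose name is contained in another (say A and AB) would be flagged where it does not occur. Below is a minimal sketch of an exact-comparison alternative. The data frame and its category names are hypothetical, chosen only to trigger the problem:

```r
# Hypothetical data where one category name ("A") is a substring of another ("AB")
df <- data.frame(Label_1 = c("AB", "A", "B"),
                 Label_2 = c("B", "C", "C"),
                 stringsAsFactors = FALSE)

lev <- sort(unique(unlist(df)))

# Pattern matching: "A" also matches inside "AB", so row 1 is wrongly flagged for A
by_pattern <- +sapply(lev, grepl, do.call(paste, df))

# Exact comparison: a row is flagged only if one of its columns equals the value
by_value <- +sapply(lev, function(v) rowSums(df == v) > 0)

by_pattern[1, "A"]  # 1 (false positive)
by_value[1, "A"]    # 0
```

With single-letter categories that never overlap, as in the question, both versions give the same result.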

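Note that table returns counts, so a value that appears in more than one column of the same row is counted more than once. To get 0/1 indicators you should do:

+unclass(table(row(df), unlist(df)) > 0)

This handles rows in which the same value appears in several columns.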
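
To see what the > 0 comparison changes, here is a minimal sketch. The fifth row (A, A) is an assumption added for illustration and is not part of the question's data:

```r
# Hypothetical data: the question's four rows plus one row with a repeated value
df <- data.frame(Label_1 = c("A", "A", "B", "C", "A"),
                 Label_2 = c("B", "C", "C", "A", "A"),
                 stringsAsFactors = FALSE)

# Raw counts: how many times each value occurs in each row
counts <- unclass(table(row(df), unlist(df)))

# 0/1 indicators: whether each value occurs at all in each row
flags <- +(counts > 0)

counts[5, "A"]  # 2, because A appears in both columns of row 5
flags[5, "A"]   # 1

cbind(df, flags)
```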

If you want to use model.matrix:

+Reduce("|", split(data.frame(model.matrix(~values+0, stack(df))), col(df)))
  valuesA valuesB valuesC
1       1       1       0
2       1       0       1
3       0       1       1
4       1       0       1
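
The outputs above name the new columns A, B, C (or valuesA, valuesB, valuesC) rather than the is_A, is_B, is_C that the question asks for. A minimal sketch of adding the prefix, assuming the question's data is rebuilt by hand as below (the Label_1 and Label_2 column names follow the answer's output):

```r
# The question's example data, reconstructed by hand
df <- data.frame(Label_1 = c("A", "A", "B", "C"),
                 Label_2 = c("B", "C", "C", "A"),
                 stringsAsFactors = FALSE)

# 0/1 indicator matrix with one column per distinct value
flags <- +(unclass(table(row(df), unlist(df))) > 0)

# Prefix the value names so the columns read is_A, is_B, is_C
colnames(flags) <- paste0("is_", colnames(flags))

result <- cbind(df, flags)
names(result)  # "Label_1" "Label_2" "is_A" "is_B" "is_C"
```

Because the column names are derived from the data, the same lines should work unchanged with 50 or more categories.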
