Posted 2023-02-19 09:38:58

One-hot encoding when data is stored across multiple columns

Question

Say I have a data frame:

primary_color	secondary_color	tertiary_color
red	blue	green
yellow	red	NA

I want to encode it by checking whether each color appears in any of the three columns (1) or in none of them (0). So it should yield:

red	blue	green	yellow
1	1	1	0
1	0	0	1

I'm working in R. I know I could do this by writing out a bunch of ifelse statements for each color, but my actual problem has a lot more colors. Is there a more concise way to do this?

Answer 1

Score: 2

You can add a row-number column to track each row, reshape the data to long format, and then pivot it back to wide format by counting the occurrences of each color.

library(dplyr)
library(tidyr)

df %>%
  mutate(row = row_number()) %>%
  pivot_longer(cols = -row) %>%
  pivot_wider(names_from = value, values_from = value, id_cols = row,
              values_fn = length, values_fill = 0) %>%
  select(-row)

#    red  blue green yellow
#  <int> <int> <int>  <int>
#1     1     1     1      0
#2     1     1     0      1

Data

Note that in this sample data the second row's tertiary color is "blue" rather than NA as in the question, which is why blue is 1 in the second row of the output.

df <- structure(list(primary_color = c("red", "yellow"), secondary_color =
c("blue", "red"), tertiary_color = c("green", "blue")), row.names = c(NA,
-2L), class = "data.frame")

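With the question's own data, which contains an NA, the pipeline above would likely produce an extra NA column, and a color repeated within a row would be counted more than once. The following is a minimal sketch of one way to handle both cases, not the answer author's code. It assumes the question's data frame; the names `present` and `encoded` are illustrative.

```r
library(dplyr)
library(tidyr)

# The question's data, including the missing tertiary color in row 2
df <- data.frame(
  primary_color   = c("red", "yellow"),
  secondary_color = c("blue", "red"),
  tertiary_color  = c("green", NA)
)

encoded <- df %>%
  mutate(row = row_number()) %>%
  # values_drop_na = TRUE discards NA cells, so no "NA" column is created
  pivot_longer(cols = -row, values_drop_na = TRUE) %>%
  # keep one record per (row, color) so a repeated color still yields 1
  distinct(row, value) %>%
  mutate(present = 1L) %>%
  pivot_wider(names_from = value, values_from = present, values_fill = 0L) %>%
  select(-row)

encoded
```

Using `distinct()` before widening means no aggregation function is needed, at the cost of one extra step.
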
Answer 2

Score: 2

In base R you could use `sapply` with a function that checks each row against the vector of desired names:

```R
nnames <- c("red", "blue", "green", "yellow")
# For each row, test which of the desired names occur in it;
# multiplying by 1 turns TRUE/FALSE into 1/0
new_df <- t(sapply(seq_len(nrow(df)),
                   function(x) (nnames %in% df[x, ]) * 1))
colnames(new_df) <- nnames
#  red blue green yellow
#1   1    1     1      0
#2   1    0     0      1
```

Note that if you didn't care about the order of the columns in the second table, you could generalize nnames to `nnames <- unique(unlist(df[!is.na(df)]))`.

Data

df <- read.table(text = "primary_color    secondary_color    tertiary_color
red    blue    green
yellow    red    NA", header = TRUE)
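
The membership check can also be wrapped in a small reusable function. The sketch below is a variant under stated assumptions rather than the answer's code: `one_hot_any` is a hypothetical name, and it loops over colors instead of rows, comparing each color against the whole data at once.

```r
# Hypothetical helper: 1 if a color appears anywhere in the row, else 0.
# `levels` fixes the column order; if NULL, colors are taken from the data.
one_hot_any <- function(df, levels = NULL) {
  m <- as.matrix(df)
  if (is.null(levels)) {
    levels <- unique(m[!is.na(m)])
  }
  hits <- vapply(
    levels,
    # m == lv is NA for missing cells; na.rm = TRUE ignores them
    function(lv) as.integer(rowSums(m == lv, na.rm = TRUE) > 0),
    integer(nrow(m))
  )
  # Rebuild as a matrix so a single-row input keeps its shape
  matrix(hits, nrow = nrow(m), dimnames = list(NULL, levels))
}

df <- data.frame(
  primary_color   = c("red", "yellow"),
  secondary_color = c("blue", "red"),
  tertiary_color  = c("green", NA)
)

res <- one_hot_any(df, c("red", "blue", "green", "yellow"))
res
#      red blue green yellow
# [1,]   1    1     1      0
# [2,]   1    0     0      1
```

When `levels` is omitted, the columns follow the order in which colors first appear when the data is read column by column.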

Answer 3

Score: 1

Using mtabulate from the qdapTools package:

library(qdapTools)
mtabulate(as.data.frame(t(df1)))
#    blue green red yellow
# V1    1     1   1      0
# V2    1     0   1      1

Or with base R:

table(c(row(df1)), unlist(df1))
#      blue green red yellow
#   1    1     1   1      0
#   2    1     0   1      1

The answer does not define df1; the output shown is consistent with Answer 1's sample data, where the second row's tertiary color is "blue" rather than NA.

Answer 4

Score: 1

Using outer:

# unique(unlist(dat)) gives red, yellow, blue, green, NA;
# the index reorders it to red, blue, green, yellow and drops the NA
uc <- unique(unlist(dat))[c(1, 3, 4, 2)]
t(+outer(uc, asplit(dat, 1), Vectorize(`%in%`))) |> `colnames<-`(uc)
#      red blue green yellow
# [1,]   1    1     1      0
# [2,]   1    0     0      1

Data:

dat <- structure(list(primary_color = c("red", "yellow"), secondary_color = c("blue", 
"red"), tertiary_color = c("green", NA)), class = "data.frame", row.names = c(NA, 
-2L))

Answer 5

Score: 1

In base R:

table(row(df), as.matrix(df))
#    
#     blue green red yellow
#   1    1     1   1      0
#   2    0     0   1      1

If you want it as a data.frame:

as.data.frame.matrix(table(row(df), as.matrix(df)))
#   blue green red yellow
# 1    1     1   1      0
# 2    0     0   1      1

If the same color can appear in several columns of the same row, convert the counts to 0/1:

+(table(row(df), as.matrix(df)) > 0)
#    
#     blue green red yellow
#   1    1     1   1      0
#   2    0     0   1      1
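
table() sorts its columns alphabetically and only includes colors that actually occur in the data. One possible way to control both is to convert the values to a factor with explicit levels. This is a sketch that assumes the full set of colors is known in advance; `color_levels` and the extra "purple" entry are illustrative and not part of the answer.

```r
df <- data.frame(
  primary_color   = c("red", "yellow"),
  secondary_color = c("blue", "red"),
  tertiary_color  = c("green", NA)
)

# Desired column order, including a color that never occurs in the data
color_levels <- c("red", "blue", "green", "yellow", "purple")

# The factor levels fix both the order and the set of columns;
# NA cells are left out of the counts by table()'s default settings
counts <- table(row(df), factor(as.matrix(df), levels = color_levels))

# Cap the counts at 1 and convert to a plain data frame
encoded <- as.data.frame.matrix(+(counts > 0))
encoded
```

Colors listed in `color_levels` but absent from the data get an all-zero column, which keeps the output shape stable across different inputs.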
