One-hot encoding when data is stored across multiple columns

Question

Say I have a data frame:

    primary_color  secondary_color  tertiary_color
    red            blue             green
    yellow         red              NA

I want to encode it by checking whether each color appears in any of the three columns (1) or in none of them (0). The result should be:

    red  blue  green  yellow
    1    1     1      0
    1    0     0      1

I'm working in R. I know I could do this by writing out a bunch of ifelse statements for each color, but my actual problem has many more colors. Is there a more concise way to do this?

Answer 1

Score: 2

You can add a column holding the row number to keep track of each row, reshape the data to long format, and then bring it back to wide format by counting the occurrences of each color.

    library(dplyr)
    library(tidyr)

    df %>%
      mutate(row = row_number()) %>%                # remember which row each value came from
      pivot_longer(cols = -row) %>%                 # one line per (row, color) pair
      pivot_wider(names_from = value, values_from = value, id_cols = row,
                  values_fn = length, values_fill = 0) %>%  # count per color, 0 if absent
      select(-row)

    #    red  blue green yellow
    #  <int> <int> <int>  <int>
    #1     1     1     1      0
    #2     1     1     0      1

Note that the sample data used here has "blue" rather than NA in the second row of tertiary_color, which is why blue is 1 in the second row of the output.

**Data**

    df <- structure(list(primary_color = c("red", "yellow"),
                         secondary_color = c("blue", "red"),
                         tertiary_color = c("green", "blue")),
                    row.names = c(NA, -2L), class = "data.frame")
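The sample data in the pivot approach above contains no missing values, whereas the question's data has an NA in tertiary_color. The following is a minimal sketch of the same approach run on the question's data, assuming that NA cells should simply be ignored. It uses the `values_drop_na` argument of `pivot_longer()` to discard them before widening; the name `encoded` is only an illustrative choice.

```r
library(dplyr)
library(tidyr)

# The question's data, with NA in the last cell
df <- data.frame(
  primary_color   = c("red", "yellow"),
  secondary_color = c("blue", "red"),
  tertiary_color  = c("green", NA)
)

encoded <- df %>%
  mutate(row = row_number()) %>%
  # Drop NA cells so that no column named "NA" is created later
  pivot_longer(cols = -row, values_drop_na = TRUE) %>%
  pivot_wider(id_cols = row, names_from = value, values_from = value,
              values_fn = length, values_fill = 0) %>%
  select(-row)

encoded
```

One caveat with this sketch: a row whose cells are all NA would disappear from the long format and therefore from the result, so such rows would need separate handling.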



Answer 2

Score: 2

In base R you could use `sapply` with a function that checks each row against the vector of desired names:

```R
# Colors to encode, in the desired column order
nnames <- c("red", "blue", "green", "yellow")

# For each row, flag which colors occur in it; * 1 turns TRUE/FALSE into 1/0
new_df <- t(sapply(seq_len(nrow(df)),
                   function(x) (nnames %in% df[x, ]) * 1))

colnames(new_df) <- nnames

#  red blue green yellow
#1   1    1     1      0
#2   1    0     0      1
```

Note that if you don't care about the order of the columns in the output table, you could generalize nnames to `nnames <- unique(unlist(df[!is.na(df)]))`.

Data:

    df <- read.table(text = "primary_color secondary_color tertiary_color
    red blue green
    yellow red NA", header = TRUE)
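As a variation on the row-wise check above, the loop can run over colors instead of rows, which may be convenient when there are many rows and comparatively few colors. This is only a sketch, assuming the columns are character vectors; `color_names`, `has_color`, and `encoded` are illustrative names that do not come from the answers.

```r
df <- data.frame(
  primary_color   = c("red", "yellow"),
  secondary_color = c("blue", "red"),
  tertiary_color  = c("green", NA),
  stringsAsFactors = FALSE
)

color_names <- c("red", "blue", "green", "yellow")

# For one color, return 1 for rows where it appears in any column, else 0.
# NA cells compare as NA and are skipped by na.rm = TRUE.
has_color <- function(color) {
  as.integer(rowSums(df == color, na.rm = TRUE) > 0)
}

# One column per color; vapply names the columns after color_names
encoded <- vapply(color_names, has_color, FUN.VALUE = integer(nrow(df)))

encoded
```

Using `vapply` rather than `sapply` fixes the length and type of each result, so a mistake in `has_color` fails loudly instead of silently changing the shape of the output.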

Answer 3

Score: 1

Using mtabulate:

    library(qdapTools)
    mtabulate(as.data.frame(t(df1)))
       blue green red yellow
    V1    1     1   1      0
    V2    1     0   1      1

Or with base R:

    table(c(row(df1)), unlist(df1))
         blue green red yellow
      1    1     1   1      0
      2    1     0   1      1

Here df1 is the input data frame. The answer does not define it, but the output shown appears to correspond to data whose second row has "blue" rather than NA in tertiary_color.

Answer 4

Score: 1

Using outer:

    # Unique values in order of appearance are red, yellow, blue, green, NA;
    # the index reorders them to red, blue, green, yellow and drops NA
    uc <- unique(unlist(dat))[c(1, 3, 4, 2)]
    t(+outer(uc, asplit(dat, 1), Vectorize(`%in%`))) |> `colnames<-`(uc)
    #      red blue green yellow
    # [1,]   1    1     1      0
    # [2,]   1    0     0      1

Data:

    dat <- structure(list(primary_color = c("red", "yellow"),
                          secondary_color = c("blue", "red"),
                          tertiary_color = c("green", NA)),
                     class = "data.frame", row.names = c(NA, -2L))

Answer 5

Score: 1

In base R:

    table(row(df), as.matrix(df))

        blue green red yellow
      1    1     1   1      0
      2    0     0   1      1

If you want it as a data.frame:

    as.data.frame.matrix(table(row(df), as.matrix(df)))

      blue green red yellow
    1    1     1   1      0
    2    0     0   1      1

If the same color can appear in several columns of one row, cap the counts at 1:

    +(table(row(df), as.matrix(df)) > 0)

        blue green red yellow
      1    1     1   1      0
      2    0     0   1      1
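The `table()` approach above returns its columns in alphabetical order. If a specific column order matters, one option is to convert the values to a factor with explicit levels before tabulating. The sketch below assumes the desired order is known in advance and adds a made-up third row (blue, blue, NA) purely to show the capping at 1; `lvls`, `tab`, and `encoded` are illustrative names.

```r
df <- data.frame(
  primary_color   = c("red", "yellow", "blue"),
  secondary_color = c("blue", "red", "blue"),
  tertiary_color  = c("green", NA, NA),
  stringsAsFactors = FALSE
)

# Desired column order (assumed to be known up front)
lvls <- c("red", "blue", "green", "yellow")

# Counts per row and color; NA cells are excluded by table() by default
tab <- table(row(df), factor(as.matrix(df), levels = lvls))

# Cap counts at 1 and convert to a plain data frame
encoded <- as.data.frame.matrix(+(tab > 0))

encoded
```

Fixing the factor levels also guarantees a column for every listed color, including colors that happen not to occur in the data.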

huangapple
  • Posted on February 19, 2023 at 09:38:58
  • When reposting, please keep the link to this article: https://go.coder-hub.com/75497485.html