One-hot encoding when data is stored across multiple columns
Question
Say I have a data frame:
primary_color | secondary_color | tertiary_color |
---|---|---|
red | blue | green |
yellow | red | NA |
I want to encode it by checking whether each color appears in any of the three columns (1) or in none of them (0). So it should yield:
red | blue | green | yellow |
---|---|---|---|
1 | 1 | 1 | 0 |
1 | 0 | 0 | 1 |
I'm working in R. I know I could do this by writing out a bunch of ifelse statements for each color, but my actual problem has a lot more colors. Is there a more concise way to do this?
Answer 1
Score: 2
You can add a row-number column to track each row, reshape the data to long format, and then pivot it back to wide format, counting the occurrences of each color.
library(dplyr)
library(tidyr)
df %>%
  mutate(row = row_number()) %>%
  pivot_longer(cols = -row) %>%
  pivot_wider(names_from = value, values_from = value, id_cols = row,
              values_fn = length, values_fill = 0) %>%
  select(-row)
#     red  blue green yellow
#   <int> <int> <int>  <int>
# 1     1     1     1      0
# 2     1     1     0      1
Note that the sample data below has blue rather than NA as the second row's tertiary color, which is why blue is 1 in the second row of the output.
**data**
df <- structure(list(primary_color = c("red", "yellow"), secondary_color =
c("blue", "red"), tertiary_color = c("green", "blue")), row.names = c(NA,
-2L), class = "data.frame")
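The question's data contains a missing value, and a color could in principle repeat within a row. A minimal sketch of a variant of the long-then-wide approach that handles both, assuming dplyr and tidyr are installed: `values_drop_na = TRUE` drops missing colors before widening, and `distinct()` keeps every cell at 0 or 1. The helper column name `present` is an assumption made for illustration.

```r
library(dplyr)
library(tidyr)

# Data as given in the question (second row has a missing tertiary color).
df <- data.frame(
  primary_color   = c("red", "yellow"),
  secondary_color = c("blue", "red"),
  tertiary_color  = c("green", NA)
)

encoded <- df %>%
  mutate(row = row_number()) %>%                         # remember the original row
  pivot_longer(cols = -row, values_drop_na = TRUE) %>%   # long format, NA colors dropped
  distinct(row, value) %>%                               # one entry per row/color pair
  mutate(present = 1L) %>%                               # indicator to spread out
  pivot_wider(id_cols = row, names_from = value,
              values_from = present, values_fill = 0L) %>%
  select(-row)

encoded
```

Columns come out in order of first appearance in the long data, so no manual ordering is needed for this input.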
Answer 2
Score: 2
In base R you could use `sapply` with a function that checks each row against the vector of desired names:
nnames <- c("red", "blue", "green", "yellow")
new_df <- t(sapply(seq_len(nrow(df)),
                   function(x) (nnames %in% df[x, ]) * 1))
colnames(new_df) <- nnames
#   red blue green yellow
#1    1    1     1      0
#2    1    0     0      1
Note that if you don't care about the order of the columns in the result, you could generalize `nnames` to `nnames <- unique(unlist(df[!is.na(df)]))`.
Data:
df <- read.table(text = "primary_color secondary_color tertiary_color
red blue green
yellow red NA", header = TRUE)
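As a sketch, the row-wise `%in%` check can be wrapped in a small reusable helper. The function name `encode_presence` and its `levels` argument are hypothetical, introduced here only for illustration. When `levels` is omitted, the columns are derived from the data and appear in order of first occurrence, column by column.

```r
# Data as given in the question.
df <- data.frame(
  primary_color   = c("red", "yellow"),
  secondary_color = c("blue", "red"),
  tertiary_color  = c("green", NA)
)

# Hypothetical helper: one 0/1 indicator column per distinct non-missing value.
encode_presence <- function(data, levels = NULL) {
  m <- as.matrix(data)                       # character matrix, one row per record
  if (is.null(levels)) {
    levels <- unique(m[!is.na(m)])           # distinct values, NA excluded
  }
  out <- matrix(0L, nrow = nrow(m), ncol = length(levels),
                dimnames = list(NULL, levels))
  for (i in seq_len(nrow(m))) {
    out[i, ] <- as.integer(levels %in% m[i, ])  # is each level present in row i?
  }
  out
}

result <- encode_presence(df, levels = c("red", "blue", "green", "yellow"))
result
```

Passing `levels` explicitly fixes the column order and also lets the result include colors that never occur in the data (as all-zero columns).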
Answer 3
Score: 1
Using `mtabulate` from the qdapTools package:
library(qdapTools)
mtabulate(as.data.frame(t(df1)))
#    blue green red yellow
# V1    1     1   1      0
# V2    1     0   1      1
Or with base R:
table(c(row(df1)), unlist(df1))
#   blue green red yellow
# 1    1     1   1      0
# 2    1     0   1      1
`df1` is not defined in this answer. The output shown (blue is 1 in the second row) appears to correspond to the sample data from Answer 1 rather than to the question's data, where that cell is NA.
Answer 4
Score: 1
Using `outer`:
uc <- unique(unlist(dat))[c(1, 3, 4, 2)]
t(+outer(uc, asplit(dat, 1), Vectorize(`%in%`))) |> `colnames<-`(uc)
#      red blue green yellow
# [1,]   1    1     1      0
# [2,]   1    0     0      1
The index `c(1, 3, 4, 2)` reorders the unique values to red, blue, green, yellow and drops the trailing NA.
Data:
dat <- structure(list(primary_color = c("red", "yellow"), secondary_color = c("blue",
"red"), tertiary_color = c("green", NA)), class = "data.frame", row.names = c(NA,
-2L))
Answer 5
Score: 1
In base R:
table(row(df), as.matrix(df))
#   blue green red yellow
# 1    1     1   1      0
# 2    0     0   1      1
If you want it as a data.frame:
as.data.frame.matrix(table(row(df), as.matrix(df)))
#   blue green red yellow
# 1    1     1   1      0
# 2    0     0   1      1
If the same color can appear in more than one column of the same row, the counts can exceed 1, so convert them to 0/1:
+(table(row(df), as.matrix(df)) > 0)
#   blue green red yellow
# 1    1     1   1      0
# 2    0     0   1      1
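To see why the final `> 0` step matters, here is a small sketch with made-up data (not from the question) in which red appears twice in the first row. The plain `table` call reports a count of 2 there, while the thresholded version stays binary. Missing values are left out of the tabulation by `table`'s default settings.

```r
# Illustrative data: "red" is repeated within row 1, and row 2 has an NA.
df <- data.frame(
  primary_color   = c("red", "yellow"),
  secondary_color = c("red", "red"),
  tertiary_color  = c("green", NA)
)

counts <- table(row(df), as.matrix(df))   # occurrences of each color per row
binary <- +(counts > 0)                   # collapse counts to 0/1 indicators

counts
binary
as.data.frame.matrix(binary)              # same indicators as a data.frame
```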