Is there a way to avoid a for-loop here?
Question
I have a character variable that stores the digits 0 to 5 in strings of varying length. I want to create six dummy variables (one per digit, 0 to 5) that show whether the digit exists in the given row. I am able to achieve this by:
library(data.table)
dataset <-
data.table(
'char' = c('1 2 3 0', '1 5 0', '1 2 0', '1 0', '1 2 4 0')
)
for(i in c(0:5)){
dataset[grepl(i, char), c(paste0('Idx_', i)) := 1]
}
Resulting in:
char Idx_0 Idx_1 Idx_2 Idx_3 Idx_4 Idx_5
1: 1 2 3 0 1 1 1 1 NA NA
2: 1 5 0 1 1 NA NA NA 1
3: 1 2 0 1 1 1 NA NA NA
4: 1 0 1 1 NA NA NA NA
5: 1 2 4 0 1 1 1 NA 1 NA
Since my dataset is quite large and I have learned that it is normally a good idea to avoid for-loops, I am curious whether it is possible to do this without a for-loop. I tried combinations of .SD, apply and "by = 1:nrow(dataset)", but none of them worked for me.
Answer 1
Score: 4
I would recommend just modifying your current approach slightly to make it a bit faster (for loops are not always bad in R):
for (i in 0:5) {
set(dataset, grep(i, dataset$char, fixed=TRUE), j=paste0('Idx_', i), 1L)
}
Another alternative:
dataset[, unlist(strsplit(char, " ")), by = .I
][, dcast(
.SD,
I ~ paste0("idx_", V1),
fun.aggregate = \(x) if (length(x)) 1L else NA_integer_
)]
# I idx_0 idx_1 idx_2 idx_3 idx_4 idx_5
# <int> <int> <int> <int> <int> <int> <int>
# 1: 1 1 1 1 1 NA NA
# 2: 2 1 1 NA NA NA 1
# 3: 3 1 1 1 NA NA NA
# 4: 4 1 1 NA NA NA NA
# 5: 5 1 1 1 NA 1 NA
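If 0/1 indicators are acceptable instead of 1/NA (an assumption, the question itself uses NA), the same set() loop can fill every row of each new column in one call. This is only a sketch of that variant, not a benchmarked recommendation:

```r
library(data.table)

dataset <- data.table(
  char = c("1 2 3 0", "1 5 0", "1 2 0", "1 0", "1 2 4 0")
)

# Assumption: 0L for "digit absent" is acceptable instead of NA.
for (i in 0:5) {
  set(
    dataset,
    j = paste0("Idx_", i),
    # fixed = TRUE does plain substring matching, no regex engine involved
    value = as.integer(grepl(as.character(i), dataset$char, fixed = TRUE))
  )
}

dataset
```

Because set() modifies the table by reference, the loop adds the six columns without copying the data.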
Answer 2
Score: 3
This would be the functional approach:
library(data.table)
dataset <-
data.table(
'char' = c('1 2 3 0', '1 5 0', '1 2 0', '1 0', '1 2 4 0')
)
dataset[,(paste0('Idx_', 0:5)) := lapply(0:5, \(x) ifelse(grepl(x, char), 1, NA))]
dataset
#> char Idx_0 Idx_1 Idx_2 Idx_3 Idx_4 Idx_5
#> 1: 1 2 3 0 1 1 1 1 NA NA
#> 2: 1 5 0 1 1 NA NA NA 1
#> 3: 1 2 0 1 1 1 NA NA NA
#> 4: 1 0 1 1 NA NA NA NA
#> 5: 1 2 4 0 1 1 1 NA 1 NA
Answer 3
Score: 1
When the values are multi-digit numbers rather than single digits, grepl would match 1 inside 11 as well. To avoid this we could split on space (tstrsplit), reshape wide-to-long (melt), and reshape back long-to-wide (dcast) with fun.aggregate, see example:
#example data with 11 and 23
d <- data.table(char = c('1 2 3 0',
'11 5 0',
'1 23 0',
'1 0',
'1 2 4 0'))
# get the maximum number of columns
colMax <- max(stringr::str_count(d$char, " ")) + 1
d[, paste0("c", seq.int(colMax)) := tstrsplit(char, split = " ", type.convert = TRUE)
][, melt(.SD, id.vars = "char")
][ !is.na(value), dcast(.SD, char ~ value,
fun.aggregate = \(x){ as.integer(any(!is.na(x))) }) ]
# char 0 1 2 3 4 5 11 23
# 1: 1 0 1 1 0 0 0 0 0 0
# 2: 1 2 3 0 1 1 1 1 0 0 0 0
# 3: 1 2 4 0 1 1 1 0 1 0 0 0
# 4: 1 23 0 1 1 0 0 0 0 0 1
# 5: 11 5 0 1 0 0 0 0 1 1 0
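The substring problem described above can be seen directly in a small base R sketch. The word-boundary pattern is an alternative that is not part of this answer; it assumes the tokens are separated by spaces as in the example data:

```r
x <- c("1 2 3 0", "11 5 0", "1 23 0")

# Plain substring matching: "1" is also found inside "11"
plain <- grepl("1", x, fixed = TRUE)

# Word-boundary matching: only a standalone "1" token counts
bounded <- grepl("\\b1\\b", x)

data.frame(x, plain, bounded)
```

With the boundary pattern, the second string no longer counts as containing 1, which is the behaviour the split-and-reshape approach achieves by comparing whole tokens.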
Answer 4
Score: 0
This is a base R solution. If your data.frame is very large, you can use parallel::parLapply from the parallel package instead of the outer lapply.
# I use a normal data frame instead
dataset <- data.frame('char' = c('1 2 3 0', '1 5 0', '1 2 0', '1 0', '1 2 4 0'))
# reading from the inside out: split each string on whitespace, convert the
# pieces to numeric, then check which of the digits 0:5 occur in each row,
# giving one column per digit in the new data frame
do.call(rbind, lapply(lapply(strsplit(dataset[, "char"], "\\s"), as.numeric), \(numbers){
# use %in% to see which of the digits 0:5 are in the respective row
seq(0, 5) %in% numbers
})) -> matched_res
# fix colnames
colnames(matched_res) <- paste0("ind_", 0:5)
# bind
cbind(dataset, matched_res)
# char ind_0 ind_1 ind_2 ind_3 ind_4 ind_5
# 1 1 2 3 0 TRUE TRUE TRUE TRUE FALSE FALSE
# 2 1 5 0 TRUE TRUE FALSE FALSE FALSE TRUE
# 3 1 2 0 TRUE TRUE TRUE FALSE FALSE FALSE
# 4 1 0 TRUE TRUE FALSE FALSE FALSE FALSE
# 5 1 2 4 0 TRUE TRUE TRUE FALSE TRUE FALSE
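A minimal sketch of the parallel variant mentioned at the start of this answer might look as follows. The cluster size of two workers and the intermediate name tokens are assumptions for illustration; for a table this small the cluster start-up cost will outweigh any gain:

```r
library(parallel)

dataset <- data.frame(char = c("1 2 3 0", "1 5 0", "1 2 0", "1 0", "1 2 4 0"))

# Split each string on whitespace and convert the pieces to numbers
tokens <- lapply(strsplit(dataset$char, "\\s+"), as.numeric)

cl <- makeCluster(2)  # assumption: two local worker processes
# parLapply replaces the outer lapply; each worker checks which of 0:5 occur
matched <- parLapply(cl, tokens, function(numbers) 0:5 %in% numbers)
stopCluster(cl)

matched_res <- do.call(rbind, matched)
colnames(matched_res) <- paste0("ind_", 0:5)
res <- cbind(dataset, matched_res)
res
```

The anonymous function only uses its argument and base R operators, so nothing has to be exported to the workers.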
Answer 5
Score: -1
A Tidyverse approach just for the record (not trying to compete in terms of speed here...):
library(tidyverse)
df <- tibble('char' = c('1 2 3 0', '1 5 0', '1 2 0', '1 0', '1 2 4 0'))
df |>
mutate(row = row_number(), .before = everything()) |>
separate_longer_delim(char, delim = " ") |>
arrange(char) |>
pivot_wider(
names_from = char,
names_prefix = "Idx_",
values_from = char,
values_fn = \(x) 1
) |>
select(!row) |>
mutate(char = df$char, .before = everything())
#> # A tibble: 5 × 7
#> char Idx_0 Idx_1 Idx_2 Idx_3 Idx_4 Idx_5
#> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 1 2 3 0 1 1 1 1 NA NA
#> 2 1 5 0 1 1 NA NA NA 1
#> 3 1 2 0 1 1 1 NA NA NA
#> 4 1 0 1 1 NA NA NA NA
#> 5 1 2 4 0 1 1 1 NA 1 NA
Created on 2023-04-13 with reprex v2.0.2