Is there a way to avoid a for-loop here?
Question
I have a character variable that stores the numbers 0 to 5 in strings of varying length. I want to create six dummy variables (one per number from 0 to 5) that show whether that number exists in the given row. I am able to achieve this with:
library(data.table)
dataset <- 
  data.table(
    'char' = c('1 2 3 0', '1 5 0', '1 2 0', '1 0', '1 2 4 0')
  )
for(i in c(0:5)){
  
  dataset[grepl(i, char), c(paste0('Idx_', i)) := 1]
  
}
Resulting in:
     char    Idx_0 Idx_1 Idx_2 Idx_3 Idx_4 Idx_5
1: 1 2 3 0     1     1     1     1    NA    NA
2:   1 5 0     1     1    NA    NA    NA     1
3:   1 2 0     1     1     1    NA    NA    NA
4:     1 0     1     1    NA    NA    NA    NA
5: 1 2 4 0     1     1     1    NA     1    NA
Since my dataset is quite large and I have learned that it is normally a good idea to avoid for-loops, I am curious whether it is possible to do this without one. I tried combinations of .SD, apply and "by = 1:nrow(dataset)", but none of them worked for me.
Answer 1
Score: 4
I would recommend just modifying your current approach to make it slightly faster (for loops are not always bad in R):
for (i in 0:5) {
  set(dataset, grep(i, dataset$char, fixed=TRUE), j=paste0('Idx_', i), 1L)
}
Another alternative:
dataset[, unlist(strsplit(char, " ")), by = .I
        ][, dcast(
              .SD, 
              I ~ paste0("idx_", V1), 
              fun.aggregate = \(x) if (length(x)) 1L else NA_integer_
            )]
#        I idx_0 idx_1 idx_2 idx_3 idx_4 idx_5
#    <int> <int> <int> <int> <int> <int> <int>
# 1:     1     1     1     1     1    NA    NA
# 2:     2     1     1    NA    NA    NA     1
# 3:     3     1     1     1    NA    NA    NA
# 4:     4     1     1    NA    NA    NA    NA
# 5:     5     1     1     1    NA     1    NA
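Both snippets above leave NA where a digit is absent. If 0/1 indicators are preferred, the set() loop can be adapted as in the following sketch. It assumes the same example data as in the question; the only change is that every row is assigned as.integer(grepl(...)) instead of assigning 1L to the matching rows only.

```r
library(data.table)

dataset <- data.table(
  char = c('1 2 3 0', '1 5 0', '1 2 0', '1 0', '1 2 4 0')
)

for (i in 0:5) {
  # grepl() returns TRUE/FALSE for every row; as.integer() turns that into 1/0.
  # fixed = TRUE treats the pattern as a literal string rather than a regex.
  set(
    dataset,
    j = paste0('Idx_', i),
    value = as.integer(grepl(as.character(i), dataset$char, fixed = TRUE))
  )
}

dataset
```

Because set() assigns by reference, the loop adds the six columns without copying the table.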
Answer 2
Score: 3
This would be the functional approach:
library(data.table)
dataset <- 
  data.table(
    'char' = c('1 2 3 0', '1 5 0', '1 2 0', '1 0', '1 2 4 0')
  )
dataset[,(paste0('Idx_', 0:5)) := lapply(0:5, \(x) ifelse(grepl(x, char), 1, NA))]
dataset
#>       char Idx_0 Idx_1 Idx_2 Idx_3 Idx_4 Idx_5
#> 1: 1 2 3 0     1     1     1     1    NA    NA
#> 2:   1 5 0     1     1    NA    NA    NA     1
#> 3:   1 2 0     1     1     1    NA    NA    NA
#> 4:     1 0     1     1    NA    NA    NA    NA
#> 5: 1 2 4 0     1     1     1    NA     1    NA
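The functional approach above hard-codes the set of digits. If the values are not known in advance, they could be derived from the data first, as in this sketch. The names tokens, values and cols are illustrative and not part of the original answer, and the strings are split once up front so that each token is compared exactly rather than by substring.

```r
library(data.table)

dataset <- data.table(
  char = c('1 2 3 0', '1 5 0', '1 2 0', '1 0', '1 2 4 0')
)

# Split each string once; tokens is a list with one character vector per row
tokens <- strsplit(dataset$char, " ", fixed = TRUE)

# All distinct values that occur anywhere in the column
values <- sort(unique(unlist(tokens)))
cols   <- paste0("Idx_", values)

# One integer column per value: 1L if the row contains it, NA otherwise
dataset[, (cols) := lapply(values, function(v) {
  vapply(tokens, function(tk) if (v %in% tk) 1L else NA_integer_, integer(1))
})]

dataset
```

Note that values is sorted as character, so with multi-digit numbers the column order would be lexicographic rather than numeric.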
Answer 3
Score: 1
When the values are multi-digit numbers rather than single digits, grepl would match 1 inside 11 as well. To avoid this we could split (tstrsplit) on space, reshape wide-to-long (melt) and reshape back again long-to-wide (dcast) with fun.aggregate, see example:
library(data.table)
# example data with 11 and 23
d <- data.table(char = c('1 2 3 0', 
                         '11 5 0', 
                         '1 23 0', 
                         '1 0',
                         '1 2 4 0'))
# get the maximum number of columns
colMax <- max(stringr::str_count(d$char, " ")) + 1
d[, paste0("c", seq.int(colMax)) := tstrsplit(char, split = " ", type.convert = TRUE) 
        ][, melt(.SD, id.vars = "char") 
          ][ !is.na(value), dcast(.SD, char ~ value, 
                                  fun.aggregate = \(x){ as.integer(any(!is.na(x))) }) ]
#       char 0 1 2 3 4 5 11 23
# 1:     1 0 1 1 0 0 0 0  0  0
# 2: 1 2 3 0 1 1 1 1 0 0  0  0
# 3: 1 2 4 0 1 1 1 0 1 0  0  0
# 4:  1 23 0 1 1 0 0 0 0  0  1
# 5:  11 5 0 1 0 0 0 0 1  1  0
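If the full reshape is more than is needed, the 1-versus-11 problem can also be avoided by anchoring the pattern with word boundaries. The following base R sketch assumes the same example strings and a hand-picked vector values of numbers to look for (both are illustrative). It is a different technique from the melt/dcast pipeline above, not a rewrite of it.

```r
char   <- c('1 2 3 0', '11 5 0', '1 23 0', '1 0', '1 2 4 0')
values <- c(0, 1, 2, 3, 4, 5, 11, 23)

# "\\b" matches the empty string at the edge of a word, so "\\b1\\b"
# matches a stand-alone 1 but not the 1 inside 11
ind <- sapply(values, function(v) {
  as.integer(grepl(paste0("\\b", v, "\\b"), char))
})
colnames(ind) <- paste0("Idx_", values)

result <- cbind(data.frame(char = char), ind)
result
```

Here the row order of the input is preserved, whereas dcast sorts the result by the char column.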
Answer 4
Score: 0
This is a base R solution. If your data.frame is very large, you can use parallel::parLapply from the parallel package instead of the outer lapply.
# I use a normal data frame instead
dataset <- data.frame('char' = c('1 2 3 0', '1 5 0', '1 2 0', '1 0', '1 2 4 0'))
# reading from inside out: we first split the strings on whitespace, convert to
# numeric, and then check which of the digits 0:5 occur in each row, thus obtaining
# a column for every digit in our new df
# (the inner call is lapply rather than sapply, because sapply would simplify to a
# matrix whenever every row happens to hold the same number of values)
do.call(rbind, lapply(lapply(strsplit(dataset[, "char"], "\\s"), as.numeric), \(numbers){
  
  # use %in% to see which of the digits 0:5 are in the respective row
  seq(0, 5) %in% numbers
  
})) -> matched_res
# fix colnames 
colnames(matched_res) <- paste0("ind_", 0:5)
# bind
cbind(dataset, matched_res)
#      char ind_0 ind_1 ind_2 ind_3 ind_4 ind_5
# 1 1 2 3 0  TRUE  TRUE  TRUE  TRUE FALSE FALSE
# 2   1 5 0  TRUE  TRUE FALSE FALSE FALSE  TRUE
# 3   1 2 0  TRUE  TRUE  TRUE FALSE FALSE FALSE
# 4     1 0  TRUE  TRUE FALSE FALSE FALSE FALSE
# 5 1 2 4 0  TRUE  TRUE  TRUE FALSE  TRUE FALSE
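For a variant with no explicit per-row function at all, the split tokens can be cross-tabulated against their row index. This is only a sketch under the assumption that the values are limited to 0 to 5; the names parts, row_id, digit, counts and ind are illustrative. Tokens outside 0:5 become NA in the factor and are dropped by table().

```r
char <- c('1 2 3 0', '1 5 0', '1 2 0', '1 0', '1 2 4 0')

# One character vector of tokens per row
parts <- strsplit(char, " ", fixed = TRUE)

# Row index of every token; the explicit levels keep rows that have no tokens
row_id <- factor(rep(seq_along(parts), lengths(parts)), levels = seq_along(parts))

# Fixing the levels guarantees one column per digit, even if a digit never occurs
digit <- factor(unlist(parts), levels = as.character(0:5))

# Rows x digits contingency table of counts, converted to 0/1 indicators
counts <- table(row_id, digit)
ind <- matrix(
  as.integer(counts > 0),
  nrow = nrow(counts),
  dimnames = list(NULL, paste0("ind_", levels(digit)))
)

result <- cbind(data.frame(char = char), ind)
result
```

The trade-off is that all tokens are unlisted into one long vector at once, which can use noticeably more memory on very large inputs.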
Answer 5
Score: -1
A Tidyverse approach just for the record (not trying to compete in terms of speed here...):
library(tidyverse)
df <- tibble('char' = c('1 2 3 0', '1 5 0', '1 2 0', '1 0', '1 2 4 0'))
df |> 
  mutate(row = row_number(), .before = everything()) |> 
  separate_longer_delim(char, delim = " ") |> 
  arrange(char) |> 
  pivot_wider(
    names_from = char, 
    names_prefix = "Idx_",
    values_from = char, 
    values_fn = \(x) 1
  ) |> 
  select(!row) |> 
  mutate(char = df$char, .before = everything())
#> # A tibble: 5 × 7
#>   char    Idx_0 Idx_1 Idx_2 Idx_3 Idx_4 Idx_5
#>   <chr>   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 1 2 3 0     1     1     1     1    NA    NA
#> 2 1 5 0       1     1    NA    NA    NA     1
#> 3 1 2 0       1     1     1    NA    NA    NA
#> 4 1 0         1     1    NA    NA    NA    NA
#> 5 1 2 4 0     1     1     1    NA     1    NA
<sup>Created on 2023-04-13 with reprex v2.0.2</sup>