Is there a way to avoid a for-loop here?

Question

I have a character variable that stores numbers from 0 to 5 in strings of varying length. I want to create six dummy variables (one per number, 0 to 5) that indicate whether that number occurs in the given row. I am able to achieve this with:

library(data.table)

dataset <- 
  data.table(
    'char' = c('1 2 3 0', '1 5 0', '1 2 0', '1 0', '1 2 4 0')
  )

for(i in c(0:5)){
  
  dataset[grepl(i, char), c(paste0('Idx_', i)) := 1]
  
}

Resulting in:

     char    Idx_0 Idx_1 Idx_2 Idx_3 Idx_4 Idx_5
1: 1 2 3 0     1     1     1     1    NA    NA
2:   1 5 0     1     1    NA    NA    NA     1
3:   1 2 0     1     1     1    NA    NA    NA
4:     1 0     1     1    NA    NA    NA    NA
5: 1 2 4 0     1     1     1    NA     1    NA

Since my dataset is quite large and I have learned that it is normally a good idea to avoid for-loops, I am curious whether it is possible to do this without one. I tried combinations of .SD, apply and "by = 1:nrow(dataset)", but none of them worked for me.

Answer 1

Score: 4

I would recommend just modifying your current approach slightly to make it a bit faster (for-loops are not always bad in R):

for (i in 0:5) {
  set(dataset, grep(i, dataset$char, fixed=TRUE), j=paste0('Idx_', i), 1L)
}
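
As a rough check that the set() variant produces the same columns as the `:=` loop from the question, something like the following sketch could be used. The names `dt0`, `a` and `b` are made up for this illustration, and `1L` is used in both loops so that the column types match.

```r
library(data.table)

dt0 <- data.table(char = c('1 2 3 0', '1 5 0', '1 2 0', '1 0', '1 2 4 0'))

# variant 1: `:=` inside the loop (as in the question, but with integer 1L)
a <- copy(dt0)
for (i in 0:5) {
  a[grepl(i, char), c(paste0('Idx_', i)) := 1L]
}

# variant 2: set() inside the loop (as suggested above)
b <- copy(dt0)
for (i in 0:5) {
  set(b, grep(i, b$char, fixed = TRUE), j = paste0('Idx_', i), value = 1L)
}

# compare the two results
all.equal(a, b)
```

Both variants still loop over the six digits; the difference is only that set() skips the overhead of `[.data.table` on each iteration.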

Another alternative:

dataset[, unlist(strsplit(char, " ")), by = .I
        ][, dcast(
              .SD, 
              I ~ paste0("idx_", V1), 
              fun.aggregate = \(x) if (length(x)) 1L else NA_integer_
            )]

#        I idx_0 idx_1 idx_2 idx_3 idx_4 idx_5
#    <int> <int> <int> <int> <int> <int> <int>
# 1:     1     1     1     1     1    NA    NA
# 2:     2     1     1    NA    NA    NA     1
# 3:     3     1     1     1    NA    NA    NA
# 4:     4     1     1    NA    NA    NA    NA
# 5:     5     1     1     1    NA     1    NA

Answer 2

Score: 3

This would be the functional approach:

library(data.table)

dataset <- 
  data.table(
    'char' = c('1 2 3 0', '1 5 0', '1 2 0', '1 0', '1 2 4 0')
  )

dataset[,(paste0('Idx_', 1:5)) := lapply(1:5, \(x) ifelse(grepl(x, char), 1, NA))]

dataset
#>       char Idx_1 Idx_2 Idx_3 Idx_4 Idx_5
#> 1: 1 2 3 0     1     1     1    NA    NA
#> 2:   1 5 0     1    NA    NA    NA     1
#> 3:   1 2 0     1     1    NA    NA    NA
#> 4:     1 0     1    NA    NA    NA    NA
#> 5: 1 2 4 0     1     1    NA     1    NA
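
If 0/1 dummies are preferred over 1/NA, and the digit 0 from the question should be covered as well, the same lapply idea might be adapted roughly as follows. This is only a sketch: the helper name `digits` is an assumption for the illustration, and the 0/1 coding is a variation, not what the question's loop produces.

```r
library(data.table)

dataset <- data.table(char = c('1 2 3 0', '1 5 0', '1 2 0', '1 0', '1 2 4 0'))

digits <- 0:5  # include 0, as in the question

# grepl() returns TRUE/FALSE per row; as.integer() turns that into 1/0
dataset[, paste0('Idx_', digits) := lapply(
  digits,
  function(d) as.integer(grepl(as.character(d), char, fixed = TRUE))
)]

dataset
```

Like the original loop, this relies on substring matching, so it is only appropriate while every value is a single digit.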

Answer 3

Score: 1

When the values are multi-digit numbers rather than single digits, grepl would match 1 and 11 the same way. To avoid this, we could split on spaces (tstrsplit), reshape wide-to-long (melt), and reshape back long-to-wide (dcast) with fun.aggregate; see the example:

# example data with 11 and 23
d <- data.table(char = c('1 2 3 0', 
                         '11 5 0', 
                         '1 23 0', 
                         '1 0',
                         '1 2 4 0'))

# get the maximum number of columns
colMax <- max(stringr::str_count(d$char, " ")) + 1

d[, paste0("c", seq.int(colMax)) := tstrsplit(char, split = " ", type.convert = TRUE) 
        ][, melt(.SD, id.vars = "char") 
          ][ !is.na(value), dcast(.SD, char ~ value, 
                                  fun.aggregate = \(x){ as.integer(any(!is.na(x))) }) ]

#       char 0 1 2 3 4 5 11 23
# 1:     1 0 1 1 0 0 0 0  0  0
# 2: 1 2 3 0 1 1 1 1 0 0  0  0
# 3: 1 2 4 0 1 1 1 0 1 0  0  0
# 4:  1 23 0 1 1 0 0 0 0  0  1
# 5:  11 5 0 1 0 0 0 0 1  1  0
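
The substring problem described in this answer can be seen in a few lines of base R. The sketch below compares a plain substring match with two whole-token alternatives; the vector `x` and the result names are made up for the illustration.

```r
x <- c('1 2 3 0', '11 5 0', '1 23 0')

# plain substring match: "1" is also found inside "11"
substring_hit <- grepl('1', x, fixed = TRUE)

# whole-token match using regex word boundaries
boundary_hit <- grepl('\\b1\\b', x)

# whole-token match by splitting on spaces and testing membership
token_hit <- vapply(
  strsplit(x, ' ', fixed = TRUE),
  function(tokens) '1' %in% tokens,
  logical(1)
)

substring_hit  # TRUE TRUE TRUE
boundary_hit   # TRUE FALSE TRUE
token_hit      # TRUE FALSE TRUE
```

Either whole-token variant could be dropped into the loop-based approaches if the data may contain numbers with more than one digit.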

Answer 4

Score: 0

This is a base R solution. If your data.frame is very large, you can use the parallel package and parallel::parLapply instead of the outer lapply.

# I use a plain data frame instead
dataset <- data.frame('char' = c('1 2 3 0', '1 5 0', '1 2 0', '1 0', '1 2 4 0'))

# reading from the inside out: split each string on whitespace, convert to
# numeric, then test which of the digits 0:5 occur, giving one column per
# digit in the new matrix
# (the inner lapply, rather than sapply, keeps a list even when all rows have the same length)
do.call(rbind, lapply(lapply(strsplit(dataset[, "char"], "\\s"), as.numeric), \(numbers){
  
  # use %in% to see which of the digits 0:5 are in the respective row
  seq(0, 5) %in% numbers
  
})) -> matched_res

# fix colnames 
colnames(matched_res) <- paste0("ind_", 0:5)

# bind
cbind(dataset, matched_res)

#      char ind_0 ind_1 ind_2 ind_3 ind_4 ind_5
# 1 1 2 3 0  TRUE  TRUE  TRUE  TRUE FALSE FALSE
# 2   1 5 0  TRUE  TRUE FALSE FALSE FALSE  TRUE
# 3   1 2 0  TRUE  TRUE  TRUE FALSE FALSE FALSE
# 4     1 0  TRUE  TRUE FALSE FALSE FALSE FALSE
# 5 1 2 4 0  TRUE  TRUE  TRUE FALSE  TRUE FALSE
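
The parallel variant mentioned at the start of this answer might look roughly like the sketch below. It assumes a local cluster of two worker processes; the names `tokens`, `cl`, `rows` and `matched_par` are chosen only for this illustration.

```r
library(parallel)

dataset <- data.frame(
  char = c('1 2 3 0', '1 5 0', '1 2 0', '1 0', '1 2 4 0'),
  stringsAsFactors = FALSE
)

# split each string into its numeric tokens (a list with one vector per row)
tokens <- lapply(strsplit(dataset$char, " ", fixed = TRUE), as.numeric)

# assumption: two local worker processes; adjust to the machine at hand
cl <- makeCluster(2)

# parLapply distributes the list elements across the workers
rows <- parLapply(cl, tokens, function(numbers) 0:5 %in% numbers)

stopCluster(cl)

# stack the per-row logical vectors into a matrix and name the columns
matched_par <- do.call(rbind, rows)
colnames(matched_par) <- paste0("ind_", 0:5)

cbind(dataset, matched_par)
```

For data as small as this example, starting the worker processes probably costs more than the parallel step saves; any benefit would only be expected on much larger inputs.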

Answer 5

Score: -1

A Tidyverse approach just for the record (not trying to compete in terms of speed here...):

library(tidyverse)

df <- tibble('char' = c('1 2 3 0', '1 5 0', '1 2 0', '1 0', '1 2 4 0'))

df |> 
  mutate(row = row_number(), .before = everything()) |> 
  separate_longer_delim(char, delim = " ") |> 
  arrange(char) |> 
  pivot_wider(
    names_from = char, 
    names_prefix = "Idx_",
    values_from = char, 
    values_fn = \(x) 1
  ) |> 
  select(!row) |> 
  mutate(char = df$char, .before = everything())
#> # A tibble: 5 × 7
#>   char    Idx_0 Idx_1 Idx_2 Idx_3 Idx_4 Idx_5
#>   <chr>   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 1 2 3 0     1     1     1     1    NA    NA
#> 2 1 5 0       1     1    NA    NA    NA     1
#> 3 1 2 0       1     1     1    NA    NA    NA
#> 4 1 0         1     1    NA    NA    NA    NA
#> 5 1 2 4 0     1     1     1    NA     1    NA

Created on 2023-04-13 with reprex v2.0.2

huangapple
  • Published on 2023-04-13 17:56:16
  • When reposting, please keep this link: https://go.coder-hub.com/76004085.html