2023-04-13 17:56:16

Is there a way to avoid a for-loop here?

Question

I have a character variable that stores the digits 0 to 5 in strings of varying length. I want to create one dummy variable per digit (0 to 5) that shows whether the digit occurs in the given row. I can achieve this with:

library(data.table)

dataset <- 
  data.table(
    'char' = c('1 2 3 0', '1 5 0', '1 2 0', '1 0', '1 2 4 0')
  )

for(i in c(0:5)){
  
  dataset[grepl(i, char), c(paste0('Idx_', i)) := 1]
  
}

Resulting in:

     char    Idx_0 Idx_1 Idx_2 Idx_3 Idx_4 Idx_5
1: 1 2 3 0     1     1     1     1    NA    NA
2:   1 5 0     1     1    NA    NA    NA     1
3:   1 2 0     1     1     1    NA    NA    NA
4:     1 0     1     1    NA    NA    NA    NA
5: 1 2 4 0     1     1     1    NA     1    NA

Since my dataset is quite large and I have learned that it is normally a good idea to avoid for-loops, I am curious whether this can be done without one. I tried combinations of .SD, apply and "by = 1:nrow(dataset)", but none of them worked for me.

Answer 1

Score: 4

I would recommend just modifying your current approach slightly to make it faster (for loops are not always bad in R):

for (i in 0:5) {
  set(dataset, grep(i, dataset$char, fixed=TRUE), j=paste0('Idx_', i), 1L)
}

Another alternative:

dataset[, unlist(strsplit(char, " ")), by = .I
        ][, dcast(
              .SD, 
              I ~ paste0("idx_", V1), 
              fun.aggregate = \(x) if (length(x)) 1L else NA_integer_
            )]

#        I idx_0 idx_1 idx_2 idx_3 idx_4 idx_5
#    <int> <int> <int> <int> <int> <int> <int>
# 1:     1     1     1     1     1    NA    NA
# 2:     2     1     1    NA    NA    NA     1
# 3:     3     1     1     1    NA    NA    NA
# 4:     4     1     1    NA    NA    NA    NA
# 5:     5     1     1     1    NA     1    NA

Answer 2

Score: 3

This would be the functional approach:

library(data.table)

dataset <- 
  data.table(
    'char' = c('1 2 3 0', '1 5 0', '1 2 0', '1 0', '1 2 4 0')
  )

dataset[,(paste0('Idx_', 1:5)) := lapply(1:5, \(x) ifelse(grepl(x, char), 1, NA))]

dataset
#>       char Idx_1 Idx_2 Idx_3 Idx_4 Idx_5
#> 1: 1 2 3 0     1     1     1    NA    NA
#> 2:   1 5 0     1    NA    NA    NA     1
#> 3:   1 2 0     1     1    NA    NA    NA
#> 4:     1 0     1    NA    NA    NA    NA
#> 5: 1 2 4 0     1     1    NA     1    NA
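
The question asks for indicators for the digits 0 to 5, while the snippet above iterates over 1:5. A minimal sketch of the same functional pattern extended to 0:5 follows. The helper name `digits`, the use of `fifelse()` in place of `ifelse()`, and the integer column type are choices made for this illustration, not part of the original answer:

```r
library(data.table)

dataset <- data.table(char = c("1 2 3 0", "1 5 0", "1 2 0", "1 0", "1 2 4 0"))

# `digits` is a hypothetical helper holding the values to look for
digits <- 0:5

# One new column per digit: 1L where the digit occurs in `char`, NA otherwise.
# fixed = TRUE treats the digit as a literal substring rather than a regex.
dataset[, (paste0("Idx_", digits)) := lapply(
  digits,
  \(x) fifelse(grepl(x, char, fixed = TRUE), 1L, NA_integer_)
)]

print(dataset)
```

Like the original, this relies on substring matching, so it is only safe while every value is a single digit.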

Answer 3

Score: 1

When the values are multi-digit numbers rather than single digits, grepl would match 1 and 11 alike. To avoid this, we can split on spaces (tstrsplit), reshape wide-to-long (melt), and reshape back long-to-wide (dcast) with fun.aggregate; see the example:

# example data with 11 and 23
d <- data.table(char = c('1 2 3 0', 
                         '11 5 0', 
                         '1 23 0', 
                         '1 0',
                         '1 2 4 0'))

# get the maximum number of columns
colMax <- max(stringr::str_count(d$char, " ")) + 1

d[, paste0("c", seq.int(colMax)) := tstrsplit(char, split = " ", type.convert = TRUE) 
        ][, melt(.SD, id.vars = "char") 
          ][ !is.na(value), dcast(.SD, char ~ value, 
                                  fun.aggregate = \(x){ as.integer(any(!is.na(x))) }) ]

#       char 0 1 2 3 4 5 11 23
# 1:     1 0 1 1 0 0 0 0  0  0
# 2: 1 2 3 0 1 1 1 1 0 0  0  0
# 3: 1 2 4 0 1 1 1 0 1 0  0  0
# 4:  1 23 0 1 1 0 0 0 0  0  1
# 5:  11 5 0 1 0 0 0 0 1  1  0
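
The 1-versus-11 problem described in this answer can also be seen, and sidestepped, with a word-boundary pattern. The following is a rough sketch in base R, not the answerer's method. The names `plain`, `whole`, `values` and `result` are made up for the illustration, and the set of values to look for is assumed to be known in advance:

```r
# Example data containing multi-digit numbers (same strings as above)
char <- c("1 2 3 0", "11 5 0", "1 23 0", "1 0", "1 2 4 0")

# Plain substring matching: "1" is also found inside "11"
plain <- grepl("1", char, fixed = TRUE)

# Word-boundary matching: "1" must stand alone as a whole token
whole <- grepl("\\b1\\b", char, perl = TRUE)

# One 0/1 indicator column per value, using whole-token matching
values <- c(0:5, 11, 23)
idx <- sapply(values, function(v) {
  as.integer(grepl(paste0("\\b", v, "\\b"), char, perl = TRUE))
})
colnames(idx) <- paste0("Idx_", values)
result <- data.frame(char = char, idx)

print(result)
```

Here `plain` is TRUE for every row, while `whole` is FALSE for the row "11 5 0". Unlike the split-and-reshape approach, this variant still scans the strings once per value.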

Answer 4

Score: 0

This is a base R solution. If your data.frame is very large, you can use parallel::parLapply from the parallel package instead of the outer lapply.

# I use a normal data frame instead
dataset <- data.frame('char' = c('1 2 3 0', '1 5 0', '1 2 0', '1 0', '1 2 4 0'))

# reading from the inside out: first split the strings on whitespace and convert
# to numeric (the inner lapply always returns a list, even when every row has
# the same length), then check each of the digits 0:5 against the row, which
# yields one column per digit in the new data frame
do.call(rbind, lapply(lapply(strsplit(dataset[, "char"], "\\s"), as.numeric), \(numbers){
  
  # use %in% to see which of the digits 0:5 occur in the respective row
  seq(0, 5) %in% numbers
  
})) -> matched_res

# fix colnames 
colnames(matched_res) <- paste0("ind_", 0:5)

# bind
cbind(dataset, matched_res)

#      char ind_0 ind_1 ind_2 ind_3 ind_4 ind_5
# 1 1 2 3 0  TRUE  TRUE  TRUE  TRUE FALSE FALSE
# 2   1 5 0  TRUE  TRUE FALSE FALSE FALSE  TRUE
# 3   1 2 0  TRUE  TRUE  TRUE FALSE FALSE FALSE
# 4     1 0  TRUE  TRUE FALSE FALSE FALSE FALSE
# 5 1 2 4 0  TRUE  TRUE  TRUE FALSE  TRUE FALSE
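
The parallel variant mentioned at the top of this answer might look roughly like the sketch below. The worker count of 2 and the names `tokens`, `flags` and `result` are arbitrary choices for the illustration. For a dataset this small, the cost of starting a cluster will most likely outweigh any gain:

```r
library(parallel)

dataset <- data.frame(char = c("1 2 3 0", "1 5 0", "1 2 0", "1 0", "1 2 4 0"))

# Split each string on whitespace and convert the pieces to numbers
tokens <- lapply(strsplit(dataset$char, "\\s"), as.numeric)

# Start a small local cluster (2 workers is an arbitrary choice)
cl <- makeCluster(2)

# Same per-row check as the outer lapply, distributed over the workers
flags <- parLapply(cl, tokens, function(numbers) 0:5 %in% numbers)

stopCluster(cl)

# Stack the per-row logical vectors into a matrix and name the columns
matched_res <- do.call(rbind, flags)
colnames(matched_res) <- paste0("ind_", 0:5)

result <- cbind(dataset, matched_res)
print(result)
```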

Answer 5

Score: -1

A Tidyverse approach just for the record (not trying to compete in terms of speed here...):

library(tidyverse)

df <- tibble('char' = c('1 2 3 0', '1 5 0', '1 2 0', '1 0', '1 2 4 0'))

df |> 
  mutate(row = row_number(), .before = everything()) |> 
  separate_longer_delim(char, delim = " ") |> 
  arrange(char) |> 
  pivot_wider(
    names_from = char, 
    names_prefix = "Idx_",
    values_from = char, 
    values_fn = \(x) 1
  ) |> 
  select(!row) |> 
  mutate(char = df$char, .before = everything())
#> # A tibble: 5 × 7
#>   char    Idx_0 Idx_1 Idx_2 Idx_3 Idx_4 Idx_5
#>   <chr>   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 1 2 3 0     1     1     1     1    NA    NA
#> 2 1 5 0       1     1    NA    NA    NA     1
#> 3 1 2 0       1     1     1    NA    NA    NA
#> 4 1 0         1     1    NA    NA    NA    NA
#> 5 1 2 4 0     1     1     1    NA     1    NA

Created on 2023-04-13 with reprex v2.0.2

