Loop function on timeseries works on small df, but not in large df – Error: C stack usage…too close to the limit

Question

I have a dataframe with dates/times (time series), site (grouping var) and value. I have identified the start times of different 'surges' - defined as changes in values of >=2 in 15 mins. For each surge time, I am trying to find the date/time where the value falls back down to (or below) the start of the surge (i.e., the end of the surge).

I can achieve this by using a recursive loop function ('find.next.smaller' from this question - https://stackoverflow.com/questions/38207584/in-a-dataframe-find-the-index-of-the-next-smaller-value-for-each-element-of-a-c). This works perfectly on a smaller dataframe but not on a large one. I get the error message "Error: C stack usage 15925584 is too close to the limit." Having seen other similar questions (e.g., https://stackoverflow.com/questions/14719349/error-c-stack-usage-is-too-close-to-the-limit), I do not think it's a problem of an infinite recursive function but a memory issue. But I do not know how to use shell (or powershell) to do this. I wondered whether there was any other way? Either through adapting my memory or the function below?

Some example code:

    ### df formatting
    library(dplyr)

    df <- data.frame("Date_time" = seq(from = as.POSIXct("2022-01-01 00:00"), by = 15*60, to = as.POSIXct("2022-01-01 07:00")),
                     "Site" = rep(c("Site A", "Site B"), each = 29),
                     "Value" = c(10,10.1,10.2,10.3,12.5,14.8,12.4,11.3,10.3,10.1,10.2,10.5,10.4,10.3,14.7,10.1,
                                 16.7,16.3,16.4,14.2,10.2,10.1,10.3,10.2,11.7,13.2,13.2,11.1,11.4,
                                 rep(10.3, times = 29)))
    df <- df %>% group_by(Site) %>% mutate(Lead_Value = lead(Value))
    df$Surge_start <- NA
    df[which(df$Lead_Value - df$Value >= 2), "Surge_start"] <-
      paste("Surge", seq(1, length(which(df$Lead_Value - df$Value >= 2)), 1), sep = "")

    ### Applying the 'find.next.smaller' function
    find.next.smaller <- function(ini = 1, vec) {
      if(length(vec) == 1) NA
      else c(ini + min(which(vec[1] >= vec[-1])),
             find.next.smaller(ini + 1, vec[-1]))
    } # the recursive function will go element by element through the vector and find out
      # the index of the next smaller value.

    df$Date_time <- as.character(df$Date_time)
    Output <- df %>% group_by(Site) %>% mutate(Surge_end = ifelse(grepl("Surge", Surge_start), Date_time[find.next.smaller(1, Value)], NA))
    ### This works fine

    df2 <- do.call("rbind", replicate(1000, df, simplify = FALSE))
    Output2 <- df2 %>% group_by(Site) %>% mutate(Surge_end = ifelse(grepl("Surge", Surge_start), Date_time[find.next.smaller(1, Value)], NA))
    #### This does not work

Answer 1

Score: 1

I suggest you don't need recursion.

```r
find_nearest_value <- function(surge, time1, val1, times, vals) {
  # For a surge row, return the first later time at which the value is back
  # at or below the value at the start of the surge; otherwise NA
  if (!grepl("Surge", surge)) NA else times[times > time1 & vals <= val1][1]
}

Output %>%
  group_by(Site) %>%
  mutate(end2 = if_else(grepl("Surge", Surge_start), mapply(find_nearest_value, Surge_start, Date_time, Value, list(Date_time), list(Value)), NA)) %>%
  print(n=99)
# # A tibble: 58 × 7
# # Groups:   Site [2]
#    Date_time           Site   Value Lead_Value Surge_start Surge_end           end2
#    <chr>               <chr>  <dbl>      <dbl> <chr>       <chr>               <chr>
#  1 2022-01-01 00:00:00 Site A    10       10.1 NA          NA                  NA
#  2 2022-01-01 00:15:00 Site A  10.1       10.2 NA          NA                  NA
#  3 2022-01-01 00:30:00 Site A  10.2       10.3 NA          NA                  NA
#  4 2022-01-01 00:45:00 Site A  10.3       12.5 Surge1      2022-01-01 02:00:00 2022-01-01 02:00:00
#  5 2022-01-01 01:00:00 Site A  12.5       14.8 Surge2      2022-01-01 01:30:00 2022-01-01 01:30:00
#  6 2022-01-01 01:15:00 Site A  14.8       12.4 NA          NA                  NA
#  7 2022-01-01 01:30:00 Site A  12.4       11.3 NA          NA                  NA
#  8 2022-01-01 01:45:00 Site A  11.3       10.3 NA          NA                  NA
#  9 2022-01-01 02:00:00 Site A  10.3       10.1 NA          NA                  NA
# 10 2022-01-01 02:15:00 Site A  10.1       10.2 NA          NA                  NA
# 11 2022-01-01 02:30:00 Site A  10.2       10.5 NA          NA                  NA
# 12 2022-01-01 02:45:00 Site A  10.5       10.4 NA          NA                  NA
# 13 2022-01-01 03:00:00 Site A  10.4       10.3 NA          NA                  NA
# 14 2022-01-01 03:15:00 Site A  10.3       14.7 Surge3      2022-01-01 03:45:00 2022-01-01 03:45:00
# 15 2022-01-01 03:30:00 Site A  14.7       10.1 NA          NA                  NA
# 16 2022-01-01 03:45:00 Site A  10.1       16.7 Surge4      2022-01-01 05:15:00 2022-01-01 05:15:00
# 17 2022-01-01 04:00:00 Site A  16.7       16.3 NA          NA                  NA
# 18 2022-01-01 04:15:00 Site A  16.3       16.4 NA          NA                  NA
# 19 2022-01-01 04:30:00 Site A  16.4       14.2 NA          NA                  NA
# 20 2022-01-01 04:45:00 Site A  14.2       10.2 NA          NA                  NA
# 21 2022-01-01 05:00:00 Site A  10.2       10.1 NA          NA                  NA
# 22 2022-01-01 05:15:00 Site A  10.1       10.3 NA          NA                  NA
# 23 2022-01-01 05:30:00 Site A  10.3       10.2 NA          NA                  NA
# 24 2022-01-01 05:45:00 Site A  10.2       11.7 NA          NA                  NA
# 25 2022-01-01 06:00:00 Site A  11.7       13.2 NA          NA                  NA
# 26 2022-01-01 06:15:00 Site A  13.2       13.2 NA          NA                  NA
# 27 2022-01-01 06:30:00 Site A  13.2       11.1 NA          NA                  NA
# 28 2022-01-01 06:45:00 Site A  11.1       11.4 NA          NA                  NA
# 29 2022-01-01 07:00:00 Site A  11.4         NA NA          NA                  NA
# 30 2022-01-01 00:00:00 Site B  10.3       10.3 NA          NA                  NA
# 31 2022-01-01 00:15:00 Site B  10.3       10.3 NA          NA                  NA
# 32 2022-01-01 00:30:00 Site B  10.3       10.3 NA          NA                  NA
# 33 2022-01-01 00:45:00 Site B  10.3       10.3 NA          NA                  NA
# 34 2022-01-01 01:00:00 Site B  10.3       10.3 NA          NA                  NA
# 35 2022-01-01 01:15:00 Site B  10.3       10.3 NA          NA                  NA
# 36 2022-01-01 01:30:00 Site B  10.3       10.3 NA          NA                  NA
# 37 2022-01-01 01:45:00 Site B  10.3       10.3 NA          NA                  NA
# 38 2022-01-01 02:00:00 Site B  10.3       10.3 NA          NA                  NA
# 39 2022-01-01 02:15:00 Site B  10.3       10.3 NA          NA                  NA
# 40 2022-01-01 02:30:00 Site B  10.3       10.3 NA          NA                  NA
# 41 2022-01-01 02:45:00 Site B  10.3       10.3 NA          NA                  NA
# 42 2022-01-01 03:00:00 Site B  10.3       10.3 NA          NA                  NA
# 43 2022-01-01 03:15:00 Site B  10.3       10.3 NA          NA                  NA
# 44 2022-01-01 03:30:00 Site B  10.3       10.3 NA          NA                  NA
# 45 2022-01-01 03:45:00 Site B  10.3       10.3 NA          NA                  NA
# 46 2022-01-01 04:00:00 Site B  10.3       10.3 NA          NA                  NA
# 47 2022-01-01 04:15:00 Site B  10.3       10.3 NA          NA                  NA
# 48 2022-01-01 04:30:00 Site B  10.3       10.3 NA          NA                  NA
# 49 2022-01-01 04:45:00 Site B  10.3       10.3 NA          NA                  NA
# 50 2022-01-01 05:00:00 Site B  10.3       10.3 NA          NA                  NA
# 51 2022-01-01 05:15:00 Site B  10.3       10.3 NA          NA                  NA
# 52 2022-01-01 05:30:00 Site B  10.3       10.3 NA          NA                  NA
# 53 2022-01-01 05:45:00 Site B  10.3       10.3 NA          NA                  NA
# 54 2022-01-01 06:00:00 Site B  10.3       10.3 NA          NA                  NA
# 55 2022-01-01 06:15:00 Site B  10.3       10.3 NA          NA                  NA
# 56 2022-01-01 06:30:00 Site B  10.3       10.3 NA          NA                  NA
# 57 2022-01-01 06:45:00 Site B  10.3       10.3 NA          NA                  NA
# 58 2022-01-01 07:00:00 Site B  10.3         NA NA          NA                  NA
```

Answer 2

Score: 1

Possibly the recursion uses too much memory, and you're probably better off with a vectorized/looped approach, even if it takes a bit longer. Below I altered your function and created some options.

Some options

Original:

    find.next.smaller_rec <- function(ini = 1, vec) {
      if(length(vec) == 1) NA
      else c(ini + min(which(vec[1] >= vec[-1])),
             find.next.smaller_rec(ini + 1, vec[-1]))
    }

The building block for the vectorized ones:

    find.next.smaller <- function(val, vec) {
      if(val == length(vec)) NA else val + min(which(vec[val] >= vec[-(1:val)]))
    }
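
To make the building block concrete, here is a small worked example on a made-up vector (`vec` below is illustrative, not data from the question). It also covers an edge case: `min(which(...))` returns `Inf` with a warning when no later value is small enough. `find.next.smaller_na` is a hypothetical variant, not part of the answer, that returns `NA` in that case instead.

```r
# Building block from the answer, repeated so the example is self-contained
find.next.smaller <- function(val, vec) {
  if (val == length(vec)) NA else val + min(which(vec[val] >= vec[-(1:val)]))
}

vec <- c(3, 5, 4, 2, 6)       # illustrative data

find.next.smaller(1, vec)     # 4: vec[4] = 2 is the first later value <= 3
find.next.smaller(2, vec)     # 3: vec[3] = 4 is the first later value <= 5

# Hypothetical variant: which(...)[1] is NA when nothing matches, so no
# warning is raised and the last element needs no special case
find.next.smaller_na <- function(val, vec) {
  val + which(vec[val] >= vec[-seq_len(val)])[1]
}

find.next.smaller_na(4, vec)  # NA: no later value is <= 2
find.next.smaller_na(5, vec)  # NA: last element
```

Indexing `Date_time` with either `Inf` or `NA` yields `NA`, so both versions give the same `Surge_end`; the variant only avoids the warning.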

With a for loop:

    find.next.smaller_for <- function(x, vec){
      result <- numeric(x)
      for(val in 1:x){
        result[val] <- find.next.smaller(val, vec)
      }
      result
    }

With Vectorize():

    find.next.smaller_vec <- Vectorize(find.next.smaller, "val")

With purrr::map:

    library(purrr)  # provides map_dbl()

    find.next.smaller_map <- function(x, vec){
      map_dbl(1:x, ~ find.next.smaller(val = .x, vec = vec))
    }

Comparison:

    bench <- bench::mark(find.next.smaller_rec(1, df$Value),
                         find.next.smaller_for(nrow(df), df$Value),
                         find.next.smaller_vec(1:nrow(df), df$Value),
                         find.next.smaller_map(nrow(df), df$Value),
                         min_time = 2)
    bench %>% select(c(median, mem_alloc, n_gc, `gc/sec`))
    #     median mem_alloc  n_gc `gc/sec`
    #   <bch:tm> <bch:byt> <dbl>    <dbl>
    # 1    496µs    92.4KB    13     7.30
    # 2    582µs    77.1KB    10     5.46
    # 3    612µs    78.7KB    10     5.97
    # 4    681µs    77.1KB    10     5.40

We can see that, even if it's faster, the recursion uses more memory, and this might be the reason for your error.
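
One caveat when reading the benchmark: the error message refers to R's C stack, which fills up with the depth of nested calls, so the nesting depth is the more direct thing to look at. The sketch below uses a hypothetical instrumented copy of the recursive function (`recursion_depth`, not part of the answer) to show that it nests once per element. For `df2` in the question, that would be 29,000 levels per site. `Cstack_info()` is base R and reports the stack size and current usage on your machine.

```r
# Hypothetical instrumented copy of the recursive function: it drops the
# first element on every call, like find.next.smaller_rec, but returns the
# deepest nesting level reached instead of the indices
recursion_depth <- function(vec, depth = 1L) {
  if (length(vec) == 1) depth
  else recursion_depth(vec[-1], depth + 1L)
}

recursion_depth(1:29)  # 29: one level per element of a 29-row group

# Size and current usage of the C stack in this session (values vary by system)
Cstack_info()
```

The loop-based versions call the building block from a constant depth, so the nesting does not grow with the number of rows.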

There are probably even better options; I just wanted to present ones that were similar to your original.

Applying them to the problem

    Output <- df %>%
      group_by(Site) %>%
      mutate(Surge_end = ifelse(grepl("Surge", Surge_start),
                                Date_time[find.next.smaller_for(n(), Value)],
                                NA_character_))

You can also use Date_time[find.next.smaller_map(n(), Value)] or Date_time[find.next.smaller_vec(1:n(), Value)] in place of the for-loop version.
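
If the loop versions prove slow on long series (each row scans the rest of its group, so the work grows roughly quadratically), a single-pass alternative is to keep a stack of indices that are still waiting for their end point. This is a different technique from the ones above, sketched under the assumptions that the values contain no `NA` and that "next smaller" means the first later value less than or equal to the current one. `next_smaller_or_equal` is a hypothetical name.

```r
# Hypothetical helper: for each position i, return the index of the first
# later element that is <= vec[i], or NA if there is none.
# Assumes vec contains no NA values.
next_smaller_or_equal <- function(vec) {
  n <- length(vec)
  result <- rep(NA_integer_, n)
  stack <- integer(n)  # indices still waiting for a smaller-or-equal value
  top <- 0L            # number of indices currently on the stack
  for (i in seq_len(n)) {
    # Every waiting index whose value is >= vec[i] is resolved by position i
    while (top > 0L && vec[stack[top]] >= vec[i]) {
      result[stack[top]] <- i
      top <- top - 1L
    }
    top <- top + 1L
    stack[top] <- i
  }
  result
}

vec <- c(3, 5, 4, 2, 6)                     # illustrative data
next_smaller_or_equal(vec)                  # 4 3 4 NA NA

# No recursion is involved, so long inputs do not deepen the call stack
big <- next_smaller_or_equal(rep(vec, 10000))
length(big)                                 # 50000
```

Inside the pipeline this would stand in for the helper call, e.g. Date_time[next_smaller_or_equal(Value)], with the same ifelse() wrapper as above.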

huangapple
  • Posted on May 25, 2023, 19:22:31
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76331735.html