Loop function on timeseries works on small df, but not in large df - Error: C stack usage...too close to the limit
Question
I have a dataframe with dates/times (time series), site (grouping var) and value. I have identified the start times of different 'surges' - defined as changes in values of >=2 in 15 mins. For each surge time, I am trying to find the date/time where the value falls back down to (or below) the start of the surge (i.e., the end of the surge).
I can achieve this by using a recursive loop function ('find.next.smaller' from this question - https://stackoverflow.com/questions/38207584/in-a-dataframe-find-the-index-of-the-next-smaller-value-for-each-element-of-a-c). This works perfectly on a smaller dataframe but not on a large one. I get the error message "Error: C stack usage 15925584 is too close to the limit." Having seen other similar questions (e.g., https://stackoverflow.com/questions/14719349/error-c-stack-usage-is-too-close-to-the-limit), I do not think it's a problem of an infinite recursive function but a memory issue. But I do not know how to use shell (or powershell) to do this. I wondered whether there was any other way? Either through adapting my memory or the function below?
Some example code:
###df formatting
library(dplyr)
df <- data.frame("Date_time" = seq(from=as.POSIXct("2022-01-01 00:00"), by= 15*60, to=as.POSIXct("2022-01-01 07:00")),
"Site" = rep(c("Site A", "Site B"), each = 29),
"Value" = c(10,10.1,10.2,10.3,12.5,14.8,12.4,11.3,10.3,10.1,10.2,10.5,10.4,10.3,14.7,10.1,
16.7,16.3,16.4,14.2,10.2,10.1,10.3,10.2,11.7,13.2,13.2,11.1,11.4,
rep(10.3, times=29)))
df <- df %>% group_by(Site) %>% mutate(Lead_Value = lead(Value))
df$Surge_start <- NA
df[which(df$Lead_Value - df$Value >= 2),"Surge_start"] <-
paste("Surge", seq(1,length(which(df$Lead_Value - df$Value >= 2)),1), sep="")
###Applying the 'find.next.smaller' function
find.next.smaller <- function(ini = 1, vec) {
if(length(vec) == 1) NA
else c(ini + min(which(vec[1] >= vec[-1])),
find.next.smaller(ini + 1, vec[-1]))
} # the recursive function will go element by element through the vector and find out
# the index of the next smaller value.
df$Date_time <- as.character(df$Date_time)
Output <- df %>% group_by(Site) %>% mutate(Surge_end = ifelse(grepl("Surge",Surge_start),Date_time[find.next.smaller(1, Value)],NA))
###This works fine
df2 <- do.call("rbind", replicate(1000, df, simplify = FALSE))
Output2 <- df2 %>% group_by(Site) %>% mutate(Surge_end = ifelse(grepl("Surge",Surge_start),Date_time[find.next.smaller(1, Value)],NA))
####This does not work
Answer 1
Score: 1
I suggest you don't need recursion.
find_nearest_value <- function(surge, time1, val1, times, vals) {
if (!grepl("Surge", surge)) NA else times[times > time1 & vals <= val1][1]
}
Output %>%
group_by(Site) %>%
mutate(end2 = if_else(grepl("Surge", Surge_start), mapply(find_nearest_value, Surge_start, Date_time, Value, list(Date_time), list(Value)), NA)) %>%
print(n=99)
# # A tibble: 58 × 7
# # Groups: Site [2]
# Date_time Site Value Lead_Value Surge_start Surge_end end2
# <chr> <chr> <dbl> <dbl> <chr> <chr> <chr>
# 1 2022-01-01 00:00:00 Site A 10 10.1 NA NA NA
# 2 2022-01-01 00:15:00 Site A 10.1 10.2 NA NA NA
# 3 2022-01-01 00:30:00 Site A 10.2 10.3 NA NA NA
# 4 2022-01-01 00:45:00 Site A 10.3 12.5 Surge1 2022-01-01 02:00:00 2022-01-01 02:00:00
# 5 2022-01-01 01:00:00 Site A 12.5 14.8 Surge2 2022-01-01 01:30:00 2022-01-01 01:30:00
# 6 2022-01-01 01:15:00 Site A 14.8 12.4 NA NA NA
# 7 2022-01-01 01:30:00 Site A 12.4 11.3 NA NA NA
# 8 2022-01-01 01:45:00 Site A 11.3 10.3 NA NA NA
# 9 2022-01-01 02:00:00 Site A 10.3 10.1 NA NA NA
# 10 2022-01-01 02:15:00 Site A 10.1 10.2 NA NA NA
# 11 2022-01-01 02:30:00 Site A 10.2 10.5 NA NA NA
# 12 2022-01-01 02:45:00 Site A 10.5 10.4 NA NA NA
# 13 2022-01-01 03:00:00 Site A 10.4 10.3 NA NA NA
# 14 2022-01-01 03:15:00 Site A 10.3 14.7 Surge3 2022-01-01 03:45:00 2022-01-01 03:45:00
# 15 2022-01-01 03:30:00 Site A 14.7 10.1 NA NA NA
# 16 2022-01-01 03:45:00 Site A 10.1 16.7 Surge4 2022-01-01 05:15:00 2022-01-01 05:15:00
# 17 2022-01-01 04:00:00 Site A 16.7 16.3 NA NA NA
# 18 2022-01-01 04:15:00 Site A 16.3 16.4 NA NA NA
# 19 2022-01-01 04:30:00 Site A 16.4 14.2 NA NA NA
# 20 2022-01-01 04:45:00 Site A 14.2 10.2 NA NA NA
# 21 2022-01-01 05:00:00 Site A 10.2 10.1 NA NA NA
# 22 2022-01-01 05:15:00 Site A 10.1 10.3 NA NA NA
# 23 2022-01-01 05:30:00 Site A 10.3 10.2 NA NA NA
# 24 2022-01-01 05:45:00 Site A 10.2 11.7 NA NA NA
# 25 2022-01-01 06:00:00 Site A 11.7 13.2 NA NA NA
# 26 2022-01-01 06:15:00 Site A 13.2 13.2 NA NA NA
# 27 2022-01-01 06:30:00 Site A 13.2 11.1 NA NA NA
# 28 2022-01-01 06:45:00 Site A 11.1 11.4 NA NA NA
# 29 2022-01-01 07:00:00 Site A 11.4 NA NA NA NA
# 30 2022-01-01 00:00:00 Site B 10.3 10.3 NA NA NA
# 31 2022-01-01 00:15:00 Site B 10.3 10.3 NA NA NA
# 32 2022-01-01 00:30:00 Site B 10.3 10.3 NA NA NA
# 33 2022-01-01 00:45:00 Site B 10.3 10.3 NA NA NA
# 34 2022-01-01 01:00:00 Site B 10.3 10.3 NA NA NA
# 35 2022-01-01 01:15:00 Site B 10.3 10.3 NA NA NA
# 36 2022-01-01 01:30:00 Site B 10.3 10.3 NA NA NA
# 37 2022-01-01 01:45:00 Site B 10.3 10.3 NA NA NA
# 38 2022-01-01 02:00:00 Site B 10.3 10.3 NA NA NA
# 39 2022-01-01 02:15:00 Site B 10.3 10.3 NA NA NA
# 40 2022-01-01 02:30:00 Site B 10.3 10.3 NA NA NA
# 41 2022-01-01 02:45:00 Site B 10.3 10.3 NA NA NA
# 42 2022-01-01 03:00:00 Site B 10.3 10.3 NA NA NA
# 43 2022-01-01 03:15:00 Site B 10.3 10.3 NA NA NA
# 44 2022-01-01 03:30:00 Site B 10.3 10.3 NA NA NA
# 45 2022-01-01 03:45:00 Site B 10.3 10.3 NA NA NA
# 46 2022-01-01 04:00:00 Site B 10.3 10.3 NA NA NA
# 47 2022-01-01 04:15:00 Site B 10.3 10.3 NA NA NA
# 48 2022-01-01 04:30:00 Site B 10.3 10.3 NA NA NA
# 49 2022-01-01 04:45:00 Site B 10.3 10.3 NA NA NA
# 50 2022-01-01 05:00:00 Site B 10.3 10.3 NA NA NA
# 51 2022-01-01 05:15:00 Site B 10.3 10.3 NA NA NA
# 52 2022-01-01 05:30:00 Site B 10.3 10.3 NA NA NA
# 53 2022-01-01 05:45:00 Site B 10.3 10.3 NA NA NA
# 54 2022-01-01 06:00:00 Site B 10.3 10.3 NA NA NA
# 55 2022-01-01 06:15:00 Site B 10.3 10.3 NA NA NA
# 56 2022-01-01 06:30:00 Site B 10.3 10.3 NA NA NA
# 57 2022-01-01 06:45:00 Site B 10.3 10.3 NA NA NA
# 58 2022-01-01 07:00:00 Site B 10.3 NA NA NA NA
Answer 2
Score: 1
Possibly the recursion uses too much memory, and you're probably better off with a vectorized/looped approach, even if it takes a bit longer. Below I made an alteration to your function and created some options.
Some options
Original:
find.next.smaller_rec <- function(ini = 1, vec) {
if(length(vec) == 1) NA
else c(ini + min(which(vec[1] >= vec[-1])),
find.next.smaller_rec(ini + 1, vec[-1]))
}
The building block for the vectorized ones:
find.next.smaller <- function(val, vec) {
if(val == length(vec)) NA else val + min(which(vec[val] >= vec[-(1:val)]))
}
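To make the index arithmetic in the building block concrete, here is a small worked sketch on a made-up vector (the values and the names `tail_part` and `offset` are illustrative assumptions, not part of the original answer). `vec[-(1:val)]` drops the first `val` elements, so the position returned by `which()` is relative to the remainder and has to be shifted by `val` to get back to an index into `vec`.

```r
# Illustrative values only; not taken from the benchmark above.
vec <- c(10.3, 12.5, 14.8, 12.4, 11.3, 10.3)
val <- 2                                      # look for the end of the element at position 2 (12.5)

tail_part <- vec[-(1:val)]                    # 14.8 12.4 11.3 10.3
offset <- min(which(vec[val] >= tail_part))   # first position in tail_part that is <= 12.5
next_index <- val + offset                    # shift back to an index into vec

next_index       # 4
vec[next_index]  # 12.4
```

Note that when no later element qualifies, `which()` returns an empty vector and `min()` then returns `Inf` with a warning, so it may be worth guarding that case explicitly.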
With a for loop:
find.next.smaller_for <- function(x, vec){
result <- numeric(x)
for(val in 1:x){
result[val] <- find.next.smaller(val, vec)
}
result
}
With Vectorize():
find.next.smaller_vec <- Vectorize(find.next.smaller, "val")
With purrr::map_dbl():
library(purrr)  # provides map_dbl()
find.next.smaller_map <- function(x, vec){
map_dbl(1:x, ~ find.next.smaller(val = .x, vec = vec))
}
Comparison:
bench <- bench::mark(find.next.smaller_rec(1, df$Value),
find.next.smaller_for(nrow(df), df$Value),
find.next.smaller_vec(1:nrow(df), df$Value),
find.next.smaller_map(nrow(df), df$Value),
min_time = 2)
bench %>% select(c(median, mem_alloc, n_gc, `gc/sec`))
median mem_alloc n_gc `gc/sec`
<bch:tm> <bch:byt> <dbl> <dbl>
1 496µs 92.4KB 13 7.30
2 582µs 77.1KB 10 5.46
3 612µs 78.7KB 10 5.97
4 681µs 77.1KB 10 5.40
We can see that, even if it's faster, the recursion uses more memory, and this might be the reason for your error.
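One thing that may be worth separating here: `mem_alloc` reports heap allocations, whereas the error message is about the C stack, which is consumed by nested function calls. The sketch below uses a hypothetical instrumented copy of the recursive function (the name `max_depth_reached` is an assumption for illustration) that returns the deepest nesting level instead of the indices. It suggests that the nesting depth equals the length of the vector, so a group with tens of thousands of rows needs tens of thousands of nested calls.

```r
# Hypothetical helper: same recursion pattern as find.next.smaller_rec
# (one nested call per element), but it only reports how deep it went.
max_depth_reached <- function(vec, depth = 1) {
  if (length(vec) == 1) depth
  else max_depth_reached(vec[-1], depth + 1)
}

depth_small <- max_depth_reached(runif(29))    # 29, one level per element
depth_larger <- max_depth_reached(runif(200))  # 200
```

The loop-based variants call `find.next.smaller()` once per element and return before the next call starts, so their nesting depth stays constant regardless of the vector length.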
There are probably even better options; I just wanted to present ones that are similar to your original.
Applying them to the problem
Output <- df %>%
group_by(Site) %>%
mutate(Surge_end = ifelse(grepl("Surge",Surge_start),
Date_time[find.next.smaller_for(n(), Value)],
NA_character_))
You can also use Date_time[find.next.smaller_map(n(), Value)] or Date_time[find.next.smaller_vec(1:n(), Value)].
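For very long groups, the per-element search above still rescans the rest of the vector for every row. As a possible alternative that is not from either answer, here is a minimal sketch of a single-pass version that keeps an explicit stack of indices still waiting for a value at or below their own. The function name `next_smaller_or_equal` is hypothetical, and the sketch assumes `vec` contains no `NA` values.

```r
# Hypothetical alternative: for each position, the index of the next element
# that is <= the current one, or NA if there is none. Single pass, no recursion.
next_smaller_or_equal <- function(vec) {
  n <- length(vec)
  result <- rep(NA_integer_, n)
  stack <- integer(n)   # preallocated stack of indices still waiting for an end
  top <- 0L             # number of indices currently on the stack
  for (j in seq_len(n)) {
    # vec[j] ends every pending index whose value is >= vec[j]
    while (top > 0L && vec[stack[top]] >= vec[j]) {
      result[stack[top]] <- j
      top <- top - 1L
    }
    top <- top + 1L
    stack[top] <- j
  }
  result
}

vals <- c(10.3, 12.5, 14.8, 12.4, 11.3, 10.3, 10.1)
ends <- next_smaller_or_equal(vals)
ends  # 6 4 4 5 6 7 NA
```

Inside the grouped `mutate()` this could be used as `Date_time[next_smaller_or_equal(Value)]`, in the same place as the other variants.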