找到R中每年的最大时间差

huangapple go评论75阅读模式
英文:

Find max time difference within each year in R

问题

这是您要翻译的内容:

"I have a function that calculates the average, min and max values for each year in my dataframe, then merges them to output the alltime average, min and max values. Each year needs to be calculated separately first because my dates only refer to the months of April through August. If I didn't group it by year, there would be calculations between August of one year and April of the next year. I want to avoid this.

Example dataframe:

date            NDVI        cloud_cover    field_id

23/04/2017      0.6494          12           KM60        
23/04/2017      0.5683          0            KM1
05/05/2017      0.3467          0            KM60
31/07/2017      0.6743          05           KM60
31/07/2017        NA            97           KM1
31/07/2017      0.3456          07           LM27
01/04/2018        NA            100          KM60
03/06/2018      0.6743          11           KM60
03/06/2018      0.2346          12           KM1
04/05/2019        NA            99           KM60
05/05/2019      0.5432          20           KM60

NDVI and cloud_cover shouldn't influence calculations. Although field_ids most times provide the same dates, this also shouldn't influence them.

This is the current code:

calculate_time_diff <- function(df) {

  # Convert "date" column to datetime
  df$date <- as.POSIXct(df$date)
  
  # Group the data by year
  df_calc <- split(df, format(df$date, "%Y"))
  
  # Calculate time differences between consecutive observations for each year
  time_diffs <- lapply(df_calc, function(group) {
    # Sort dataframe based on "date"
    group <- group[order(group$date), ]
    
    # Filter out duplicate dates
    group <- group[!duplicated(group$date), ]
    
    # Calculate time differences between consecutive observations
    diff(group$date)
  })
  
  # Combine time differences from all years into a single vector
  all_time_diffs <- unlist(time_diffs)
  
  # Compute average time difference
  avg_time_diff <- mean(all_time_diffs)
  
  # Calculate smallest and biggest time differences
  smallest_time_diff <- min(all_time_diffs)
  biggest_time_diff <- max(all_time_diffs)
  
  return(list(avg_time_diff = avg_time_diff,
              smallest_time_diff = smallest_time_diff,
              biggest_time_diff = biggest_time_diff))
}

The output is giving me "240" as max time difference, which I know to be unrealistic. My dataframe refers to the revisit dates of three satellites and none of them should be more than at the very most a month apart.

I thought it could have something to do with the way years are being extracted, but this user seems to have successfully used format() just as I did. lapply() should iterate through each split year group in the same way as group_by(). So what could be the problem in my script?"

英文:

I have a function that calculates the average, min and max values for each year in my dataframe, then merges them to output the alltime average, min and max values. Each year needs to be calculated separately first because my dates only refer to the months of April through August. If I didn't group it by year, there would be calculations between August of one year and April of the next year. I want to avoid this.

Example dataframe:

date            NDVI        cloud_cover    field_id

23/04/2017      0.6494          12           KM60        
23/04/2017      0.5683          0            KM1
05/05/2017      0.3467          0            KM60
31/07/2017      0.6743          05           KM60
31/07/2017        NA            97           KM1
31/07/2017      0.3456          07           LM27
01/04/2018        NA            100          KM60
03/06/2018      0.6743          11           KM60
03/06/2018      0.2346          12           KM1
04/05/2019        NA            99           KM60
05/05/2019      0.5432          20           KM60

NDVI and cloud_cover shouldn't influence calculations. Although field_ids most times provide the same dates, this also shouldn't influence them.

This is the current code:

calculate_time_diff &lt;- function(df) {

  # Convert &quot;date&quot; column to datetime
  df$date &lt;- as.POSIXct(df$date)
  
  # Group the data by year
  df_calc &lt;- split(df, format(df$date, &quot;%Y&quot;))
  
  # Calculate time differences between consecutive observations for each year
  time_diffs &lt;- lapply(df_calc, function(group) {
    # Sort dataframe based on &quot;date&quot;
    group &lt;- group[order(group$date), ]
    
    # Filter out duplicate dates
    group &lt;- group[!duplicated(group$date), ]
    
    # Calculate time differences between consecutive observations
    diff(group$date)
  })
  
  # Combine time differences from all years into a single vector
  all_time_diffs &lt;- unlist(time_diffs)
  
  # Compute average time difference
  avg_time_diff &lt;- mean(all_time_diffs)
  
  # Calculate smallest and biggest time differences
  smallest_time_diff &lt;- min(all_time_diffs)
  biggest_time_diff &lt;- max(all_time_diffs)
  
  return(list(avg_time_diff = avg_time_diff,
              smallest_time_diff = smallest_time_diff,
              biggest_time_diff = biggest_time_diff))
}

The output is giving me "240" as max time difference, which I know to be unrealistic. My dataframe refers to the revisit dates of three satellites and none of them should be more than at the very most a month apart.

I thought it could have something to do with the way years are being extracted, but this user seems to have successfully used format() just as I did. lapply() should iterate through each split year group in the same way as group_by(). So what could be the problem in my script?

答案1

得分: 0

Using dplyr:

data %>%
  distinct(date) %>%
  arrange(date) %>%
  group_by(format(date, "%Y")) %>%
  reframe(dateDiff = diff(date)) %>%
  with(list(avg_time_diff = mean(dateDiff),
            smallest_time_diff = min(dateDiff),
            biggest_time_diff = max(dateDiff)))

Result:

$avg_time_diff
Time difference of 30.02198 days

$smallest_time_diff
Time difference of 12 days

$biggest_time_diff
Time difference of 51 days

Dummy data:

data <- data.frame(date = seq(as.Date("2017-01-01"), by = "month", length.out = 100) + sample(0:20, 100, TRUE))
英文:

Using dplyr:

data %&gt;%
  distinct(date) %&gt;% #remove duplicates
  arrange(date) %&gt;% #order by date
  group_by(format(date, &quot;%Y&quot;)) %&gt;% #group by year
  reframe(dateDiff = diff(date)) %&gt;% #apply &#39;diff&#39; to every group
  with(list(avg_time_diff = mean(dateDiff),
            smallest_time_diff = min(dateDiff),
            biggest_time_diff = max(dateDiff))) #create your metrics

Result:

$avg_time_diff
Time difference of 30.02198 days

$smallest_time_diff
Time difference of 12 days

$biggest_time_diff
Time difference of 51 days

Dummy data:

data &lt;- data.frame(date = seq(as.Date(&quot;2017-01-01&quot;), by = &quot;month&quot;, length.out = 100) + sample(0:20, 100, TRUE))

huangapple
  • 本文由 发表于 2023年6月26日 18:55:20
  • 转载请务必保留本文链接:https://go.coder-hub.com/76556032.html
匿名

发表评论

匿名网友

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen:

确定