Find max time difference within each year in R

Question

I have a function that calculates the average, minimum and maximum time difference between consecutive observation dates within each year of my dataframe, then combines the yearly results to output the all-time average, minimum and maximum. Each year needs to be handled separately first because my dates only cover April through August. If I didn't group by year, differences would also be computed between August of one year and April of the next. I want to avoid this.

Example dataframe:

    date        NDVI    cloud_cover  field_id
    23/04/2017  0.6494  12           KM60
    23/04/2017  0.5683  0            KM1
    05/05/2017  0.3467  0            KM60
    31/07/2017  0.6743  05           KM60
    31/07/2017  NA      97           KM1
    31/07/2017  0.3456  07           LM27
    01/04/2018  NA      100          KM60
    03/06/2018  0.6743  11           KM60
    03/06/2018  0.2346  12           KM1
    04/05/2019  NA      99           KM60
    05/05/2019  0.5432  20           KM60

NDVI and cloud_cover shouldn't influence the calculations. Different field_ids usually share the same dates, but field_id shouldn't influence the calculations either.
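For anyone who wants to reproduce the table, here is a minimal sketch that rebuilds it. It assumes the raw dates are stored as dd/mm/yyyy text and that cloud_cover is numeric (so "05" becomes 5); neither is confirmed by the table itself.

```r
# Sketch: rebuild the 11 example rows (column types are assumptions)
df <- data.frame(
  date = c("23/04/2017", "23/04/2017", "05/05/2017", "31/07/2017",
           "31/07/2017", "31/07/2017", "01/04/2018", "03/06/2018",
           "03/06/2018", "04/05/2019", "05/05/2019"),
  NDVI = c(0.6494, 0.5683, 0.3467, 0.6743, NA, 0.3456,
           NA, 0.6743, 0.2346, NA, 0.5432),
  cloud_cover = c(12, 0, 0, 5, 97, 7, 100, 11, 12, 99, 20),
  field_id = c("KM60", "KM1", "KM60", "KM60", "KM1", "LM27",
               "KM60", "KM60", "KM1", "KM60", "KM60"),
  stringsAsFactors = FALSE
)

# Parse with an explicit day/month/year format instead of relying on defaults
df$date <- as.Date(df$date, format = "%d/%m/%Y")
```

Passing the format explicitly avoids depending on whichever default formats the date parser tries first.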

This is the current code:

    calculate_time_diff <- function(df) {
      # Convert "date" column to datetime
      df$date <- as.POSIXct(df$date)
      # Group the data by year
      df_calc <- split(df, format(df$date, "%Y"))
      # Calculate time differences between consecutive observations for each year
      time_diffs <- lapply(df_calc, function(group) {
        # Sort dataframe based on "date"
        group <- group[order(group$date), ]
        # Filter out duplicate dates
        group <- group[!duplicated(group$date), ]
        # Calculate time differences between consecutive observations
        diff(group$date)
      })
      # Combine time differences from all years into a single vector
      all_time_diffs <- unlist(time_diffs)
      # Compute average time difference
      avg_time_diff <- mean(all_time_diffs)
      # Calculate smallest and biggest time differences
      smallest_time_diff <- min(all_time_diffs)
      biggest_time_diff <- max(all_time_diffs)
      return(list(avg_time_diff = avg_time_diff,
                  smallest_time_diff = smallest_time_diff,
                  biggest_time_diff = biggest_time_diff))
    }

The output gives me "240" as the maximum time difference, which I know to be unrealistic. My dataframe contains the revisit dates of three satellites, and no two consecutive revisits should be more than about a month apart.

I thought it could have something to do with the way the years are extracted, but this user seems to have used format() successfully, just as I did. lapply() should iterate over each per-year group in the same way as group_by(). So what could be the problem in my script?

Answer 1

Score: 0

Using dplyr:

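    library(dplyr)  # reframe() needs dplyr >= 1.1.0

    data %>%
      distinct(date) %>%                    # remove duplicates
      arrange(date) %>%                     # order by date
      group_by(format(date, "%Y")) %>%      # group by year
      reframe(dateDiff = diff(date)) %>%    # apply 'diff' to every group
      with(list(avg_time_diff = mean(dateDiff),
                smallest_time_diff = min(dateDiff),
                biggest_time_diff = max(dateDiff)))  # create your metrics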

Result:

    $avg_time_diff
    Time difference of 30.02198 days

    $smallest_time_diff
    Time difference of 12 days

    $biggest_time_diff
    Time difference of 51 days

Dummy data (random and unseeded, so the exact numbers differ between runs):

    data <- data.frame(
      date = seq(as.Date("2017-01-01"), by = "month", length.out = 100) +
        sample(0:20, 100, TRUE)
    )
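The same per-year logic can also be sketched in base R. The helper name yearly_gap_summary is hypothetical and not from the original post. It parses the dates with an explicit format and converts each gap to numeric days before pooling. The reason for that step: unlist() returns a plain numeric vector without the difftime units, and diff() on POSIXct values picks its units per call, so gaps from different years are not guaranteed to share a unit. Whether that is what produced the 240 cannot be told from the post alone.

```r
# Hypothetical helper (not from the original post): per-year gaps in days
yearly_gap_summary <- function(dates) {
  dates <- sort(unique(dates))                  # drop duplicate dates, then order
  by_year <- split(dates, format(dates, "%Y"))  # one Date vector per calendar year
  # diff() on Date gives a difftime; force days so every year uses the same unit
  gaps <- lapply(by_year, function(d) as.numeric(diff(d), units = "days"))
  all_gaps <- unlist(gaps, use.names = FALSE)   # plain numeric vector, all in days
  list(avg_time_diff = mean(all_gaps),
       smallest_time_diff = min(all_gaps),
       biggest_time_diff = max(all_gaps))
}

# The dates from the question's example table, assumed to be dd/mm/yyyy text
dates <- as.Date(c("23/04/2017", "23/04/2017", "05/05/2017", "31/07/2017",
                   "31/07/2017", "31/07/2017", "01/04/2018", "03/06/2018",
                   "03/06/2018", "04/05/2019", "05/05/2019"),
                 format = "%d/%m/%Y")

res <- yearly_gap_summary(dates)
```

On these 11 rows the within-year gaps are 12, 87, 63 and 1 days, so res holds an average of 40.75, a minimum of 1 and a maximum of 87. No gap spans two different years.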

huangapple
  • Published on 2023-06-26 18:55:20
  • Source: https://go.coder-hub.com/76556032.html