过滤具有缺失观测数据的重复数据。

huangapple go评论101阅读模式
英文:

Filter repetitive data with missing observations

问题

I have translated the code part as requested. Here is the translated code:

  1. # 数据
  2. df <- data.frame(id = c(rep(1, 16), rep(2, 14)),
  3. stop=c(5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20,
  4. 5, 6, 7, 8, 10, 11, 12, 13, 14, 15, 16, 17, 19, 20),
  5. time = rnorm(30))
  6. # 提取id
  7. ids <- df$id
  8. # 创建以停靠站为列名,时间为值的数据框
  9. subset_df <- df %>% tidyr::pivot_wider(names_from='stop', values_from = 'time')
  10. # 移除不完整的行和ID列
  11. rm_df <- subset_df[complete.cases(subset_df), ]
  12. # 获取剩余的id
  13. remaining_ids <- rm_df$id
  14. # 使用完整案例子集筛选数据框
  15. complete_cases_df <- df[is.element(df$id, remaining_ids),]

请注意,这是您提供的代码的中文翻译,没有其他额外的内容。

英文:

I have a data frame with roughly 20000 observations. The data consists of recordings from journeys along a single bus route, between stop 5 and 20. That is, ideally it would be a column = stop which would be repeating itself as in 5, 6, 7, ..., 19, 20, 5, 6, ... etc. Unfortunately, there is missing data, so that is not always the case.

When there is missing data, I would like to remove the 3 following recordings from the same route. For instance, let's say I have a sequence as 5, 6, 7, 8, 10, 11, 12, 13, 14, 15, 16, 17, 19, 20, where the recording for stops 9 and 18 are missing. I then want to remove the rows of recording 10, 11, 12 and 19, 20.

Is there a neat way to do this in R?

Currently, I am using tidyr to pivot the data, and remove incomplete observations that way, but that removes complete journeys instead of only the subsequent 3. Below is a mock data frame, my current implementation, and a data frame showing the desired output.

  1. # Data
  2. df &lt;- data.frame(id = c(rep(1, 16), rep(2, 14)),
  3. stop=c(5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20,
  4. 5, 6, 7, 8, 10, 11, 12, 13, 14, 15, 16, 17, 19, 20),
  5. time = rnorm(30))
  6. # Extract the ids
  7. ids &lt;- df$id
  8. # Create a journey based data frame with stops as columns and time as values
  9. subset_df &lt;- df %&gt;% tidyr::pivot_wider(names_from=&#39;stop&#39;, values_from = &#39;time&#39;)
  10. # Remove incomplete &amp; ID column
  11. rm_df &lt;- subset_df[complete.cases(subset_df), ]
  12. # Get the remaining ids
  13. remaining_ids &lt;- rm_df$id
  14. # Subset the data frame with the complete cases
  15. complete_cases_df &lt;- df[is.element(df$id, remaining_ids),]

Output I would like to get:

  1. df2 &lt;- data.frame(id = c(rep(1, 16), rep(2, 9)),
  2. stop=c(5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20,
  3. 5, 6, 7, 8, 13, 14, 15, 16, 17),
  4. time = rnorm(25))

It would be nice with flexibility in deciding how many subsequent stops are removed.

答案1

得分: 1

这是您提供的代码的翻译结果:

  1. subset(df, !ave(stop, id, FUN = \(x)x %in%
  2. unlist(lapply(x[diff(c(4,x))>1], \(y)seq(y,by=1, length=3)))))

请注意,这只是代码的翻译,不包括任何其他内容。

英文:
  1. subset(df, !ave(stop, id, FUN = \(x)x %in%
  2. unlist(lapply(x[diff(c(4,x))&gt;1], \(y)seq(y,by=1, length=3)))))
  3. id stop time
  4. 1 1 5 0.3374932
  5. 2 1 6 0.2759855
  6. 3 1 7 -1.8097232
  7. 4 1 8 -0.3148637
  8. 5 1 9 0.1555224
  9. 6 1 10 -0.7753229
  10. 7 1 11 -2.3723279
  11. 8 1 12 -0.7845203
  12. 9 1 13 0.1387263
  13. 10 1 14 1.1845789
  14. 11 1 15 1.2969936
  15. 12 1 16 1.1421506
  16. 13 1 17 -2.5839955
  17. 14 1 18 -0.8345220
  18. 15 1 19 0.6228960
  19. 16 1 20 -1.4952315
  20. 17 2 5 -2.5591447
  21. 18 2 6 1.2018367
  22. 19 2 7 1.9759780
  23. 20 2 8 -0.4729620
  24. 24 2 13 -0.4501262
  25. 25 2 14 1.8839690
  26. 26 2 15 0.3628503
  27. 27 2 16 -2.0620141
  28. 28 2 17 -0.1574073

答案2

得分: 1

  1. 我们可以采用以下方法:
  2. 通过使用 arrange(id, stop) 来按正确的顺序排列。
  3. 然后,我们创建三列 gap、gap1、gap2,始终滞后一行。
  4. 这意味着在缺失数字后的 3 行中,我们现在有 TRUE。
  5. 通过将它们筛选出来,我们就得到了结果。
英文:

One way could be:

  1. We bring in the correct order with arrange(id, stop)
  2. The we create three columns gap, gap1, gap2, always lagged one row
  3. This means 3 rows after the lacking number we now have TRUE
  4. By filtering them out we get our result.
  1. library(dplyr)
  2. df %&gt;%
  3. arrange(id, stop) %&gt;%
  4. mutate(gap = stop - lag(stop, default = first(stop)) &gt; 1, .by = id) %&gt;%
  5. mutate(gap1 = lag(gap, default = gap[1]),
  6. gap2 = lag(gap1, default = gap1[1])) %&gt;%
  7. filter(!if_any(starts_with(&quot;gap&quot;), ~. == TRUE)) %&gt;%
  8. select(-starts_with(&quot;gap&quot;)) %&gt;%
  9. as.data.frame()
  1. id stop time
  2. 1 1 5 -0.22336881
  3. 2 1 6 0.36364387
  4. 3 1 7 -1.83544641
  5. 4 1 8 0.01814961
  6. 5 1 9 -1.30122669
  7. 6 1 10 0.84105979
  8. 7 1 11 1.65267900
  9. 8 1 12 -0.60058024
  10. 9 1 13 -0.08852824
  11. 10 1 14 -0.06700669
  12. 11 1 15 -0.83830797
  13. 12 1 16 -1.76628608
  14. 13 1 17 -1.04904130
  15. 14 1 18 -1.35796795
  16. 15 1 19 -0.12457266
  17. 16 1 20 -0.66617671
  18. 17 2 5 -0.44691032
  19. 18 2 6 -0.80684910
  20. 19 2 7 -0.58688365
  21. 20 2 8 0.69196557
  22. 21 2 13 -0.58865687
  23. 22 2 14 1.14572281
  24. 23 2 15 1.32314604
  25. 24 2 16 -1.17715578
  26. 25 2 17 -0.86947491

huangapple
  • 本文由 发表于 2023年5月17日 06:26:59
  • 转载请务必保留本文链接:https://go.coder-hub.com/76267450.html
匿名

发表评论

匿名网友

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen:

确定