过滤具有缺失观测数据的重复数据。

huangapple go评论58阅读模式
英文:

Filter repetitive data with missing observations

问题

I have translated the code part as requested. Here is the translated code:

# 数据
df <- data.frame(id = c(rep(1, 16), rep(2, 14)), 
                 stop=c(5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 
                         5, 6, 7, 8, 10, 11, 12, 13, 14, 15, 16, 17, 19, 20), 
                 time = rnorm(30))
# 提取id
ids <- df$id
# 创建以停靠站为列名,时间为值的数据框
subset_df <- df %>% tidyr::pivot_wider(names_from='stop', values_from = 'time')

# 移除不完整的行和ID列
rm_df <- subset_df[complete.cases(subset_df), ]

# 获取剩余的id
remaining_ids <- rm_df$id

# 使用完整案例子集筛选数据框
complete_cases_df <- df[is.element(df$id, remaining_ids),]

请注意,这是您提供的代码的中文翻译,没有其他额外的内容。

英文:

I have a data frame with roughly 20000 observations. The data consists of recordings from journeys along a single bus route, between stop 5 and 20. That is, ideally it would be a column = stop which would be repeating itself as in 5, 6, 7, ..., 19, 20, 5, 6, ... etc. Unfortunately, there is missing data, so that is not always the case.

When there is missing data, I would like to remove the 3 following recordings from the same route. For instance, let's say I have a sequence as 5, 6, 7, 8, 10, 11, 12, 13, 14, 15, 16, 17, 19, 20, where the recording for stops 9 and 18 are missing. I then want to remove the rows of recording 10, 11, 12 and 19, 20.

Is there a neat way to do this in R?

Currently, I am using tidyr to pivot the data, and remove incomplete observations that way, but that removes complete journeys instead of only the subsequent 3. Below is a mock data frame, my current implementation, and a data frame showing the desired output.

# Data
df &lt;- data.frame(id = c(rep(1, 16), rep(2, 14)), 
                 stop=c(5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 
                         5, 6, 7, 8, 10, 11, 12, 13, 14, 15, 16, 17, 19, 20), 
                 time = rnorm(30))
# Extract the ids
ids &lt;- df$id
# Create a journey based data frame with stops as columns and time as values
subset_df &lt;- df %&gt;% tidyr::pivot_wider(names_from=&#39;stop&#39;, values_from = &#39;time&#39;)

# Remove incomplete &amp; ID column
rm_df &lt;- subset_df[complete.cases(subset_df), ]

# Get the remaining ids
remaining_ids &lt;- rm_df$id

# Subset the data frame with the complete cases
complete_cases_df &lt;- df[is.element(df$id, remaining_ids),]

Output I would like to get:

df2 &lt;- data.frame(id = c(rep(1, 16), rep(2, 9)), 
                 stop=c(5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 
                        5, 6, 7, 8, 13, 14, 15, 16, 17), 
                 time = rnorm(25))

It would be nice with flexibility in deciding how many subsequent stops are removed.

答案1

得分: 1

这是您提供的代码的翻译结果:

subset(df, !ave(stop, id, FUN = \(x)x %in% 
           unlist(lapply(x[diff(c(4,x))>1], \(y)seq(y,by=1, length=3)))))

请注意,这只是代码的翻译,不包括任何其他内容。

英文:
subset(df, !ave(stop, id, FUN = \(x)x %in% 
           unlist(lapply(x[diff(c(4,x))&gt;1], \(y)seq(y,by=1, length=3)))))

   id stop       time
1   1    5  0.3374932
2   1    6  0.2759855
3   1    7 -1.8097232
4   1    8 -0.3148637
5   1    9  0.1555224
6   1   10 -0.7753229
7   1   11 -2.3723279
8   1   12 -0.7845203
9   1   13  0.1387263
10  1   14  1.1845789
11  1   15  1.2969936
12  1   16  1.1421506
13  1   17 -2.5839955
14  1   18 -0.8345220
15  1   19  0.6228960
16  1   20 -1.4952315
17  2    5 -2.5591447
18  2    6  1.2018367
19  2    7  1.9759780
20  2    8 -0.4729620
24  2   13 -0.4501262
25  2   14  1.8839690
26  2   15  0.3628503
27  2   16 -2.0620141
28  2   17 -0.1574073

答案2

得分: 1

  1. 我们可以采用以下方法:
  2. 通过使用 arrange(id, stop) 来按正确的顺序排列。
  3. 然后,我们创建三列 gap、gap1、gap2,始终滞后一行。
  4. 这意味着在缺失数字后的 3 行中,我们现在有 TRUE。
  5. 通过将它们筛选出来,我们就得到了结果。
英文:

One way could be:

  1. We bring in the correct order with arrange(id, stop)
  2. The we create three columns gap, gap1, gap2, always lagged one row
  3. This means 3 rows after the lacking number we now have TRUE
  4. By filtering them out we get our result.
library(dplyr)


df %&gt;%
  arrange(id, stop) %&gt;%
  mutate(gap = stop - lag(stop, default = first(stop)) &gt; 1, .by = id) %&gt;%
  mutate(gap1 = lag(gap, default = gap[1]),
         gap2 = lag(gap1, default = gap1[1])) %&gt;% 
  filter(!if_any(starts_with(&quot;gap&quot;), ~. == TRUE)) %&gt;% 
  select(-starts_with(&quot;gap&quot;)) %&gt;% 
  as.data.frame()
 id stop        time
1   1    5 -0.22336881
2   1    6  0.36364387
3   1    7 -1.83544641
4   1    8  0.01814961
5   1    9 -1.30122669
6   1   10  0.84105979
7   1   11  1.65267900
8   1   12 -0.60058024
9   1   13 -0.08852824
10  1   14 -0.06700669
11  1   15 -0.83830797
12  1   16 -1.76628608
13  1   17 -1.04904130
14  1   18 -1.35796795
15  1   19 -0.12457266
16  1   20 -0.66617671
17  2    5 -0.44691032
18  2    6 -0.80684910
19  2    7 -0.58688365
20  2    8  0.69196557
21  2   13 -0.58865687
22  2   14  1.14572281
23  2   15  1.32314604
24  2   16 -1.17715578
25  2   17 -0.86947491

huangapple
  • 本文由 发表于 2023年5月17日 06:26:59
  • 转载请务必保留本文链接:https://go.coder-hub.com/76267450.html
匿名

发表评论

匿名网友

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen:

确定