2023年5月17日 06:26:59go评论101阅读模式

英文:

Filter repetitive data with missing observations

问题

I have translated the code part as requested. Here is the translated code:

# 数据
df <- data.frame(id = c(rep(1, 16), rep(2, 14)), 
                 stop=c(5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 
                         5, 6, 7, 8, 10, 11, 12, 13, 14, 15, 16, 17, 19, 20), 
                 time = rnorm(30))
# 提取id
ids <- df$id
# 创建以停靠站为列名，时间为值的数据框
subset_df <- df %>% tidyr::pivot_wider(names_from='stop', values_from = 'time')
# 移除不完整的行和ID列
rm_df <- subset_df[complete.cases(subset_df), ]
# 获取剩余的id
remaining_ids <- rm_df$id
# 使用完整案例子集筛选数据框
complete_cases_df <- df[is.element(df$id, remaining_ids),]

请注意，这是您提供的代码的中文翻译，没有其他额外的内容。

英文:

I have a data frame with roughly 20000 observations. The data consists of recordings from journeys along a single bus route, between stop 5 and 20. That is, ideally it would be a column = stop which would be repeating itself as in 5, 6, 7, ..., 19, 20, 5, 6, ... etc. Unfortunately, there is missing data, so that is not always the case.

When there is missing data, I would like to remove the 3 following recordings from the same route. For instance, let's say I have a sequence as 5, 6, 7, 8, 10, 11, 12, 13, 14, 15, 16, 17, 19, 20, where the recording for stops 9 and 18 are missing. I then want to remove the rows of recording 10, 11, 12 and 19, 20.

Is there a neat way to do this in R?

Currently, I am using tidyr to pivot the data, and remove incomplete observations that way, but that removes complete journeys instead of only the subsequent 3. Below is a mock data frame, my current implementation, and a data frame showing the desired output.

# Data
df &lt;- data.frame(id = c(rep(1, 16), rep(2, 14)), 
                 stop=c(5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 
                         5, 6, 7, 8, 10, 11, 12, 13, 14, 15, 16, 17, 19, 20), 
                 time = rnorm(30))
# Extract the ids
ids &lt;- df$id
# Create a journey based data frame with stops as columns and time as values
subset_df &lt;- df %&gt;% tidyr::pivot_wider(names_from=&#39;stop&#39;, values_from = &#39;time&#39;)
# Remove incomplete &amp; ID column
rm_df &lt;- subset_df[complete.cases(subset_df), ]
# Get the remaining ids
remaining_ids &lt;- rm_df$id
# Subset the data frame with the complete cases
complete_cases_df &lt;- df[is.element(df$id, remaining_ids),]

Output I would like to get:

df2 &lt;- data.frame(id = c(rep(1, 16), rep(2, 9)), 
                 stop=c(5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 
                        5, 6, 7, 8, 13, 14, 15, 16, 17), 
                 time = rnorm(25))

It would be nice with flexibility in deciding how many subsequent stops are removed.

答案1

得分: 1

这是您提供的代码的翻译结果：

subset(df, !ave(stop, id, FUN = \(x)x %in% 
           unlist(lapply(x[diff(c(4,x))>1], \(y)seq(y,by=1, length=3)))))

请注意，这只是代码的翻译，不包括任何其他内容。

英文:

subset(df, !ave(stop, id, FUN = \(x)x %in% 
           unlist(lapply(x[diff(c(4,x))&gt;1], \(y)seq(y,by=1, length=3)))))
   id stop       time
1   1    5  0.3374932
2   1    6  0.2759855
3   1    7 -1.8097232
4   1    8 -0.3148637
5   1    9  0.1555224
6   1   10 -0.7753229
7   1   11 -2.3723279
8   1   12 -0.7845203
9   1   13  0.1387263
10  1   14  1.1845789
11  1   15  1.2969936
12  1   16  1.1421506
13  1   17 -2.5839955
14  1   18 -0.8345220
15  1   19  0.6228960
16  1   20 -1.4952315
17  2    5 -2.5591447
18  2    6  1.2018367
19  2    7  1.9759780
20  2    8 -0.4729620
24  2   13 -0.4501262
25  2   14  1.8839690
26  2   15  0.3628503
27  2   16 -2.0620141
28  2   17 -0.1574073

答案2

得分: 1

我们可以采用以下方法：
通过使用 arrange(id, stop) 来按正确的顺序排列。
然后，我们创建三列 gap、gap1、gap2，始终滞后一行。
这意味着在缺失数字后的 3 行中，我们现在有 TRUE。
通过将它们筛选出来，我们就得到了结果。

英文:

One way could be:

We bring in the correct order with arrange(id, stop)
The we create three columns gap, gap1, gap2, always lagged one row
This means 3 rows after the lacking number we now have TRUE
By filtering them out we get our result.

library(dplyr)
df %&gt;%
  arrange(id, stop) %&gt;%
  mutate(gap = stop - lag(stop, default = first(stop)) &gt; 1, .by = id) %&gt;%
  mutate(gap1 = lag(gap, default = gap[1]),
         gap2 = lag(gap1, default = gap1[1])) %&gt;% 
  filter(!if_any(starts_with(&quot;gap&quot;), ~. == TRUE)) %&gt;% 
  select(-starts_with(&quot;gap&quot;)) %&gt;% 
  as.data.frame()

 id stop        time
1   1    5 -0.22336881
2   1    6  0.36364387
3   1    7 -1.83544641
4   1    8  0.01814961
5   1    9 -1.30122669
6   1   10  0.84105979
7   1   11  1.65267900
8   1   12 -0.60058024
9   1   13 -0.08852824
10  1   14 -0.06700669
11  1   15 -0.83830797
12  1   16 -1.76628608
13  1   17 -1.04904130
14  1   18 -1.35796795
15  1   19 -0.12457266
16  1   20 -0.66617671
17  2    5 -0.44691032
18  2    6 -0.80684910
19  2    7 -0.58688365
20  2    8  0.69196557
21  2   13 -0.58865687
22  2   14  1.14572281
23  2   15  1.32314604
24  2   16 -1.17715578
25  2   17 -0.86947491

通过集体智慧和协作来改善编程学习和解决问题的方式。致力于成为全球开发者共同参与的知识库，让每个人都能够通过互相帮助和分享经验来进步。

过滤具有缺失观测数据的重复数据。

问题

答案1

答案2

在pandas中比较行的子集。

条件数学运算与pandas数据框中的列

添加工具提示到R闪亮仪表板valueBox时出现问题。

在ggplot2中，根据给定条件对绘制的线进行对齐。

如何在Playwright视觉比较中屏蔽多个定位器？

在C++中，可以使用可变模板参数来检索类型的内部类型。

selenium.common.exceptions.StaleElementReferenceException: Message: stale element reference: stale element not found

Creating and opening a URL to log in to Website via Basic Auth with Robot Framework/Selenium (Python)

AG Grid 在上下文菜单中以大文本形式打开

What's the correct way to type hint an empty list as a literal in python?

如何在Highcharts Gantt中更改本地化的星期名称

如何在同一个流中使用多个过滤器和映射函数？

如何使用Map/Set来将代码优化到O(n)？

.NET MAUI Android在GitHub Actions上构建失败，错误代码为1。