What is the best way to check for consecutive missing values in a data column in R and exclude them based on a related column value?
Question
I am trying to write R code that checks whether the DAYS column of a dataset contains consecutive numbers and prints the missing DAYS values, with one exception: if the number of consecutive missing values between two rows of DAYS, plus one, equals the PERIOD value in the later of the two rows, the gap should be excluded from the output. For example, consider the rows with DAYS 163 and 165, where one value is missing. Here the later row (DAYS 165) has a PERIOD value of 2, that is, count + 1, so the missing value (164) should be excluded from the output. However, for DAYS 170 and 172, the row for 172 has a PERIOD value of 1 (not 2, i.e. count + 1), so this missing value (171) should be shown.
Here are the first 28 rows of the dataset.
DAYS PERIOD
146 1
147 1
148 1
149 1
150 1
151 1
152 1
153 1
154 1
155 1
156 1
157 1
158 1
159 1
160 1
161 1
162 1
163 1
165 2
166 1
167 1
168 1
169 1
170 1
172 1
173 1
174 1
175 1
Here is what I tried.
First, I created a sequence of the expected DAYS values:
expected_days <- seq(min(hs$DAYS), max(hs$DAYS))
Then I found the missing DAYS values:
missing_days <- setdiff(expected_days, hs$DAYS)
How do I do the next step?
Answer 1
Score: 0
I've managed to do this using tidyverse tools:
Set up example data
I've tweaked your data slightly to show that the solution can handle longer runs of missing days.
library(vroom)
library(dplyr)
library(tidyr)
test <-
  vroom(
    I(
"days period
161 1
162 1
163 1
166 3
167 1
168 1
169 1
170 1
172 1
"),
    col_types = c("ii"))
Add 'empty' days explicitly to data frame
all_days <- min(test[["days"]]):max(test[["days"]])
frame <- tibble(days = all_days)
test <-
  right_join(test, frame, by = "days") |>
  arrange(days)
test
#> # A tibble: 12 × 2
#> days period
#> <int> <int>
#> 1 161 1
#> 2 162 1
#> 3 163 1
#> 4 164 NA
#> 5 165 NA
#> 6 166 3
#> 7 167 1
#> 8 168 1
#> 9 169 1
#> 10 170 1
#> 11 171 NA
#> 12 172 1
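The same fill-in step can also be sketched in base R, for readers who would rather not load vroom and dplyr. This is an illustrative alternative rather than the answerer's code; the `filled` name is an assumption, and the data are retyped by hand from the example above.

```r
# Example data retyped from the answer (assumed equivalent to the vroom input).
test <- data.frame(
  days   = c(161, 162, 163, 166, 167, 168, 169, 170, 172),
  period = c(1, 1, 1, 3, 1, 1, 1, 1, 1)
)

# One row per day between the first and last recorded day.
frame <- data.frame(days = seq(min(test$days), max(test$days)))

# Left join: keep every day in `frame`; days absent from `test` get period = NA.
# merge() sorts the result by the key column by default.
filled <- merge(frame, test, by = "days", all.x = TRUE)

# Days that had no row in the original data.
filled$days[is.na(filled$period)]
```

With this data the last line should list 164, 165 and 171, matching the NA rows in the tibble above.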
Find the number of consecutive missing days
test <-
  mutate(test,
         no_na = xor(is.na(period), is.na(lag(period))),
         missingness_group = cumsum(no_na)) |>
  select(-no_na)
test <-
  group_by(test, missingness_group) |>
  mutate(missing_days =
           case_when(
             all(is.na(period)) ~ n(),
             TRUE ~ 0)) |>
  ungroup() |>
  select(-missingness_group)
test
#> # A tibble: 12 × 3
#> days period missing_days
#> <int> <int> <dbl>
#> 1 161 1 0
#> 2 162 1 0
#> 3 163 1 0
#> 4 164 NA 2
#> 5 165 NA 2
#> 6 166 3 0
#> 7 167 1 0
#> 8 168 1 0
#> 9 169 1 0
#> 10 170 1 0
#> 11 171 NA 1
#> 12 172 1 0
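The run-counting step can also be expressed with base R's `rle()`, which may make the idea easier to follow: each row is labelled with the length of the run of missing rows it belongs to. This is a sketch of the same idea, not the answerer's code; the `period` vector is copied by hand from the joined table above.

```r
# period column after the join step; NA marks days with no record.
period <- c(1, 1, 1, NA, NA, 3, 1, 1, 1, 1, NA, 1)

# Run-length encode the missing/present pattern.
runs <- rle(is.na(period))

# Repeat each run's length once per row in that run,
# so every row knows how long its own run is.
run_len <- rep(runs$lengths, runs$lengths)

# Rows with a recorded period get 0; missing rows get their run length.
missing_days <- ifelse(is.na(period), run_len, 0)
missing_days
#> [1] 0 0 0 2 2 0 0 0 0 0 1 0
```

The result matches the `missing_days` column in the tibble above.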
Remove rows where days are all accounted for
test <- mutate(test, extra_days = period - 1)
test <- fill(test, extra_days, .direction = "up")
test <-
  filter(test, !is.na(period) | missing_days > extra_days) |>
  select(days, period)
test
#> # A tibble: 10 × 2
#> days period
#> <int> <int>
#> 1 161 1
#> 2 162 1
#> 3 163 1
#> 4 166 3
#> 5 167 1
#> 6 168 1
#> 7 169 1
#> 8 170 1
#> 9 171 NA
#> 10 172 1
Created on 2023-06-01 with reprex v2.0.2
Answer 2
Score: 0
Take the row differences and check whether they equal the PERIOD column values (ignoring the first row):
hs[c(FALSE, diff(hs$DAYS) != hs$PERIOD[-1]), ]
Tidyverse version:
library(dplyr)
hs |>
  filter(c(FALSE, diff(DAYS) != PERIOD[-1]))
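Note that this comparison returns the rows that follow an unexplained gap (for example the row with DAYS 172), not the missing day numbers themselves. If the missing values are what is wanted, a sketch along the following lines may help. The helper name `unexplained_gaps` is hypothetical, the sample rows are retyped from the question, and the rule is assumed to be the same `!=` comparison used in this answer.

```r
# Sample rows retyped from the question's data (DAYS 161 to 175).
hs <- data.frame(
  DAYS   = c(161, 162, 163, 165, 166, 167, 168, 169, 170, 172, 173, 174, 175),
  PERIOD = c(1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1)
)

# Hypothetical helper: missing DAYS values whose gap is not matched by PERIOD.
unexplained_gaps <- function(days, period) {
  gap <- diff(days)                # distance from the previous row
  bad <- which(gap != period[-1])  # gaps that PERIOD does not account for
  # For each such gap, list the days strictly between the two rows.
  # seq_len(0) is empty, so a gap of 1 contributes nothing.
  out <- lapply(bad, function(i) days[i] + seq_len(gap[i] - 1))
  as.numeric(unlist(out))
}

unexplained_gaps(hs$DAYS, hs$PERIOD)
#> [1] 171
```

Day 164 is left out because the row for 165 has PERIOD 2, which matches its gap of 2, while 171 is reported because the row for 172 has PERIOD 1.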