What is the best way to check for consecutive missing values in a data column in R and exclude them based on a related column value?

Question

I am trying to write R code that checks whether the DAYS column of a dataset contains consecutive numbers and prints the missing DAYS values, with one exception: if the number of missing values between two consecutive rows, plus one, equals the PERIOD value in the later of the two rows, those missing values should be excluded from the output. For example, between the rows where DAYS is 163 and 165, one number is missing. The later row (DAYS 165) has a PERIOD of 2, which is the missing count + 1, so the missing value (164) should be excluded from the output. Between DAYS 170 and 172, however, the row for 172 has a PERIOD of 1 (not 2, i.e. not count + 1), so the missing value (171) should be shown.

Here are the first 28 rows of the dataset.

DAYS PERIOD
146	1
147	1
148	1
149	1
150	1
151	1
152	1
153	1
154	1
155	1
156	1
157	1
158	1
159	1
160	1
161	1
162	1
163	1
165	2
166	1
167	1
168	1
169	1
170	1
172	1
173	1
174	1
175	1
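For reproducibility, the 28 rows above could be rebuilt as a data frame roughly like this. The name `hs` matches the attempt in this question; `missing_count` is a helper name introduced here only for illustration. The last statement lists the rows that directly follow a gap, which are the two cases discussed above (DAYS 165 and 172).

```r
# Rebuild the 28 sample rows as a data frame named `hs`
hs <- data.frame(
  DAYS   = c(146:163, 165:170, 172:175),
  PERIOD = c(rep(1L, 18), 2L, rep(1L, 9))
)

# Number of DAYS values skipped before each row (row 2 onwards)
missing_count <- diff(hs$DAYS) - 1L

# Rows that directly follow a gap, with the size of that gap
after_gap <- data.frame(
  DAYS          = hs$DAYS[-1],
  PERIOD        = hs$PERIOD[-1],
  missing_count = missing_count
)[missing_count > 0, ]
after_gap
```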



What I have tried so far:

First, create the sequence of expected DAYS values:
expected_days <- seq(min(hs$DAYS), max(hs$DAYS))

Then, find the missing DAYS values:
missing_days <- setdiff(expected_days, hs$DAYS)

How do I do the next step?

Answer 1

Score: 0

I've managed to do this using tidyverse tools:

Set up example data

I've tweaked your data slightly to show that the solution can handle longer runs of missing days.

library(vroom)
library(dplyr)
library(tidyr)

test <-
  vroom(
    I(
"days period
161 1
162 1
163 1
166 3
167 1
168 1
169 1
170 1
172 1
"),
    col_types = c("ii"))

Add 'empty' days explicitly to data frame

all_days <- min(test[["days"]]):max(test[["days"]])

frame <- tibble(days = all_days)

test <-
  right_join(test, frame, by = "days") |>
  arrange(days)

test
#> # A tibble: 12 × 2
#>     days period
#>    <int>  <int>
#>  1   161      1
#>  2   162      1
#>  3   163      1
#>  4   164     NA
#>  5   165     NA
#>  6   166      3
#>  7   167      1
#>  8   168      1
#>  9   169      1
#> 10   170      1
#> 11   171     NA
#> 12   172      1
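As a possible alternative to the join (not what I used above), tidyr's `complete()` together with `full_seq()` can fill in the absent days in one step. This is only a sketch: it rebuilds the example data with `tibble()` instead of vroom so that it runs on its own, and `filled` is a name chosen here for illustration.

```r
library(dplyr)
library(tidyr)

# Same example data as above, built directly as a tibble
test <- tibble(
  days   = c(161L, 162L, 163L, 166L, 167L, 168L, 169L, 170L, 172L),
  period = c(1L, 1L, 1L, 3L, 1L, 1L, 1L, 1L, 1L)
)

# full_seq() spans min(days)..max(days) in steps of 1;
# complete() adds a row (with period = NA) for every day not yet present
filled <- complete(test, days = full_seq(days, 1))
filled
```

The explicit join makes each step easier to inspect, while `complete()` is shorter once the idea is clear.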

Find the number of consecutive missing days

test <-
  mutate(test,
         no_na = xor(is.na(period), is.na(lag(period))),
         missingness_group = cumsum(no_na)) |>
  select(-no_na)

test <-
  group_by(test, missingness_group) |>
  mutate(missing_days =
           case_when(
             all(is.na(period)) ~ n(),
             TRUE               ~ 0)) |>
  ungroup() |>
  select(-missingness_group)

test
#> # A tibble: 12 × 3
#>     days period missing_days
#>    <int>  <int>        <dbl>
#>  1   161      1            0
#>  2   162      1            0
#>  3   163      1            0
#>  4   164     NA            2
#>  5   165     NA            2
#>  6   166      3            0
#>  7   167      1            0
#>  8   168      1            0
#>  9   169      1            0
#> 10   170      1            0
#> 11   171     NA            1
#> 12   172      1            0
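The run lengths can also be computed in base R with `rle()`, which may be easier to follow than the `xor()`/`cumsum()` grouping. This is a sketch of an alternative, not the code used above; the `period` vector is typed in by hand to mirror the table, and `runs` is an illustrative name.

```r
# period column after the empty days have been added (NA = day not recorded)
period <- c(1, 1, 1, NA, NA, 3, 1, 1, 1, 1, NA, 1)

# Run-length encode the NA pattern: one entry per stretch of present/missing days
runs <- rle(is.na(period))

# Missing stretches get their own length, present stretches get 0,
# then each value is repeated so there is one entry per row again
missing_days <- rep(ifelse(runs$values, runs$lengths, 0L), runs$lengths)
missing_days
```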

Remove rows where days are all accounted for

test <- mutate(test, extra_days = period - 1)

test <- fill(test, extra_days, .direction = "up")

test <-
  filter(test, !is.na(period) | missing_days > extra_days) |>
  select(days, period)

test
#> # A tibble: 10 × 2
#>     days period
#>    <int>  <int>
#>  1   161      1
#>  2   162      1
#>  3   163      1
#>  4   166      3
#>  5   167      1
#>  6   168      1
#>  7   169      1
#>  8   170      1
#>  9   171     NA
#> 10   172      1

Created on 2023-06-01 with reprex v2.0.2

Answer 2

Score: 0

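Take the differences between consecutive DAYS values and keep the rows where the difference does not equal the corresponding PERIOD value (the first row has no predecessor, so it is never selected):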

hs[c(FALSE, diff(hs$DAYS) != hs$PERIOD[-1]), ]

Tidyverse version:

library(dplyr)
hs |>
  filter(c(FALSE, diff(DAYS) != PERIOD[-1]))
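The filter above returns the rows that follow an unexplained gap, not the missing day numbers themselves. If the missing DAYS values are what is needed, one way to expand the flagged gaps is sketched below. It rebuilds the question's 28 sample rows as `hs` so that it runs on its own; `unexplained`, `before` and `after` are illustrative names, and, like the filter, it reports every day in a gap whose length does not match PERIOD.

```r
# Sample data from the question (28 rows)
hs <- data.frame(
  DAYS   = c(146:163, 165:170, 172:175),
  PERIOD = c(rep(1L, 18), 2L, rep(1L, 9))
)

# TRUE for each row (from the second on) whose gap is not explained by PERIOD
unexplained <- diff(hs$DAYS) != hs$PERIOD[-1]

# DAYS values on either side of each unexplained gap
before <- hs$DAYS[-nrow(hs)][unexplained]
after  <- hs$DAYS[-1][unexplained]

# Expand each gap into the day numbers strictly between its two neighbours;
# seq_len() yields nothing when the two rows are adjacent
missing_days <- unlist(Map(function(a, b) a + seq_len(b - a - 1L), before, after))
missing_days
```

On the sample data this should leave only 171, since the gap before 165 is covered by its PERIOD of 2.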


huangapple
  • Posted on June 1, 2023, 21:31:47
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76382432.html