Posted June 1, 2023, 21:31:47

What is the best way to check for consecutive missing values in a data column in R and exclude them based on a related column value?

Question

I am trying to write R code that checks whether the DAYS column of a dataset contains consecutive numbers and prints the missing DAYS values, with one exception: if the number of consecutive values missing between two rows of DAYS, plus one, equals the PERIOD value of the later row, those missing values should be excluded from the output. For example, consider the two rows with DAYS 163 and 165, where one value is missing. The later row (DAYS 165) has a PERIOD value of 2, that is, count + 1, so the missing value (164) should be excluded from the output. However, for DAYS 170 and 172, the row with DAYS 172 has a PERIOD value of 1 (not 2, which would be count + 1), so the missing value (171) should be shown.

Here are the first 28 rows of the dataset.

DAYS PERIOD
146	1
147	1
148	1
149	1
150	1
151	1
152	1
153	1
154	1
155	1
156	1
157	1
158	1
159	1
160	1
161	1
162	1
163	1
165	2
166	1
167	1
168	1
169	1
170	1
172	1
173	1
174	1
175	1
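
For reproducibility, the 28 rows above could be rebuilt as a data frame along these lines. This is only a sketch: the name `hs` is taken from the code in the question, and the integer column types are an assumption. The last line lists the rows that follow a gap in DAYS.

```r
# Rebuild the 28 rows shown above (column types assumed to be integer)
hs <- data.frame(
  DAYS   = c(146:163, 165:170, 172:175),
  PERIOD = c(rep(1L, 18), 2L, rep(1L, 9))
)

# Difference between each row's DAYS and the previous row's DAYS;
# a value greater than 1 means at least one day was skipped
gaps <- diff(hs$DAYS)

# Rows that come directly after a gap (the first row has no predecessor)
hs[c(FALSE, gaps > 1), ]
```

On this data the two rows after a gap are DAYS 165 (PERIOD 2) and DAYS 172 (PERIOD 1), which are the two cases described in the question.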

What I tried:

First, I created a sequence of the expected DAYS values:
expected_days <- seq(min(hs$DAYS), max(hs$DAYS))

Then I found the missing DAYS values:
missing_days <- setdiff(expected_days, hs$DAYS)

How do I do the next step?

Answer 1

Score: 0

I've managed to do this using tidyverse tools:

Set up example data

I've tweaked your data slightly to show that the solution can handle longer runs of missing days.

library(vroom)
library(dplyr)
library(tidyr)
# Read a small space-delimited example from a literal string;
# "ii" parses both columns as integers
test <-
  vroom(
    I(
"days period
161 1
162 1
163 1
166 3
167 1
168 1
169 1
170 1
172 1
"),
    col_types = "ii")

Add 'empty' days explicitly to data frame

# Build the full run of days, then join so that days absent from
# the data appear as rows with period = NA
all_days <- min(test[["days"]]):max(test[["days"]])
frame <- tibble(days = all_days)
test <-
  right_join(test, frame, by = "days") |>
  arrange(days)
test
#> # A tibble: 12 × 2
#>     days period
#>    <int>  <int>
#>  1   161      1
#>  2   162      1
#>  3   163      1
#>  4   164     NA
#>  5   165     NA
#>  6   166      3
#>  7   167      1
#>  8   168      1
#>  9   169      1
#> 10   170      1
#> 11   171     NA
#> 12   172      1
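
For readers without dplyr, the padding step above could probably also be done in base R with `merge()`. This is an assumed equivalent rather than part of the answer itself; the example data is retyped so that the block runs on its own, and `padded` is a name chosen only for this illustration.

```r
# Same example data as above, typed in directly
test <- data.frame(
  days   = c(161:163, 166:170, 172L),
  period = c(1L, 1L, 1L, 3L, 1L, 1L, 1L, 1L, 1L)
)

# One row per day between the first and last recorded day
frame <- data.frame(days = min(test$days):max(test$days))

# all.y = TRUE keeps every row of `frame`, so days with no record
# get period = NA; merge() sorts the result by the key column
padded <- merge(test, frame, by = "days", all.y = TRUE)
padded
```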

Find the number of consecutive missing days

# Start a new group whenever the NA status of period differs
# from that of the previous row
test <-
  mutate(test,
         no_na = xor(is.na(period), is.na(lag(period))),
         missingness_group = cumsum(no_na)) |>
  select(-no_na)
# In a group made up entirely of missing days, record the run
# length; in every other group record 0
test <-
  group_by(test, missingness_group) |>
  mutate(missing_days =
           case_when(
             all(is.na(period)) ~ n(),
             TRUE               ~ 0)) |>
  ungroup() |>
  select(-missingness_group)
test
#> # A tibble: 12 × 3
#>     days period missing_days
#>    <int>  <int>        <dbl>
#>  1   161      1            0
#>  2   162      1            0
#>  3   163      1            0
#>  4   164     NA            2
#>  5   165     NA            2
#>  6   166      3            0
#>  7   167      1            0
#>  8   168      1            0
#>  9   169      1            0
#> 10   170      1            0
#> 11   171     NA            1
#> 12   172      1            0
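
The run-length idea behind this step can also be sketched with base R's `rle()`. This is an alternative illustration, not the code used in the answer; the `period` vector is copied from the padded table above.

```r
# period column of the padded table (NA marks a missing day)
period <- c(1L, 1L, 1L, NA, NA, 3L, 1L, 1L, 1L, 1L, NA, 1L)

# Run-length encode the NA pattern: consecutive equal values
# collapse into one run with a length and a value
runs <- rle(is.na(period))

# For runs of missing days keep the run length, otherwise 0,
# then expand back to one value per row
missing_days <- rep(ifelse(runs$values, runs$lengths, 0L),
                    times = runs$lengths)
missing_days
```

The result should match the `missing_days` column shown above: 2 for days 164 and 165, 1 for day 171, and 0 elsewhere.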

Remove rows where days are all accounted for

# extra_days: number of preceding missing days that a row's period
# accounts for; fill upwards so each missing row carries the value
# of the next recorded row
test <- mutate(test, extra_days = period - 1)
test <- fill(test, extra_days, .direction = "up")
# Keep recorded rows, plus missing rows whose run is longer than
# the following period accounts for
test <-
  filter(test, !is.na(period) | missing_days > extra_days) |>
  select(days, period)
test
#> # A tibble: 10 × 2
#>     days period
#>    <int>  <int>
#>  1   161      1
#>  2   162      1
#>  3   163      1
#>  4   166      3
#>  5   167      1
#>  6   168      1
#>  7   169      1
#>  8   170      1
#>  9   171     NA
#> 10   172      1

Created on 2023-06-01 with reprex v2.0.2
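
As a cross-check on the result above, the same rule can be sketched in base R without padding the data frame first. This is an assumed shortcut, not the method of the answer: it compares the number of days skipped before each row with `period - 1` and lists the days in any gap that is longer than the period accounts for. The names `n_skipped`, `covered`, `unexplained`, and `missing` are made up for this illustration.

```r
# Example data from this answer, before padding
days   <- c(161:163, 166:170, 172L)
period <- c(1L, 1L, 1L, 3L, 1L, 1L, 1L, 1L, 1L)

# Days skipped before each row, from the second row onwards
n_skipped <- diff(days) - 1L

# Skipped days that the row's period accounts for
covered <- period[-1] - 1L

# Row indices that follow a gap longer than the period accounts for
unexplained <- which(n_skipped > covered) + 1L

# Expand each such gap into the day numbers it contains
missing <- unlist(lapply(unexplained, function(i) {
  seq(days[i - 1] + 1L, days[i] - 1L)
}))
missing
```

On this data only day 171 is returned; days 164 and 165 are accounted for by the period of 3 on day 166.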

Answer 2

Score: 0

Take the row differences and see if they equal the PERIOD column values (ignoring the first row):

hs[c(FALSE, diff(hs$DAYS) != hs$PERIOD[-1]), ]

Tidyverse version:

library(dplyr)
hs |>
  filter(c(FALSE, diff(DAYS) != PERIOD[-1]))
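
To see what this filter returns on the question's data, here is a small self-contained check using the base R form. The data frame is rebuilt from the 28 rows in the question, with integer columns assumed, and `flagged` is a name chosen only for this illustration.

```r
# The 28 rows from the question
hs <- data.frame(
  DAYS   = c(146:163, 165:170, 172:175),
  PERIOD = c(rep(1L, 18), 2L, rep(1L, 9))
)

# Keep rows whose step from the previous DAYS value differs from
# their own PERIOD; the first row has no predecessor, so it is dropped
flagged <- hs[c(FALSE, diff(hs$DAYS) != hs$PERIOD[-1]), ]
flagged
```

Note that this returns the row that follows the unexplained gap (DAYS 172), not the missing day itself (171).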