June 12, 2023, 14:35:17

Replace NA values with the first value in all directions in R

Question

I want to fill missing values in my table with the non-NA value from the closest date, whether that date falls before or after the reference date. For example, a table like:

   date         value
03.03.2023        1
04.03.2023       NA
06.03.2023        4
09.03.2023       NA 
10.03.2023        3

would be filled as:

   date         value
03.03.2023        1
04.03.2023        1
06.03.2023        4
09.03.2023        3
10.03.2023        3

Explanation: since 03.03. is the closest date to 04.03., LOCF is used. Since 10.03. is closer to 09.03. than 06.03. is, NOCB (na.locf with fromLast = TRUE) is used.

A conflict can arise when an NA value is flanked by two values at equal distance from the reference date. In that case, I would like LOCF to be preferred.

The code I have at the moment applies the rigid na.locf twice (once in the default direction and once with fromLast = TRUE), which is not flexible enough:

library(dplyr)
library(readr)
library(zoo)

# `results` is an output directory defined elsewhere.
# na.rm = FALSE keeps leading/trailing NAs so the filled vector has the group's length.
read.csv("path/to/merged_data.csv",
         colClasses = c("Date", "numeric", "numeric", "numeric", "character")) %>%
  group_by(field_id) %>%
  arrange(date) %>%
  mutate(
    Nearest_l8_locf = ifelse(!is.na(NDVI_l7) & is.na(NDVI_l8), na.locf(NDVI_l8, na.rm = FALSE), NDVI_l8),
    Nearest_s2_locf = ifelse(!is.na(NDVI_l7) & is.na(NDVI_s2), na.locf(NDVI_s2, na.rm = FALSE), NDVI_s2),
    Nearest_l8_locb = ifelse(!is.na(NDVI_l7) & is.na(NDVI_l8), na.locf(NDVI_l8, na.rm = FALSE, fromLast = TRUE), NDVI_l8),
    Nearest_s2_locb = ifelse(!is.na(NDVI_l7) & is.na(NDVI_s2), na.locf(NDVI_s2, na.rm = FALSE, fromLast = TRUE), NDVI_s2)
  ) %>%
  filter(!is.na(NDVI_l7)) %>%
  select(-NDVI_l8, -NDVI_s2) %>%
  relocate(field_id, .after = last_col()) %>%
  write_csv(file.path(results, "merged_data_interpolated.csv"))

In my actual case, the reference dates are all dates on which one column (NDVI_l7) is not NA, and the NA values are filled in two other columns (NDVI_l8 and NDVI_s2). The data is also grouped by the column "field_id", since the dates are repeated for each ID.

How can I adapt the code so that NA values are filled with the value from the closest date, regardless of where that date sits in the column?

Answer 1

Score: 1

I have managed to write a function for what I need.

library(dplyr)
library(readr)

# Return, for every element of x, the non-NA value recorded on the nearest date.
# Ties go to the earlier date (LOCF) because which.min() keeps the first minimum
# and the rows are sorted by date.
find_nearest_value <- function(x, dates) {
  ok <- which(!is.na(x) & !is.na(dates))
  if (length(ok) == 0) {
    return(rep(NA_real_, length(x)))
  }
  vapply(seq_along(x), function(i) {
    if (is.na(dates[i])) return(NA_real_)
    x[ok[which.min(abs(as.numeric(dates[ok] - dates[i])))]]
  }, numeric(1))
}

# Apply function (`results` is an output directory defined elsewhere)
read.csv("path/to/merged_data.csv",
         colClasses = c("Date", "numeric", "numeric", "numeric", "character")) %>%
  group_by(field_id, year = lubridate::year(date)) %>%
  arrange(date) %>%
  mutate(
    Nearest_l8 = ifelse(!is.na(NDVI_l7) & is.na(NDVI_l8), find_nearest_value(NDVI_l8, date), NDVI_l8),
    Nearest_s2 = ifelse(!is.na(NDVI_l7) & is.na(NDVI_s2), find_nearest_value(NDVI_s2, date), NDVI_s2)
  ) %>%
  ungroup() %>%
  filter(!is.na(NDVI_l7)) %>%
  select(-NDVI_l8, -NDVI_s2, -year) %>%
  relocate(field_id, .after = last_col()) %>%
  write_csv(file.path(results, "merged_data_function_year.csv"))

Note that I added an extra step to group by year. I did this because the resulting values were not normally distributed: my data is seasonal and only covers the months of April through July. Restricting the function to a single year solved the problem.
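The effect of restricting the fill to one field and one year can be sketched on a small made-up table. This is a minimal base R illustration and not the pipeline from this answer: the helper `fill_nearest()`, the data frame `toy`, its column `NDVI`, and all values are hypothetical, and the rows are assumed to be sorted by date within each group.

```r
# Hypothetical helper: fill each NA with the value from the nearest date.
# Ties resolve to the earlier date because which.min() keeps the first minimum.
fill_nearest <- function(values, dates) {
  ok <- which(!is.na(values))
  if (length(ok) == 0) return(values)            # nothing to borrow from
  for (i in which(is.na(values))) {
    gap <- abs(as.numeric(dates[ok]) - as.numeric(dates[i]))
    values[i] <- values[ok[which.min(gap)]]
  }
  values
}

# Made-up data: field "b" has a value in 2022 but only an NA in 2023.
toy <- data.frame(
  field_id = c("a", "a", "a", "b", "b"),
  date = as.Date(c("2023-04-01", "2023-04-02", "2023-04-10",
                   "2022-07-01", "2023-05-01")),
  NDVI = c(0.2, NA, 0.6, 0.5, NA)
)
toy <- toy[order(toy$field_id, toy$date), ]

# One group of row positions per field and calendar year
groups <- split(seq_len(nrow(toy)),
                list(toy$field_id, format(toy$date, "%Y")),
                drop = TRUE)

toy$NDVI_filled <- toy$NDVI
for (rows in groups) {
  toy$NDVI_filled[rows] <- fill_nearest(toy$NDVI[rows], toy$date[rows])
}

toy$NDVI_filled
# [1] 0.2 0.2 0.6 0.5  NA
```

The last row stays NA because its group (field "b", 2023) contains no observed value. Without the year in the grouping, it would borrow 0.5 from July 2022.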

Answer 2

Score: 0

Another variant, using base R only.

With d being your example data:

d <- structure(list(date = structure(c(19419, 19420, 19422, 19425, 19426), class = "Date"),
                    value = c(1L, NA, 4L, NA, 3L)),
               row.names = c(NA, 5L), class = "data.frame")

Convert the date column to class Date (only needed if the dates were read as strings such as "03.03.2023"):

d$date <- as.Date(d$date, '%d.%m.%Y')

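Use a distance matrix to find the closest non-NA neighbour of each NA:

impute_from_neighbours <- function(values, dates){
  # pairwise distances (in days) between all dates
  dists <- dist(as.numeric(dates)) |> as.matrix()
  na_pos <- which(is.na(values))
  ok_pos <- which(!is.na(values))
  if (length(na_pos) == 0 || length(ok_pos) == 0) return(values)
  # rows: non-NA donors, columns: positions to fill; drop = FALSE keeps a matrix
  closest <- apply(dists[ok_pos, na_pos, drop = FALSE], 2, which.min)
  values[na_pos] <- values[ok_pos[closest]]
  values
}

d$value <- impute_from_neighbours(d$value, d$date)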

Output:

> d
        date value
1 2023-03-03     1
2 2023-03-04     1
3 2023-03-06     4
4 2023-03-09     3
5 2023-03-10     3
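The question also asks that ties resolve to LOCF. A minimal sketch of how a distance matrix can honour that rule, assuming the rows are sorted by date: if the matrix rows are restricted to the non-NA "donor" positions, `which.min()` returns the first (earliest) donor among equally close ones. The vectors `dates` and `values` and the names `donors`, `gaps`, `nearest`, and `filled` are made up for this illustration.

```r
# Made-up example: the NA on 03.03. is exactly two days from both neighbours.
dates  <- as.Date(c("2023-03-01", "2023-03-03", "2023-03-05"))
values <- c(1, NA, 5)

donors <- which(!is.na(values))                  # positions that can lend a value

# gaps[r, c] = distance in days between donor r and row c
gaps <- abs(outer(as.numeric(dates[donors]), as.numeric(dates), "-"))

# For each row, the closest donor; on a tie which.min() keeps the earlier one
nearest <- donors[apply(gaps, 2, which.min)]

filled <- ifelse(is.na(values), values[nearest], values)
filled
# [1] 1 1 5
```

If NOCB should win ties instead, one option is to reverse the donor order before calling `which.min()`.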
