Replace NA values with the first value in all directions in R

Question

I want to fill the missing values in my table with the non-NA value from the closest date, whether that date lies before or after the reference date. This means that a table like:

   date         value
03.03.2023        1
04.03.2023       NA
06.03.2023        4
09.03.2023       NA 
10.03.2023        3

Would be filled as:

   date         value
03.03.2023        1
04.03.2023        1
06.03.2023        4
09.03.2023        3 
10.03.2023        3

Explanation: Since 03.03. is closer to 04.03. than 06.03. is, locf is used. Since 10.03. is closer to 09.03. than 06.03. is, nocb (locf with fromLast = TRUE) is used.

A conflict can arise if an NA value is flanked by two values at equal distance from the reference date. In that case, I would like locf to be preferred.
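
To make the two cases in the explanation concrete, the gaps in days can be worked out directly. This is only a small base R sketch on the example table; the names `dates`, `value`, `known` and `gap_days` are illustrative and not part of my real code.

```r
# Example table from above; dates parsed from the dd.mm.yyyy strings
dates <- as.Date(c("03.03.2023", "04.03.2023", "06.03.2023",
                   "09.03.2023", "10.03.2023"), format = "%d.%m.%Y")
value <- c(1, NA, 4, NA, 3)

# Rows that can donate a value
known <- which(!is.na(value))

# Gap in days from every NA row (columns) to every non-NA row (rows)
gap_days <- sapply(which(is.na(value)), function(i) {
  abs(as.numeric(dates[known] - dates[i]))
})
dimnames(gap_days) <- list(format(dates[known], "%d.%m."),
                           format(dates[is.na(value)], "%d.%m."))
gap_days
```

For 04.03. the smallest gap is the 1 day back to 03.03. (the locf case), while for 09.03. it is the 1 day forward to 10.03. (the nocb case).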

The code I have at the moment applies the rigid "locf" twice (once in the standard direction and once with fromLast = TRUE), which is not flexible enough:

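library(dplyr)  # %>%, group_by(), arrange(), mutate(), ...
library(zoo)    # na.locf()
library(readr)  # write_csv()

read.csv("path/to/merged_data.csv",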
         colClasses = c("Date", "numeric", "numeric", "numeric", "character")) %>%
  group_by(field_id) %>%
  arrange(date) %>%
  mutate(
    Nearest_l8_locf = ifelse(!is.na(NDVI_l7) & is.na(NDVI_l8), na.locf(NDVI_l8), NDVI_l8),
    Nearest_s2_locf = ifelse(!is.na(NDVI_l7) & is.na(NDVI_s2), na.locf(NDVI_s2), NDVI_s2),
    Nearest_l8_locb = ifelse(!is.na(NDVI_l7) & is.na(NDVI_l8), na.locf(NDVI_l8, fromLast = TRUE), NDVI_l8),
    Nearest_s2_locb = ifelse(!is.na(NDVI_l7) & is.na(NDVI_s2), na.locf(NDVI_s2, fromLast = TRUE), NDVI_s2)
  ) %>%
  filter(!is.na(NDVI_l7)) %>%
  select(-NDVI_l8, -NDVI_s2) %>%
  relocate(field_id, .after = last_col()) %>%
  write_csv(file.path(results, "merged_data_interpolated.csv"))

In my actual case, the reference dates are all dates for which one column (NDVI_l7) is not NA, and the NA-filling procedure is applied to two other columns (NDVI_l8 and NDVI_s2). The data are also grouped by the column "field_id", since the dates are repeated for each ID.

How can I adapt the code so that NA values are populated with the values of the closest date, regardless of where it is in the column?

Answer 1

Score: 1

I have managed to write a function for what I need.

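# Function to find, for every row, the non-NA value with the nearest date

find_nearest_value <- function(x, target_date) {
  # candidate rows: those with both a value and a date
  known <- which(!is.na(x) & !is.na(target_date))
  if (length(known) == 0) {
    return(rep(NA_real_, length(x)))
  }
  vapply(seq_along(x), function(i) {
    if (is.na(target_date[i])) return(NA_real_)
    # distance in days from row i to every candidate row
    gaps <- abs(as.numeric(target_date[known]) - as.numeric(target_date[i]))
    # which.min() takes the first minimum; with dates sorted ascending,
    # ties go to the earlier date (the locf choice)
    x[known[which.min(gaps)]]
  }, numeric(1))
}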

# Apply function
library(dplyr)  # %>%, group_by(), arrange(), mutate(), ...
library(readr)  # write_csv()

read.csv("path/to/merged_data.csv",
         colClasses = c("Date", "numeric", "numeric", "numeric", "character")) %>%
  group_by(field_id, year = lubridate::year(date)) %>%
  arrange(date) %>%
  mutate(
    Nearest_l8 = ifelse(!is.na(NDVI_l7) & is.na(NDVI_l8), find_nearest_value(NDVI_l8, date), NDVI_l8),
    Nearest_s2 = ifelse(!is.na(NDVI_l7) & is.na(NDVI_s2), find_nearest_value(NDVI_s2, date), NDVI_s2)
  ) %>%
  ungroup() %>%
  filter(!is.na(NDVI_l7)) %>%
  select(-NDVI_l8, -NDVI_s2, -year) %>%
  relocate(field_id, .after = last_col()) %>%
  write_csv(file.path(results, "merged_data_function_year.csv"))

Note that I added an extra step to group by year in my code. I did this because the resulting values were not normally distributed. My data are affected by seasonality and only cover the months of April through July. Restricting the function to a single year at a time solved the problem.
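
As a rough illustration of what the year grouping changes, here is a base R sketch. It uses a stand-in helper, `fill_nearest()`, and invented toy values; neither the helper name nor the numbers come from my real data. The assumption being shown is that, without the grouping, a season with no observation at all borrows a value from a different year, whereas with the grouping it stays NA.

```r
# Hypothetical helper: nearest non-NA value by date, ties go to the earlier date
fill_nearest <- function(x, dates) {
  known <- which(!is.na(x))
  if (length(known) == 0) return(x)          # nothing to fill from
  for (i in which(is.na(x))) {
    gaps <- abs(as.numeric(dates[known]) - as.numeric(dates[i]))
    x[i] <- x[known[which.min(gaps)]]        # first minimum = earlier date
  }
  x
}

# Toy data (invented): one field, 2021 has an observation, 2022 has none
toy <- data.frame(
  date = as.Date(c("2021-06-01", "2021-06-11", "2022-06-05")),
  ndvi = c(0.8, NA, NA)
)

# Without the year grouping, every NA takes the nearest date overall
toy$all_years <- fill_nearest(toy$ndvi, toy$date)

# With the year grouping, the fill runs separately inside each year
yr <- format(toy$date, "%Y")
toy$by_year <- toy$ndvi
for (y in unique(yr)) {
  rows <- yr == y
  toy$by_year[rows] <- fill_nearest(toy$ndvi[rows], toy$date[rows])
}
toy
```

In this toy case the 2022 row is filled from 2021 in `all_years` but remains NA in `by_year`, because its own year has no value to borrow from.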

Answer 2

Score: 0

Another variant, using base R only.

With d being your example data:

d <- structure(list(date = structure(c(19419, 19420, 19422, 19425,
19426), class = "Date"), value = c(1L, NA, 4L, NA, 3L)), row.names = c(NA,
5L), class = "data.frame")

Convert column date to class Date (it already is in d above; this step matters when the dates were read in as strings such as "03.03.2023"):

d$date <- as.Date(d$date, '%d.%m.%Y')

Use the distance function dist() to find the closest neighbour:

impute_from_neighbours <- function(values, dates){
  # pairwise distances in days between all dates
  dists <- dist(dates) |> as.matrix()
  # rows that are NA themselves cannot donate a value
  dists[is.na(values), ] <- NA
  na_pos <- which(is.na(values))
  # drop = FALSE keeps a matrix even when there is only one NA
  closest_non_na_pos <- apply(dists[, na_pos, drop = FALSE], 2, which.min)
  values[na_pos] <- values[closest_non_na_pos]
  values
}
d$value <- impute_from_neighbours(d$value, d$date)

Output:

> d
        date value
1 2023-03-03     1
2 2023-03-04     1
3 2023-03-06     4
4 2023-03-09     3
5 2023-03-10     3
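
One point worth checking is how ties are resolved. The sketch below uses an invented tie case (the dates and values are made up for illustration) and relies only on the documented behaviour of `which.min()`, which returns the index of the first minimum.

```r
# Invented tie case: the NA on 2023-03-05 is 2 days from both neighbours
dates  <- as.Date(c("2023-03-03", "2023-03-05", "2023-03-07"))
values <- c(1, NA, 4)

# Gaps in days from the NA row to the two non-NA rows
gaps <- abs(as.numeric(dates[c(1, 3)] - dates[2]))
gaps

# which.min() returns the index of the first minimum, so with rows sorted
# by ascending date the earlier neighbour wins the tie (the locf choice)
tie_fill <- values[c(1, 3)][which.min(gaps)]
tie_fill
```

So as long as the rows are sorted by date in ascending order, an equal-distance conflict is settled in favour of the earlier value, which matches the preference for locf stated in the question.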

  • Posted by huangapple on 2023-06-12 14:35:17
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76454103.html