英文:
Merge dataframes without duplicate rows with NA values
问题
你可以使用以下代码来合并数据框并消除NA值以获得df2
:
df2 <- a %>%
full_join(b, by = "hour") %>%
full_join(c, by = "hour")
这将按照小时(hour
)列连接数据框a
,b
和c
,并消除了不需要的NA值。
英文:
I have data from multiple files that have data hourly observations for different variables. I'm trying to combine them to a single dataframe where I have every observation for each variable for the whole time interval. Some variables have data in multiple dataframes for different time intervals. Each dataframe has a time column that can be used for joining.
The problem is that full_join
creates more rows than my data has hours (df1
). Instead I would like to get a dataframe (df2
) without NA
values and extra rows. One solution is to join the dataframes in specific order but I'm hoping for a more general solution that works with larger scale for combining datasets for this type of data.
library(tidyverse)
a <- data.frame(hour = 1:10, x = 5)
b <- data.frame(hour = 1:15, y = 15)
c <- data.frame(hour = 11:20, x = 10)
list <- list(a,b,c)
df1 <- list %>%
reduce(full_join)
df1
hour x y
1 1 5 15
2 2 5 15
3 3 5 15
4 4 5 15
5 5 5 15
6 6 5 15
7 7 5 15
8 8 5 15
9 9 5 15
10 10 5 15
11 11 NA 15
12 12 NA 15
13 13 NA 15
14 14 NA 15
15 15 NA 15
16 11 10 NA
17 12 10 NA
18 13 10 NA
19 14 10 NA
20 15 10 NA
21 16 10 NA
22 17 10 NA
23 18 10 NA
24 19 10 NA
25 20 10 NA
df2 <- full_join(a,c) %>%
full_join(b)
df2
hour x y
1 1 5 15
2 2 5 15
3 3 5 15
4 4 5 15
5 5 5 15
6 6 5 15
7 7 5 15
8 8 5 15
9 9 5 15
10 10 5 15
11 11 10 15
12 12 10 15
13 13 10 15
14 14 10 15
15 15 10 15
16 16 10 NA
17 17 10 NA
18 18 10 NA
19 19 10 NA
20 20 10 NA
EDIT:
I got the correct result with tidyverse with following code:
df <- list %>%
reduce(full_join) %>%
group_by(hour) %>%
summarise_all(mean, na.rm = T)
With data.table melt
and dcast
also seem to work
dt1 <- as.data.table(df1)
m <- as.data.table(melt(dt1, id.vars = "hour"))
dc <- dcast.data.table(m, hour~variable, fun.aggregate = mean, na.rm = T)
df <- as.data.frame(dc)
I guess mean
can be replaced with some other function but I couldn't figure which.
答案1
得分: 1
First, you need to specify the by
keyword in join.
Here is a solution using coalesce_join
from E. Visel blog post. I slightly modified it to use coalesce
only in the case when there are common columns in the two dataframes to join, otherwise it fails because the coalesced
dataframe is empty.
library(tidyverse)
a <- data.frame(hour = 1:10, x = 5)
b <- data.frame(hour = 1:15, y = 15)
c <- data.frame(hour = 11:20, x = 10)
l <- list(a,b,c)
coalesce_join <- function(x, y,
by = NULL, suffix = c(".x", ".y"),
join = dplyr::full_join, ...) {
joined <- join(x, y, by = by, suffix = suffix, ...)
# names of desired output
cols <- union(names(x), names(y))
common_cols <- setdiff(intersect(names(x), names(y)), by)
if(length common_cols > 0) {
# if there are common columns, we need to coalesce them
to_coalesce <- names(joined)[!names(joined) %in% cols]
suffix_used <- suffix[ifelse(endsWith(to_coalesce, suffix[1]), 1, 2)]
# remove suffixes and deduplicate
to_coalesce <- unique(substr(
to_coalesce,
1,
nchar(to_coalesce) - nchar(suffix_used)
))
coalesced <- purrr::map_dfc(to_coalesce, ~dplyr::coalesce(
joined[[paste0(.x, suffix[1])]],
joined[[paste0(.x, suffix[2])]]
))
names(coalesced) <- to_coalesce
joined <- dplyr::bind_cols(joined, coalesced)[cols]
}
return(joined)
}
reduce(l, coalesce_join, by="hour")
This snippet gives me the desired output without caring for the column names.
NB: According to the documentation, coalesce
finds the first non-missing element, so you have to be aware of this behavior if there are overlapping values of x
in several dataframes.
NB: I renamed l
to be your list
object. I believe it is not good for code clarity to overload language-specific keywords.
英文:
First, you need to specify the by
keyword in join.
Here is a solution using coalesce_join
from E. Visel blog post. I slightly modified it to use coalesce
only in the case when there are common columns in the two dataframes to join, otherwise it fails because the coalesced
dataframe is empty.
library(tidyverse)
a <- data.frame(hour = 1:10, x = 5)
b <- data.frame(hour = 1:15, y = 15)
c <- data.frame(hour = 11:20, x = 10)
l <- list(a,b,c)
coalesce_join <- function(x, y,
by = NULL, suffix = c(".x", ".y"),
join = dplyr::full_join, ...) {
joined <- join(x, y, by = by, suffix = suffix, ...)
# names of desired output
cols <- union(names(x), names(y))
common_cols <- setdiff(intersect(names(x), names(y)), by)
if(length(common_cols) > 0) {
# if there are common columns, we need to coalesce them
to_coalesce <- names(joined)[!names(joined) %in% cols]
suffix_used <- suffix[ifelse(endsWith(to_coalesce, suffix[1]), 1, 2)]
# remove suffixes and deduplicate
to_coalesce <- unique(substr(
to_coalesce,
1,
nchar(to_coalesce) - nchar(suffix_used)
))
coalesced <- purrr::map_dfc(to_coalesce, ~dplyr::coalesce(
joined[[paste0(.x, suffix[1])]],
joined[[paste0(.x, suffix[2])]]
))
names(coalesced) <- to_coalesce
joined <- dplyr::bind_cols(joined, coalesced)[cols]
}
return(joined)
}
reduce(l, coalesce_join, by="hour")
This snippet gives me the desired output without caring for the column names.
NB: According to the documentation, coalesce
finds the first non-missing element, so you have to be aware of this behaviour if there are overlapping values of x
in several dataframes.
NB: I renamed l
to be your list
object.
I believe it is not good for code clarity to overload language-specific keywords.
答案2
得分: 1
以下是已翻译的内容:
这里有一个版本,其中主键可以是复合主键(例如 hour
和 group
)。与您自己添加到问题中的聚合方式不同,它不能用于聚合。
识别(复合)主键
mykeys <- c("hour")
使用data.table包
list %>% lapply(setDT)
创建一个包含所有键的唯一组合以加入的数据集
full <- lapply(list, function(dt) dt[, ..mykeys]) %>%
rbindlist %>%
unique
可以将键作为完整数据集上的实际键,但不是必需的
setkeyv(full,mykeys)
将所有变量加入完整数据集。
重复项的保留值来自列表对象的最后一个数据集的顺序。这一步是通过引用完成的。
list %>% lapply(function(dt) {
n <- names(dt)
full[dt,
on = mykeys,
(n) := mget(paste0("i.", n))]
})
full
这是一个具有复合主键和更多列以帮助识别正在进行的操作的示例数据集。再次注意,列表中数据集的顺序很重要。
mykeys <- c("hour"
,"group"
)
a <- data.frame(hour = 1:10,
group = 1,
x = 5, z = 1, a = 1)
b <- data.frame(hour = 1:15,
group = 1,
y = 15, z = 2, b = 1)
c <- data.frame(hour = 11:20,
group = 1,
# group = 2,
x = 10, z = 3, c = 1)
英文:
Here is a version where the primary key can be a composite (for example hour
and group
). It doesn't work with aggregating like your own solution you added to the question.
## Identify (composite) primary key
mykeys <- c("hour")
## Using the data.table package below
list %>%
lapply(setDT)
## Create a dataset with all unique combinations of the keys to join onto
full <- lapply(list, function(dt) dt[,..mykeys]) %>%
rbindlist %>%
unique
## Can make the keys as actual keys on the full dataset but not necessary
# setkeyv(full,mykeys)
## Join all variables onto the full dataset.
## The preserved values of duplicates are from the last dataset in the order of
## the list object. This step is done by reference.
list %>%
lapply(function(dt) {
n <- names(dt)
full[dt,
on = mykeys,
(n) := mget(paste0("i.", n))]
})
full
Here is an example dataset with a composite primary key and more columns to help identify what is going on. Again, note that the order of the datasets in list
matter.
mykeys <- c("hour"
,"group"
)
a <- data.frame(hour = 1:10,
group = 1,
x = 5, z = 1, a = 1)
b <- data.frame(hour = 1:15,
group = 1,
y = 15, z = 2, b = 1)
c <- data.frame(hour = 11:20,
group = 1,
# group = 2,
x = 10, z = 3, c = 1)
通过集体智慧和协作来改善编程学习和解决问题的方式。致力于成为全球开发者共同参与的知识库,让每个人都能够通过互相帮助和分享经验来进步。
评论