合并数据框,不包括重复行和带有NA值的行。

huangapple go评论79阅读模式
英文:

Merge dataframes without duplicate rows with NA values

问题

你可以使用以下代码来合并数据框并消除NA值以获得df2

df2 <- a %>%
  full_join(b, by = "hour") %>%
  full_join(c, by = "hour")

这将按照小时(hour)列连接数据框abc,并消除了不需要的NA值。

英文:

I have data from multiple files that have data hourly observations for different variables. I'm trying to combine them to a single dataframe where I have every observation for each variable for the whole time interval. Some variables have data in multiple dataframes for different time intervals. Each dataframe has a time column that can be used for joining.

The problem is that full_join creates more rows than my data has hours (df1). Instead I would like to get a dataframe (df2) without NA values and extra rows. One solution is to join the dataframes in specific order but I'm hoping for a more general solution that works with larger scale for combining datasets for this type of data.

library(tidyverse)
a &lt;- data.frame(hour = 1:10, x = 5)
b &lt;- data.frame(hour = 1:15, y = 15)
c &lt;- data.frame(hour = 11:20, x = 10)

list &lt;- list(a,b,c)

df1 &lt;- list %&gt;%
  reduce(full_join)

df1
   hour  x  y
1     1  5 15
2     2  5 15
3     3  5 15
4     4  5 15
5     5  5 15
6     6  5 15
7     7  5 15
8     8  5 15
9     9  5 15
10   10  5 15
11   11 NA 15
12   12 NA 15
13   13 NA 15
14   14 NA 15
15   15 NA 15
16   11 10 NA
17   12 10 NA
18   13 10 NA
19   14 10 NA
20   15 10 NA
21   16 10 NA
22   17 10 NA
23   18 10 NA
24   19 10 NA
25   20 10 NA

df2 &lt;- full_join(a,c) %&gt;%
  full_join(b)

df2
   hour  x  y
1     1  5 15
2     2  5 15
3     3  5 15
4     4  5 15
5     5  5 15
6     6  5 15
7     7  5 15
8     8  5 15
9     9  5 15
10   10  5 15
11   11 10 15
12   12 10 15
13   13 10 15
14   14 10 15
15   15 10 15
16   16 10 NA
17   17 10 NA
18   18 10 NA
19   19 10 NA
20   20 10 NA


EDIT:
I got the correct result with tidyverse with following code:

df &lt;- list %&gt;%
  reduce(full_join) %&gt;%
  group_by(hour) %&gt;%
  summarise_all(mean, na.rm = T)

With data.table meltand dcast also seem to work

dt1 &lt;- as.data.table(df1)
m &lt;- as.data.table(melt(dt1, id.vars = &quot;hour&quot;))
dc &lt;- dcast.data.table(m, hour~variable, fun.aggregate = mean, na.rm = T)
df &lt;- as.data.frame(dc)

I guess mean can be replaced with some other function but I couldn't figure which.

答案1

得分: 1

First, you need to specify the by keyword in join.

Here is a solution using coalesce_join from E. Visel blog post. I slightly modified it to use coalesce only in the case when there are common columns in the two dataframes to join, otherwise it fails because the coalesced dataframe is empty.

library(tidyverse)

a &lt;- data.frame(hour = 1:10, x = 5)
b &lt;- data.frame(hour = 1:15, y = 15)
c &lt;- data.frame(hour = 11:20, x = 10)

l &lt;- list(a,b,c)

coalesce_join &lt;- function(x, y, 
                              by = NULL, suffix = c(&quot;.x&quot;, &quot;.y&quot;), 
                              join = dplyr::full_join, ...) {
        joined &lt;- join(x, y, by = by, suffix = suffix, ...)
        # names of desired output
        cols &lt;- union(names(x), names(y))
        common_cols &lt;- setdiff(intersect(names(x), names(y)), by)
        if(length common_cols &gt; 0) {
          # if there are common columns, we need to coalesce them
          to_coalesce &lt;- names(joined)[!names(joined) %in% cols]
          suffix_used &lt;- suffix[ifelse(endsWith(to_coalesce, suffix[1]), 1, 2)]
          # remove suffixes and deduplicate
          to_coalesce &lt;- unique(substr(
              to_coalesce, 
              1, 
              nchar(to_coalesce) - nchar(suffix_used)
          ))
          
          coalesced &lt;- purrr::map_dfc(to_coalesce, ~dplyr::coalesce(
              joined[[paste0(.x, suffix[1])]], 
              joined[[paste0(.x, suffix[2])]]
          ))
          names(coalesced) &lt;- to_coalesce

          joined &lt;- dplyr::bind_cols(joined, coalesced)[cols]
      }
      return(joined)
}

reduce(l, coalesce_join, by=&quot;hour&quot;)

This snippet gives me the desired output without caring for the column names.

NB: According to the documentation, coalesce finds the first non-missing element, so you have to be aware of this behavior if there are overlapping values of x in several dataframes.

NB: I renamed l to be your list object. I believe it is not good for code clarity to overload language-specific keywords.

英文:

First, you need to specify the by keyword in join.
Here is a solution using coalesce_join from E. Visel blog post. I slightly modified it to use coalesce only in the case when there are common columns in the two dataframes to join, otherwise it fails because the coalesced dataframe is empty.

library(tidyverse)

a &lt;- data.frame(hour = 1:10, x = 5)
b &lt;- data.frame(hour = 1:15, y = 15)
c &lt;- data.frame(hour = 11:20, x = 10)

l &lt;- list(a,b,c)

coalesce_join &lt;- function(x, y, 
                              by = NULL, suffix = c(&quot;.x&quot;, &quot;.y&quot;), 
                              join = dplyr::full_join, ...) {
        joined &lt;- join(x, y, by = by, suffix = suffix, ...)
        # names of desired output
        cols &lt;- union(names(x), names(y))
        common_cols &lt;- setdiff(intersect(names(x), names(y)), by)
        if(length(common_cols) &gt; 0) {
          # if there are common columns, we need to coalesce them
          to_coalesce &lt;- names(joined)[!names(joined) %in% cols]
          suffix_used &lt;- suffix[ifelse(endsWith(to_coalesce, suffix[1]), 1, 2)]
          # remove suffixes and deduplicate
          to_coalesce &lt;- unique(substr(
              to_coalesce, 
              1, 
              nchar(to_coalesce) - nchar(suffix_used)
          ))
          
          coalesced &lt;- purrr::map_dfc(to_coalesce, ~dplyr::coalesce(
              joined[[paste0(.x, suffix[1])]], 
              joined[[paste0(.x, suffix[2])]]
          ))
          names(coalesced) &lt;- to_coalesce

          joined &lt;- dplyr::bind_cols(joined, coalesced)[cols]
      }
      return(joined)
}

reduce(l, coalesce_join, by=&quot;hour&quot;)

This snippet gives me the desired output without caring for the column names.

NB: According to the documentation, coalesce finds the first non-missing element, so you have to be aware of this behaviour if there are overlapping values of x in several dataframes.


NB: I renamed l to be your list object.
I believe it is not good for code clarity to overload language-specific keywords.

答案2

得分: 1

以下是已翻译的内容:

这里有一个版本,其中主键可以是复合主键(例如 hourgroup)。与您自己添加到问题中的聚合方式不同,它不能用于聚合。

识别(复合)主键

mykeys <- c("hour")

使用data.table包

list %>% lapply(setDT)

创建一个包含所有键的唯一组合以加入的数据集

full <- lapply(list, function(dt) dt[, ..mykeys]) %>%
rbindlist %>%
unique

可以将键作为完整数据集上的实际键,但不是必需的

setkeyv(full,mykeys)

将所有变量加入完整数据集。

重复项的保留值来自列表对象的最后一个数据集的顺序。这一步是通过引用完成的。

list %>% lapply(function(dt) {
n <- names(dt)
full[dt,
on = mykeys,
(n) := mget(paste0("i.", n))]
})
full

这是一个具有复合主键和更多列以帮助识别正在进行的操作的示例数据集。再次注意,列表中数据集的顺序很重要。

mykeys <- c("hour"
,"group"
)

a <- data.frame(hour = 1:10,
group = 1,
x = 5, z = 1, a = 1)
b <- data.frame(hour = 1:15,
group = 1,
y = 15, z = 2, b = 1)
c <- data.frame(hour = 11:20,
group = 1,
# group = 2,
x = 10, z = 3, c = 1)

英文:

Here is a version where the primary key can be a composite (for example hour and group). It doesn't work with aggregating like your own solution you added to the question.

## Identify (composite) primary key 
mykeys &lt;- c(&quot;hour&quot;)

## Using the data.table package below
list %&gt;%
    lapply(setDT)

## Create a dataset with all unique combinations of the keys to join onto
full &lt;- lapply(list, function(dt) dt[,..mykeys]) %&gt;%
    rbindlist %&gt;%
    unique

## Can make the keys as actual keys on the full dataset but not necessary
# setkeyv(full,mykeys)

## Join all variables onto the full dataset.
## The preserved values of duplicates are from the last dataset in the order of 
## the list object. This step is done by reference.
list %&gt;%
    lapply(function(dt) {
        n &lt;- names(dt)
        full[dt,
             on = mykeys,
             (n) := mget(paste0(&quot;i.&quot;, n))]
    })
full

Here is an example dataset with a composite primary key and more columns to help identify what is going on. Again, note that the order of the datasets in list matter.

mykeys &lt;- c(&quot;hour&quot;
            ,&quot;group&quot;
)

a &lt;- data.frame(hour = 1:10, 
                group = 1,
                x = 5, z = 1, a = 1)
b &lt;- data.frame(hour = 1:15, 
                group = 1,
                y = 15, z = 2, b = 1)
c &lt;- data.frame(hour = 11:20, 
                group = 1,
                # group = 2,
                x = 10, z = 3, c = 1)

huangapple
  • 本文由 发表于 2023年4月13日 15:33:23
  • 转载请务必保留本文链接:https://go.coder-hub.com/76002775.html
匿名

发表评论

匿名网友

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen:

确定