2023年4月13日 15:33:23go评论92阅读模式

英文:

Merge dataframes without duplicate rows with NA values

问题

你可以使用以下代码来合并数据框并消除NA值以获得df2：

df2 <- a %>%
  full_join(b, by = "hour") %>%
  full_join(c, by = "hour")

这将按照小时(hour)列连接数据框a，b和c，并消除了不需要的NA值。

英文:

I have data from multiple files that have data hourly observations for different variables. I'm trying to combine them to a single dataframe where I have every observation for each variable for the whole time interval. Some variables have data in multiple dataframes for different time intervals. Each dataframe has a time column that can be used for joining.

The problem is that full_join creates more rows than my data has hours (df1). Instead I would like to get a dataframe (df2) without NA values and extra rows. One solution is to join the dataframes in specific order but I'm hoping for a more general solution that works with larger scale for combining datasets for this type of data.

library(tidyverse)
a &lt;- data.frame(hour = 1:10, x = 5)
b &lt;- data.frame(hour = 1:15, y = 15)
c &lt;- data.frame(hour = 11:20, x = 10)

list &lt;- list(a,b,c)

df1 &lt;- list %&gt;%
  reduce(full_join)

df1
   hour  x  y
1     1  5 15
2     2  5 15
3     3  5 15
4     4  5 15
5     5  5 15
6     6  5 15
7     7  5 15
8     8  5 15
9     9  5 15
10   10  5 15
11   11 NA 15
12   12 NA 15
13   13 NA 15
14   14 NA 15
15   15 NA 15
16   11 10 NA
17   12 10 NA
18   13 10 NA
19   14 10 NA
20   15 10 NA
21   16 10 NA
22   17 10 NA
23   18 10 NA
24   19 10 NA
25   20 10 NA

df2 &lt;- full_join(a,c) %&gt;%
  full_join(b)

df2
   hour  x  y
1     1  5 15
2     2  5 15
3     3  5 15
4     4  5 15
5     5  5 15
6     6  5 15
7     7  5 15
8     8  5 15
9     9  5 15
10   10  5 15
11   11 10 15
12   12 10 15
13   13 10 15
14   14 10 15
15   15 10 15
16   16 10 NA
17   17 10 NA
18   18 10 NA
19   19 10 NA
20   20 10 NA

EDIT:
I got the correct result with tidyverse with following code:

df &lt;- list %&gt;%
  reduce(full_join) %&gt;%
  group_by(hour) %&gt;%
  summarise_all(mean, na.rm = T)

With data.table meltand dcast also seem to work

dt1 &lt;- as.data.table(df1)
m &lt;- as.data.table(melt(dt1, id.vars = &quot;hour&quot;))
dc &lt;- dcast.data.table(m, hour~variable, fun.aggregate = mean, na.rm = T)
df &lt;- as.data.frame(dc)

I guess mean can be replaced with some other function but I couldn't figure which.

答案1

得分: 1

First, you need to specify the by keyword in join.

Here is a solution using coalesce_join from E. Visel blog post. I slightly modified it to use coalesce only in the case when there are common columns in the two dataframes to join, otherwise it fails because the coalesced dataframe is empty.

library(tidyverse)

a &lt;- data.frame(hour = 1:10, x = 5)
b &lt;- data.frame(hour = 1:15, y = 15)
c &lt;- data.frame(hour = 11:20, x = 10)

l &lt;- list(a,b,c)

coalesce_join &lt;- function(x, y, 
                              by = NULL, suffix = c(&quot;.x&quot;, &quot;.y&quot;), 
                              join = dplyr::full_join, ...) {
        joined &lt;- join(x, y, by = by, suffix = suffix, ...)
        # names of desired output
        cols &lt;- union(names(x), names(y))
        common_cols &lt;- setdiff(intersect(names(x), names(y)), by)
        if(length common_cols &gt; 0) {
          # if there are common columns, we need to coalesce them
          to_coalesce &lt;- names(joined)[!names(joined) %in% cols]
          suffix_used &lt;- suffix[ifelse(endsWith(to_coalesce, suffix[1]), 1, 2)]
          # remove suffixes and deduplicate
          to_coalesce &lt;- unique(substr(
              to_coalesce, 
              1, 
              nchar(to_coalesce) - nchar(suffix_used)
          ))
          
          coalesced &lt;- purrr::map_dfc(to_coalesce, ~dplyr::coalesce(
              joined[[paste0(.x, suffix[1])]], 
              joined[[paste0(.x, suffix[2])]]
          ))
          names(coalesced) &lt;- to_coalesce

          joined &lt;- dplyr::bind_cols(joined, coalesced)[cols]
      }
      return(joined)
}

reduce(l, coalesce_join, by=&quot;hour&quot;)

This snippet gives me the desired output without caring for the column names.

NB: According to the documentation, coalesce finds the first non-missing element, so you have to be aware of this behavior if there are overlapping values of x in several dataframes.

NB: I renamed l to be your list object. I believe it is not good for code clarity to overload language-specific keywords.

英文:

First, you need to specify the by keyword in join.
Here is a solution using coalesce_join from E. Visel blog post. I slightly modified it to use coalesce only in the case when there are common columns in the two dataframes to join, otherwise it fails because the coalesced dataframe is empty.

library(tidyverse)

a &lt;- data.frame(hour = 1:10, x = 5)
b &lt;- data.frame(hour = 1:15, y = 15)
c &lt;- data.frame(hour = 11:20, x = 10)

l &lt;- list(a,b,c)

coalesce_join &lt;- function(x, y, 
                              by = NULL, suffix = c(&quot;.x&quot;, &quot;.y&quot;), 
                              join = dplyr::full_join, ...) {
        joined &lt;- join(x, y, by = by, suffix = suffix, ...)
        # names of desired output
        cols &lt;- union(names(x), names(y))
        common_cols &lt;- setdiff(intersect(names(x), names(y)), by)
        if(length(common_cols) &gt; 0) {
          # if there are common columns, we need to coalesce them
          to_coalesce &lt;- names(joined)[!names(joined) %in% cols]
          suffix_used &lt;- suffix[ifelse(endsWith(to_coalesce, suffix[1]), 1, 2)]
          # remove suffixes and deduplicate
          to_coalesce &lt;- unique(substr(
              to_coalesce, 
              1, 
              nchar(to_coalesce) - nchar(suffix_used)
          ))
          
          coalesced &lt;- purrr::map_dfc(to_coalesce, ~dplyr::coalesce(
              joined[[paste0(.x, suffix[1])]], 
              joined[[paste0(.x, suffix[2])]]
          ))
          names(coalesced) &lt;- to_coalesce

          joined &lt;- dplyr::bind_cols(joined, coalesced)[cols]
      }
      return(joined)
}

reduce(l, coalesce_join, by=&quot;hour&quot;)

This snippet gives me the desired output without caring for the column names.

NB: According to the documentation, coalesce finds the first non-missing element, so you have to be aware of this behaviour if there are overlapping values of x in several dataframes.

NB: I renamed l to be your list object.
I believe it is not good for code clarity to overload language-specific keywords.

答案2

得分: 1

以下是已翻译的内容：

这里有一个版本，其中主键可以是复合主键（例如 hour 和 group）。与您自己添加到问题中的聚合方式不同，它不能用于聚合。

识别（复合）主键

mykeys <- c("hour")

使用data.table包

list %>% lapply(setDT)

创建一个包含所有键的唯一组合以加入的数据集

full <- lapply(list, function(dt) dt[, ..mykeys]) %>%
rbindlist %>%
unique

可以将键作为完整数据集上的实际键，但不是必需的

setkeyv(full,mykeys)

将所有变量加入完整数据集。

重复项的保留值来自列表对象的最后一个数据集的顺序。这一步是通过引用完成的。

list %>% lapply(function(dt) {
n <- names(dt)
full[dt,
on = mykeys,
(n) := mget(paste0("i.", n))]
})
full

这是一个具有复合主键和更多列以帮助识别正在进行的操作的示例数据集。再次注意，列表中数据集的顺序很重要。

mykeys <- c("hour"
,"group"
)

a <- data.frame(hour = 1:10,
group = 1,
x = 5, z = 1, a = 1)
b <- data.frame(hour = 1:15,
group = 1,
y = 15, z = 2, b = 1)
c <- data.frame(hour = 11:20,
group = 1,
# group = 2,
x = 10, z = 3, c = 1)

英文:

Here is a version where the primary key can be a composite (for example hour and group). It doesn't work with aggregating like your own solution you added to the question.

## Identify (composite) primary key 
mykeys &lt;- c(&quot;hour&quot;)

## Using the data.table package below
list %&gt;%
    lapply(setDT)

## Create a dataset with all unique combinations of the keys to join onto
full &lt;- lapply(list, function(dt) dt[,..mykeys]) %&gt;%
    rbindlist %&gt;%
    unique

## Can make the keys as actual keys on the full dataset but not necessary
# setkeyv(full,mykeys)

## Join all variables onto the full dataset.
## The preserved values of duplicates are from the last dataset in the order of 
## the list object. This step is done by reference.
list %&gt;%
    lapply(function(dt) {
        n &lt;- names(dt)
        full[dt,
             on = mykeys,
             (n) := mget(paste0(&quot;i.&quot;, n))]
    })
full

Here is an example dataset with a composite primary key and more columns to help identify what is going on. Again, note that the order of the datasets in list matter.

mykeys &lt;- c(&quot;hour&quot;
            ,&quot;group&quot;
)

a &lt;- data.frame(hour = 1:10, 
                group = 1,
                x = 5, z = 1, a = 1)
b &lt;- data.frame(hour = 1:15, 
                group = 1,
                y = 15, z = 2, b = 1)
c &lt;- data.frame(hour = 11:20, 
                group = 1,
                # group = 2,
                x = 10, z = 3, c = 1)

通过集体智慧和协作来改善编程学习和解决问题的方式。致力于成为全球开发者共同参与的知识库，让每个人都能够通过互相帮助和分享经验来进步。

合并数据框，不包括重复行和带有NA值的行。

问题

答案1

答案2

识别（复合）主键

使用data.table包

创建一个包含所有键的唯一组合以加入的数据集

可以将键作为完整数据集上的实际键，但不是必需的

setkeyv(full,mykeys)

将所有变量加入完整数据集。

重复项的保留值来自列表对象的最后一个数据集的顺序。这一步是通过引用完成的。

如何在同一个代码块中使用dplyr的rename和mutate动词。

ggplot指定经度/纬度轴刻度值

在 R 中找到一条道路的长度。

在R中，通过”S.NO”分组收集数据时，在数据的开头和结尾添加”Age”的NA值。

What's the correct way to type hint an empty list as a literal in python?

如何在Highcharts Gantt中更改本地化的星期名称

如何在同一个流中使用多个过滤器和映射函数？

如何使用Map/Set来将代码优化到O(n)？

.NET MAUI Android在GitHub Actions上构建失败，错误代码为1。

如何在Playwright视觉比较中屏蔽多个定位器？

在C++中，可以使用可变模板参数来检索类型的内部类型。

selenium.common.exceptions.StaleElementReferenceException: Message: stale element reference: stale element not found

Creating and opening a URL to log in to Website via Basic Auth with Robot Framework/Selenium (Python)

AG Grid 在上下文菜单中以大文本形式打开

发表评论