March 8, 2023, 17:45:29

merge values in 2 columns based on duplicated value and date in other column

Question

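I have a data frame with 4 columns. Each value in the first column occurs exactly twice in the data frame. The second column is a date, and the third and fourth columns are booleans. I want to merge the duplicates, choosing the values of the third and fourth columns based on the date. The example below should make this clear: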

keys <- c("A","A","B","B","C","C")
dates <- as.POSIXct(c("2021-01-01", "2021-02-02","2021-03-03", "2021-04-04", "2021-05-05", "2021-06-06"))
vals1 <- c("T","F","T","F","F","F")
vals2 <- c("F","T","T","T","F","T")
df <- data.frame(key = keys,
                 date = dates,
                 val1 = vals1,
                 val2 = vals2)

This produces the following data frame:

> df
  key       date val1 val2
1   A 2021-01-01    T    F
2   A 2021-02-02    F    T
3   B 2021-03-03    T    T
4   B 2021-04-04    F    T
5   C 2021-05-05    F    F
6   C 2021-06-06    F    T

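Now I want to create a data frame with 3 rows, one per key, where val1 is the val1 value from the row with the earliest date and val2 is the val2 value from the row with the latest date. I have a piece of code that produces this result: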

library(dplyr)  # provides %>%, filter(), slice(), select()

for (i in 1:nrow(df)){
  val1 <- df %>%
    filter(key == df$key[i]) %>%
    slice(which.min(date)) %>%
    ungroup() %>%
    select(val1)
  val2 <- df %>%
    filter(key == df$key[i]) %>%
    slice(which.max(date)) %>%
    ungroup() %>%
    select(val2)
  
  df$val1[df$key == df$key[i]] <- val1$val1
  df$val2[df$key == df$key[i]] <- val2$val2
}
df <- df[!duplicated(df$key),c("key","val1","val2")]
> df
  key val1 val2
1   A    T    T
3   B    T    T
5   C    F    T

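Note that I am not interested in the date in the final result, only in the val1 and val2 values. The date only determines which val1 or val2 value is taken. This code works fine for this data frame, but I want to apply it to a large data frame (~40k rows).

That is a problem, because I think this approach is very inefficient: running it on a data frame that large takes a very long time, and my data frames will only get larger. I suspect the for loop is the cause, but I cannot think of another way to solve this yet, hence my question. I hope someone can help me out; all help is greatly appreciated.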

Kind regards

Answer 1

Score: 3

library(dplyr)

df %>%
  group_by(key) %>%
  summarise(val1 = val1[which.min(date)],
            val2 = val2[which.max(date)]) %>%
  ungroup()
  key   val1  val2
1 A     T     T
2 B     T     T
3 C     F     T
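
For readers who cannot use dplyr, the same per-key selection can be sketched in base R. This is only an illustration under stated assumptions, not part of the original answer: it sorts the rows once by key and date, then takes the first row of each key for val1 and the last row of each key for val2. The names ord, first_rows, last_rows and result are made up for this example.

```r
# Example data from the question.
df <- data.frame(
  key  = c("A", "A", "B", "B", "C", "C"),
  date = as.POSIXct(c("2021-01-01", "2021-02-02", "2021-03-03",
                      "2021-04-04", "2021-05-05", "2021-06-06")),
  val1 = c("T", "F", "T", "F", "F", "F"),
  val2 = c("F", "T", "T", "T", "F", "T")
)

# Sort once by key, then by date within each key.
ord <- df[order(df$key, df$date), ]

# After sorting, the first row of each key has the earliest date
# and the last row of each key has the latest date.
first_rows <- ord[!duplicated(ord$key), c("key", "val1")]
last_rows  <- ord[!duplicated(ord$key, fromLast = TRUE), c("key", "val2")]

# Join the two halves back together on key.
result <- merge(first_rows, last_rows, by = "key")
print(result)
```

Like the grouped summarise, this avoids filtering the whole data frame once per row. If two rows of a key share the same latest date, this sketch takes the last of them in the original row order, whereas which.max takes the first, so the two approaches can differ on tied dates.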

Answer 2

Score: 2

Here is an alternative approach:

library(dplyr) # dplyr 1.1.0
df %>%
  summarise(val1 = val1[date == min(date)],
            val2 = val2[date == max(date)], .by=key)
  key val1 val2
1   A    T    T
2   B    T    T
3   C    F    T
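
The two answers select rows differently, which matters only when dates are tied within a key. The question does not say whether that can happen, so the tied dates below are a hypothetical case: which.min() returns a single position, while date == min(date) matches every row with the minimum date.

```r
# Hypothetical case: both rows of one key share the same date.
date <- as.POSIXct(c("2021-01-01", "2021-01-01"), tz = "UTC")
val1 <- c("T", "F")

by_position <- val1[which.min(date)]     # position of the first minimum only
by_equality <- val1[date == min(date)]   # every row whose date equals the minimum

print(length(by_position))
print(length(by_equality))
```

With unique dates per key, as in the question's example, both forms select exactly one value per key.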
