merge values in 2 columns based on duplicated value and date in other column

Question

I have a dataframe with 4 columns. Each value in the first column occurs exactly twice in the dataframe. The second column is a date, and the third and fourth columns are booleans (stored as "T"/"F" strings in the example below). What I would like to do is merge the duplicates, choosing the values of the third and fourth columns based on the date. See my example below:

keys <- c("A","A","B","B","C","C")
dates <- as.POSIXct(c("2021-01-01", "2021-02-02","2021-03-03", "2021-04-04", "2021-05-05", "2021-06-06"))
vals1 <- c("T","F","T","F","F","F")
vals2 <- c("F","T","T","T","F","T")

df <- data.frame(key = keys,
                 date = dates,
                 val1 = vals1,
                 val2 = vals2)

This will produce the following dataframe:

> df
  key       date val1 val2
1   A 2021-01-01    T    F
2   A 2021-02-02    F    T
3   B 2021-03-03    T    T
4   B 2021-04-04    F    T
5   C 2021-05-05    F    F
6   C 2021-06-06    F    T

Now I want to create a dataframe with 3 rows, one for each key, where val1 is the val1 value from the row with the earliest date and val2 is the val2 value from the row with the latest date. I have a piece of code that produces this outcome:

library(dplyr) # needed for %>%, filter(), slice(), select()

for (i in 1:nrow(df)){
  # val1 from the row with the earliest date for this key
  val1 <- df %>%
    filter(key == df$key[i]) %>%
    slice(which.min(date)) %>%
    ungroup() %>%
    select(val1)
  # val2 from the row with the latest date for this key
  val2 <- df %>%
    filter(key == df$key[i]) %>%
    slice(which.max(date)) %>%
    ungroup() %>%
    select(val2)

  # overwrite both rows of this key with the selected values
  df$val1[df$key == df$key[i]] <- val1$val1
  df$val2[df$key == df$key[i]] <- val2$val2
}

# keep one row per key and drop the date column
df <- df[!duplicated(df$key),c("key","val1","val2")]

> df
  key val1 val2
1   A    T    T
3   B    T    T
5   C    F    T

Note that in the final result I am not interested in the date, only in the val1 and val2 values. The date is just there to determine which val1 or val2 value should be taken. This code works fine for this dataframe, but I want to apply it to a large dataframe (~40k rows).

This is a problem, because I think this is a very inefficient way to do it: applying it to such a large dataframe takes forever to compute, and my dataframes will only get larger in the future. I suspect it is due to the for loop, but I cannot think of another way to solve this problem yet, hence my question. I hope someone can help me out; all help is greatly appreciated.

Kind regards

Answer 1

Score: 3

df %>%
  group_by(key) %>%
  summarise(val1 = val1[which.min(date)],
            val2 = val2[which.max(date)]) %>%
  ungroup()

  key   val1  val2 
1 A     T     T    
2 B     T     T    
3 C     F     T   
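
For comparison, the same result can probably be reached in base R without a loop. The following is only a sketch and not part of the answer above: it rebuilds the example data from the question, and the names `sorted`, `first_rows`, `last_rows` and `result` are made up for illustration. It assumes the dates within each key are distinct.

```r
# Same example data as in the question.
df <- data.frame(
  key  = c("A", "A", "B", "B", "C", "C"),
  date = as.POSIXct(c("2021-01-01", "2021-02-02", "2021-03-03",
                      "2021-04-04", "2021-05-05", "2021-06-06")),
  val1 = c("T", "F", "T", "F", "F", "F"),
  val2 = c("F", "T", "T", "T", "F", "T")
)

# Sort once by key, then by date within each key.
sorted <- df[order(df$key, df$date), ]

# First row per key = earliest date; last row per key = latest date.
first_rows <- sorted[!duplicated(sorted$key), ]
last_rows  <- sorted[!duplicated(sorted$key, fromLast = TRUE), ]

# Both subsets are ordered by key, so their rows line up one-to-one.
result <- data.frame(
  key  = first_rows$key,
  val1 = first_rows$val1,
  val2 = last_rows$val2
)
print(result)
```

The idea is to sort once and then pick the first and last row of each key, rather than filtering the whole dataframe again for every row as the loop in the question does.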

Answer 2

Score: 2

Here is an alternative approach:

library(dplyr) # the .by argument requires dplyr >= 1.1.0
df %>%
  summarise(val1 = val1[date == min(date)],
            val2 = val2[date == max(date)], .by=key)

  key val1 val2
1   A    T    T
2   B    T    T
3   C    F    T
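
One difference between the two answers shows up when two rows of the same key share the same date. The sketch below uses made-up data (not from the question) to illustrate it: `which.min()` returns a single index, while `date == min(date)` keeps every tied row.

```r
# Hypothetical data: the first two rows share the earliest date.
dates <- as.POSIXct(c("2021-01-01", "2021-01-01", "2021-02-02"))
vals  <- c("T", "F", "F")

# which.min() returns the index of the first minimum only.
first_only <- vals[which.min(dates)]

# A logical comparison keeps every element that matches the minimum.
all_ties <- vals[dates == min(dates)]

print(first_only)  # "T"
print(all_ties)    # "T" "F"
```

With the question's data, where the dates within a key are distinct, both forms select exactly one value per key.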

huangapple
  • Posted on March 8, 2023, 17:45:29
  • When reposting, please keep the link to this article: https://go.coder-hub.com/75671468.html