merge values in 2 columns based on duplicated value and date in other column
Question
I have a dataframe consisting of 4 columns. Each value in the first column occurs exactly twice in the dataframe. The second column is a date, and the 3rd and 4th columns are booleans. What I would like to do is merge the duplicates, combining the values in the 3rd and 4th columns based on the date. See my example below:
keys <- c("A","A","B","B","C","C")
dates <- as.POSIXct(c("2021-01-01", "2021-02-02","2021-03-03", "2021-04-04", "2021-05-05", "2021-06-06"))
vals1 <- c("T","F","T","F","F","F")
vals2 <- c("F","T","T","T","F","T")
df <- data.frame(key = keys,
date = dates,
val1 = vals1,
val2 = vals2)
This produces the following dataframe:
> df
key date val1 val2
1 A 2021-01-01 T F
2 A 2021-02-02 F T
3 B 2021-03-03 T T
4 B 2021-04-04 F T
5 C 2021-05-05 F F
6 C 2021-06-06 F T
Now I want to create a dataframe with 3 rows, one for each key, where val1 is the val1 value from the row with the earliest date and val2 is the val2 value from the row with the latest date. I have a piece of code that produces this outcome:
library(dplyr)  # needed for %>%, filter(), slice(), select()

for (i in seq_len(nrow(df))) {
val1 <- df %>%
filter(key == df$key[i]) %>%
slice(which.min(date)) %>%
ungroup() %>%
select(val1)
val2 <- df %>%
filter(key == df$key[i]) %>%
slice(which.max(date)) %>%
ungroup() %>%
select(val2)
df$val1[df$key == df$key[i]] <- val1$val1
df$val2[df$key == df$key[i]] <- val2$val2
}
df <- df[!duplicated(df$key),c("key","val1","val2")]
> df
key val1 val2
1 A T T
3 B T T
5 C F T
Note that in the final result I am not interested in the date, only in the val1 and val2 values. The date is only there to determine which val1 or val2 value should be taken. This code works fine for this dataframe, but I want to apply it to a large dataframe (~40k rows).
That is a problem, because I think this approach is very inefficient: on such a large dataframe it takes forever to compute, and my dataframes will only get larger in the future. I suspect the for loop is the cause, but I cannot think of another way to solve this yet, hence my question. I hope someone can help me out; all help is greatly appreciated.
Kind regards
Answer 1
Score: 3
library(dplyr)

df %>%
group_by(key) %>%
summarise(val1 = val1[which.min(date)],
val2 = val2[which.max(date)]) %>%
ungroup()
key val1 val2
1 A T T
2 B T T
3 C F T
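For readers who cannot use dplyr, the same per-key selection can be sketched in base R. This is only an illustration, not part of the original answer: it sorts once by key and date, then uses `duplicated()` to pick the first and last row of each key, which avoids filtering the whole dataframe once per row as the loop in the question does. The names `ordered_df`, `earliest`, `latest`, and `result` are made up for this example.

```r
# Sample data from the question
df <- data.frame(
  key  = c("A", "A", "B", "B", "C", "C"),
  date = as.POSIXct(c("2021-01-01", "2021-02-02", "2021-03-03",
                      "2021-04-04", "2021-05-05", "2021-06-06")),
  val1 = c("T", "F", "T", "F", "F", "F"),
  val2 = c("F", "T", "T", "T", "F", "T")
)

# Sort once so that, within each key, rows run from earliest to latest date
ordered_df <- df[order(df$key, df$date), ]

# First row per key carries val1 from the earliest date
earliest <- ordered_df[!duplicated(ordered_df$key), c("key", "val1")]

# Last row per key carries val2 from the latest date
latest <- ordered_df[!duplicated(ordered_df$key, fromLast = TRUE), c("key", "val2")]

# Combine the two selections into one row per key
result <- merge(earliest, latest, by = "key")
```

On the sample data, `result` should hold one row per key with the same val1 and val2 values as the output above.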
Answer 2
Score: 2
Here is an alternative approach:
library(dplyr) # dplyr 1.1.0
df %>%
summarise(val1 = val1[date == min(date)],
val2 = val2[date == max(date)], .by=key)
key val1 val2
1 A T T
2 B T T
3 C F T
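One caveat with `date == min(date)`: it selects every row that shares the minimum date, so a key whose earliest date appears on more than one row would yield more than one value per group. The question's data has no such ties, so this may never matter in practice. If ties are possible, a variant using dplyr's `first()` and `last()` with `order_by` always returns exactly one value per key. The tied row added below is invented purely to illustrate this, and `df_ties` and `result` are hypothetical names.

```r
library(dplyr)  # .by requires dplyr >= 1.1.0

# Question's data plus one invented row that ties with A's earliest date
df_ties <- data.frame(
  key  = c("A", "A", "B", "B", "C", "C", "A"),
  date = as.POSIXct(c("2021-01-01", "2021-02-02", "2021-03-03",
                      "2021-04-04", "2021-05-05", "2021-06-06",
                      "2021-01-01")),
  val1 = c("T", "F", "T", "F", "F", "F", "F"),
  val2 = c("F", "T", "T", "T", "F", "T", "F")
)

# first()/last() order the values by date and return a single element,
# so each key produces exactly one row even when dates are tied
result <- df_ties %>%
  summarise(val1 = first(val1, order_by = date),
            val2 = last(val2, order_by = date),
            .by = key)
```

Which of the tied rows wins depends on how the ordering breaks ties, so if that choice matters, a secondary sort column would need to be added explicitly.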