Elegant way to merge many duplicated columns in R?
# Question
I have two data frames that I need to merge in R. I can't really modify them before the merge. They share many columns, but the values in the duplicated columns never overlap: if a row has a value in one column, the same row in the duplicated column is NA. After the merge, I have something that looks like this:

    df1 <- data.frame(id = c(1,2,3,4),
                      Var1 = c(1,2,NA,NA),
                      Var2 = c("a","b",NA,NA))
    df2 <- data.frame(id = c(1,2,3,4),
                      Var1 = c(NA,NA,3,4),
                      Var2 = c(NA,NA,"c","d"))
    mergedData <- merge(df1, df2, by = "id")
    mergedData <- mergedData[, c("Var1.x", "Var1.y", "Var2.x", "Var2.y")]
    print(mergedData)

      Var1.x Var1.y Var2.x Var2.y
    1      1     NA      a   <NA>
    2      2     NA      b   <NA>
    3     NA      3   <NA>      c
    4     NA      4   <NA>      d

I wrote a function to use in other scripts that merges two columns and then removes one of them:

    mergeColumns <- function(df, colToRemove, colToKeep) {
      # Fill the NAs in the kept column with the values from the column being removed
      naRows <- is.na(df[[colToKeep]])
      df[[colToKeep]][naRows] <- df[[colToRemove]][naRows]
      df[[colToRemove]] <- NULL
      # Strip the merge suffix (.x or .y) from the kept column's name
      names(df)[names(df) == colToKeep] <- sub("\\.[xy]$", "", colToKeep)
      return(df)
    }

This works fine when working with a single pair of columns (although I'm sure there is a cleaner approach). However, my merged data frame has many columns that need to be merged, and I am not sure of the best way to do this. I thought about creating a data frame of the columns that need to be merged and tried using the apply family of functions, with no success (I sort of knew the approach below was unlikely to work but thought it might get me going in the right direction):

    duplicatedColumns <- data.frame(colstoKeep = c("Var1.x", "Var2.x"),
                                    coltoRemove = c("Var1.y", "Var2.y"))
    lapply(data, mergeColumns(duplicatedColumns$sampleIntakeCols, duplicatedColumns$rnaMetadataCols))
    # Error in match.fun(FUN) : argument "FUN" is missing, with no default

Any insight into how I might get this to work, or a better approach, would be much appreciated. Base R would be preferable. Thanks in advance.
# Answer 1
**Score**: 1
Given that you need to modify the same data frame repeatedly, I'd just use a good old-fashioned loop. You can also use `grep()` to identify the columns.

    # Columns are paired by position, so this assumes the .x and .y columns appear in the same order
    colstoKeep <- grep("\\.x$", names(mergedData), value = TRUE)
    colstoRemove <- grep("\\.y$", names(mergedData), value = TRUE)
    for (i in seq_along(colstoKeep)) {
      mergedData <- mergeColumns(mergedData,
                                 colToRemove = colstoRemove[[i]],
                                 colToKeep = colstoKeep[[i]])
    }
    mergedData

      Var1 Var2
    1    1    a
    2    2    b
    3    3    c
    4    4    d

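If you would rather not write the loop by hand, the same "thread one data frame through many updates" idea can be sketched with base R's `Reduce()`. This is only a sketch under a few assumptions: `coalescePair` is a hypothetical helper (not from the question), the suffixes are the default `.x`/`.y` produced by `merge()`, and columns are paired by base name rather than by position.

```r
# Example data mirroring the merged data frame from the question.
mergedData <- data.frame(
  Var1.x = c(1, 2, NA, NA),
  Var1.y = c(NA, NA, 3, 4),
  Var2.x = c("a", "b", NA, NA),
  Var2.y = c(NA, NA, "c", "d"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: fill the NAs in "<base>.x" from "<base>.y",
# drop "<base>.y", and rename "<base>.x" to the bare base name.
coalescePair <- function(df, base) {
  keep <- paste0(base, ".x")
  drop <- paste0(base, ".y")
  stopifnot(keep %in% names(df), drop %in% names(df))
  naRows <- is.na(df[[keep]])
  df[[keep]][naRows] <- df[[drop]][naRows]
  df[[drop]] <- NULL
  names(df)[names(df) == keep] <- base
  df
}

# Base names that occur with both suffixes.
xBases <- sub("\\.x$", "", grep("\\.x$", names(mergedData), value = TRUE))
yBases <- sub("\\.y$", "", grep("\\.y$", names(mergedData), value = TRUE))
bases <- intersect(xBases, yBases)

# Reduce() calls coalescePair(df, base) once per base name,
# feeding each result into the next call, starting from mergedData.
result <- Reduce(coalescePair, bases, mergedData)
print(result)
```

Pairing by base name avoids depending on the `.x` and `.y` columns being in the same order, at the cost of a slightly longer helper.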
# Answer 2
**Score**: 1
You can use a loop or the `apply` family of functions. Try this:
```R
mergedData <- data.frame(Var1.x = c(1, 2, NA, NA),
                         Var1.y = c(NA, NA, 1, 2),
                         Var2.x = c("a", "b", NA, NA),
                         Var2.y = c(NA, NA, "c", "d"),
                         stringsAsFactors = FALSE)

# stringsAsFactors = FALSE keeps the column names as character strings on R < 4.0
duplicatedColumns <- data.frame(colstoKeep = c("Var1.x", "Var2.x"),
                                coltoRemove = c("Var1.y", "Var2.y"),
                                stringsAsFactors = FALSE)

for (i in seq_len(nrow(duplicatedColumns))) {
  colToKeep <- duplicatedColumns$colstoKeep[i]
  colToRemove <- duplicatedColumns$coltoRemove[i]
  # Fill the NAs in the kept column from its duplicate, then drop the duplicate
  naRows <- is.na(mergedData[[colToKeep]])
  mergedData[[colToKeep]][naRows] <- mergedData[[colToRemove]][naRows]
  mergedData[[colToRemove]] <- NULL
  # Strip the .x/.y suffix from the kept column's name
  names(mergedData)[names(mergedData) == colToKeep] <- sub("\\.[xy]$", "", colToKeep)
}
```
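The answer mentions the apply family but only shows the loop, so here is a rough `lapply()` sketch of the same idea. It builds a new data frame instead of modifying `mergedData` in place. It assumes every `.x` column has a matching `.y` column; any column without a suffix (such as `id`) would have to be added back separately. The names `bases`, `combined`, and `result` are illustrative.

```r
# Same example data as in the answer above.
mergedData <- data.frame(
  Var1.x = c(1, 2, NA, NA),
  Var1.y = c(NA, NA, 1, 2),
  Var2.x = c("a", "b", NA, NA),
  Var2.y = c(NA, NA, "c", "d"),
  stringsAsFactors = FALSE
)

# Base names of the duplicated columns, taken from the ".x" side.
bases <- sub("\\.x$", "", grep("\\.x$", names(mergedData), value = TRUE))

# For each base name, return the ".x" column with its NAs filled from ".y".
combined <- lapply(bases, function(base) {
  x <- mergedData[[paste0(base, ".x")]]
  y <- mergedData[[paste0(base, ".y")]]
  naRows <- is.na(x)
  x[naRows] <- y[naRows]
  x
})
names(combined) <- bases

# Assemble the coalesced columns into a data frame.
result <- as.data.frame(combined, stringsAsFactors = FALSE)
print(result)
```

Index assignment is used rather than `ifelse()` because `ifelse()` drops attributes, which matters if any of the columns are factors or dates.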