英文:
Exclude non-matching strings from data frame
问题
我有两个数据框DFA
和DFB
,它们都有一个共同的ID列ID3
。我想要找出在DFA
中存在但在DFB
中缺失的ID3。然后,我想要将DFA
和DFB
合并在一起,但只保留共同的ID3。
以下是你尝试的代码,我将其翻译为中文:
# 找出在DFA中存在但在DFB中缺失的ID3
DIFFERENT_IDS <- anti_join(DFA, DFB, by = "ID2")
my_ID2_N <- as.vector(unlist(DIFFERENT_IDS$ID2))
# 从DFA中删除不包含在my_ID2_N中的ID3
DFA1 <- DFA
DFA1 <- DFA1[!(DFA1$ID2 %in% my_ID2_N),]
这段代码看起来基本正确,但你提到可能存在ID2多次出现的情况。如果你的目标是仅保留在两个数据框中都有的ID3,你可以考虑使用semi_join
而不是anti_join
来查找共同的ID3,然后合并数据框。以下是更新后的代码:
# 找出在DFA和DFB中都存在的ID3
COMMON_IDS <- semi_join(DFA, DFB, by = "ID2")
# 根据共同的ID3合并DFA和DFB
result <- merge(DFA, DFB, by = "ID2")
# 选取需要的列
result <- result[, c("ID3", "Time", "Day", "Month", "Year", "First", "Last", "Age")]
这段代码会找到在DFA
和DFB
中都存在的ID3,然后将它们合并在一起,并只选择所需的列。
英文:
I have two data frames DFA
and DFB
with a common ID column ID3
. I want to identify the ID3's which are present in DFA but missing from DFB. I then want to merge DFA and DFB together, but keeping only the common ID3's.
DFA:
Time | Day | Month | Year | ID3 |
---|---|---|---|---|
13 | 12 | 12 | 23 | 13_12_12_23 |
12 | 9 | 9 | 91 | 12_9_9_91 |
10 | 2 | 3 | 80 | 10_2_3_80 |
DFB:
First | Last | Age | ID3 |
---|---|---|---|
Tom | Smith | 25 | 13_12_12_23 |
Phil | Green | 41 | 6_11_8_45 |
Joe | Bloggs | 71 | 12_9_9_91 |
Ben | Bolton | 84 | 2_1_9_63 |
Kate | Jones | 34 | 10_2_3_80 |
Intended result:
ID3 | Time | Day | Month | Year | First | Last | Age |
---|---|---|---|---|---|---|---|
13_12_12_23 | 13 | 12 | 12 | 23 | Tom | Smith | 25 |
12_9_9_91 | 12 | 9 | 9 | 91 | Joe | Bloggs | 71 |
10_2_3_80 | 10 | 2 | 3 | 80 | Kate | Jones | 34 |
I have tried the following, which doesn't seem to be identifying the different ID2's correctly
DIFFERENT_IDS <-anti_join(DFA, DFB, by = "ID2")
my_ID2_N <- as.vector(unlist(DIFFERENT_IDS$ID2))
DFA1 <- DFA
DFA1 <-DFA1[!(DFA1$ID2 %in% my_ID2_N),]
Any suggestions of where I might be going wrong? It is probably worth noting too that the actual data frames I am merging contain different numbers of rows within each of the ID2 (so we have ID2's listed multiple times for each trial sometimes, depending on the data points being reported, so perhaps this is causing the issue?)
答案1
得分: 1
你下次应该提供一个可重现的代码...
如果这个方法奏效,请告诉我。我使用左连接来只保留与DFB中匹配的ID。
在R中,使用library(data.pasta)
可以获得最强大的插件(esquisse
也不错)。
DFA <- data.frame(
stringsAsFactors = FALSE,
Time = c(13L, 12L, 10L),
Day = c(12L, 9L, 2L),
Month = c(12L, 9L, 3L),
Year = c(23L, 91L, 80L),
ID3 = c("13_12_12_23", "12_9_9_91", "10_2_3_80")
)
DFB <- data.frame(
stringsAsFactors = FALSE,
First = c("Tom", "Phil", "Joe", "Ben", "Kate"),
Last = c("Smith", "Green", "Bloggs", "Bolton", "Jones"),
Age = c(25L, 41L, 71L, 84L, 34L),
ID3 = c("13_12_12_23",
"6_11_8_45","12_9_9_91","2_1_9_63","10_2_3_80")
)
left_join(DFA, DFB) %>%
relocate(ID3, .before = everything())
输出:
ID3 Time Day Month Year First Last Age
1 13_12_12_23 13 12 12 23 Tom Smith 25
2 12_9_9_91 12 9 9 91 Joe Bloggs 71
3 10_2_3_80 10 2 3 80 Kate Jones 34
英文:
You should really provide a reproducible code next time...
Let me know if that did the trick. I used left join to keep only the matching ID's from from DFB
use library(data.pasta)
to have the greatest addin possible in R (esquisse
is also nice)
DFA <- data.frame(
stringsAsFactors = FALSE,
Time = c(13L, 12L, 10L),
Day = c(12L, 9L, 2L),
Month = c(12L, 9L, 3L),
Year = c(23L, 91L, 80L),
ID3 = c("13_12_12_23", "12_9_9_91", "10_2_3_80")
)
DFB <- data.frame(
stringsAsFactors = FALSE,
First = c("Tom", "Phil", "Joe", "Ben", "Kate"),
Last = c("Smith", "Green", "Bloggs", "Bolton", "Jones"),
Age = c(25L, 41L, 71L, 84L, 34L),
ID3 = c("13_12_12_23",
"6_11_8_45","12_9_9_91","2_1_9_63","10_2_3_80")
)
left_join(DFA,DFB) %>%
relocate(ID3, .before = everything())
ID3 Time Day Month Year First Last Age
1 13_12_12_23 13 12 12 23 Tom Smith 25
2 12_9_9_91 12 9 9 91 Joe Bloggs 71
3 10_2_3_80 10 2 3 80 Kate Jones 34
通过集体智慧和协作来改善编程学习和解决问题的方式。致力于成为全球开发者共同参与的知识库,让每个人都能够通过互相帮助和分享经验来进步。
评论