Test whether strings from two datasets match by subject ID
Question
I am trying to match two columns across two datasets. The match has to be made by subject ID, and the output should be the subject IDs that do not match between the two datasets.
Data:
df1 <- structure(list(
subject_id = c("191-5467", "191-6784", "191-3457", "191-0987", "191-1245", "191-1945"),
edta_code = c("EDTA45", "EDTA79", "EDTA20", "EDTA66", "EDTA12", "EDTA74"),
ipv = c("0", "3", "2", "2", "1", "2")),
class = "data.frame", row.names = c(NA, -6L))
df2 <- structure(list(
subject_id = c("191-5467", "191-6784", "191-3457", "191-0987",
"191-1245", "191-1947"),
edta_code_df2 = c("EDTA45", "EDTA78", "EDTA20", "EDTA66", "EDTA14", "EDTA74")), class = "data.frame", row.names = c(NA, -6L))
What I have tried:
df3 <- df1 %>% select(subject_id, edta_code)
df3 <- full_join(df3, df2, by = "subject_id")
df$all_matching <- df3$edta_code == df3$edta_code_df2
The code above does not work; it raises this error:
Error in `$<-.data.frame`(`*tmp*`, all_matching, value = c(TRUE, FALSE, :
  replacement has 7 rows, data has 5
Desired output:
[![enter image description here][1]][1]
where 0 means no and 1 means yes.
After that, I pull the subject IDs that have a 0:
|subject_id|
|--------|
|191-6784|
|191-1245|
|191-1947|
|191-1945|
Note: the last two subject_ids have the correct codes, but the subject_ids themselves are incorrect, so they should count as mismatches.
[1]: https://i.stack.imgur.com/NFeTV.png
# Answer 1
**Score**: 1
```R
library(tidyverse)

# Full join on subject_id so IDs found in only one data frame are kept
df1 %>%
  select(-ipv) %>%
  full_join(df2) %>%
  # 1 if the codes agree, 0 if they differ, NA if either code is missing
  mutate(matching = if_else(edta_code == edta_code_df2, 1, 0)) %>%
  # Count a missing code on either side as a mismatch
  replace_na(list(matching = 0))
#> Joining with `by = join_by(subject_id)`
#> # A tibble: 7 × 4
#>   subject_id edta_code edta_code_df2 matching
#>   <chr>      <chr>     <chr>            <dbl>
#> 1 191-5467   EDTA45    EDTA45               1
#> 2 191-6784   EDTA79    EDTA78               0
#> 3 191-3457   EDTA20    EDTA20               1
#> 4 191-0987   EDTA66    EDTA66               1
#> 5 191-1245   EDTA12    EDTA14               0
#> 6 191-1945   EDTA74    NA                   0
#> 7 191-1947   NA        EDTA74               0
```

To pull the mismatched IDs, keep every row where `matching` is not 1:

```R
df1 %>%
  select(-ipv) %>%
  full_join(df2) %>%
  mutate(matching = if_else(edta_code == edta_code_df2, 1, 0)) %>%
  # %in% never returns NA, so rows with a missing code are kept as well
  filter(!matching %in% c(1))
#> # A tibble: 4 × 4
#>   subject_id edta_code edta_code_df2 matching
#>   <chr>      <chr>     <chr>            <dbl>
#> 1 191-6784   EDTA79    EDTA78               0
#> 2 191-1245   EDTA12    EDTA14               0
#> 3 191-1945   EDTA74    NA                  NA
#> 4 191-1947   NA        EDTA74              NA
```
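The two snippets above handle missing codes differently, and the reason is easy to miss: comparing anything with `NA` using `==` yields `NA`, not `FALSE`. The base R sketch below uses made-up vectors `a` and `b` that mimic the joined columns (they are illustrative and not part of the original answer). It shows why `replace_na()` is needed in the first snippet and why the `%in%` filter in the second one keeps the unmatched rows.

```R
# Illustrative vectors standing in for edta_code and edta_code_df2 after the join
a <- c("EDTA45", "EDTA79", "EDTA74", NA)
b <- c("EDTA45", "EDTA78", NA, "EDTA74")

# ifelse() propagates the NA produced by the comparison
matching <- ifelse(a == b, 1, 0)

# %in% never returns NA, so negating it keeps the NA elements
keep_in <- !(matching %in% 1)

# == 0 returns NA for those elements; which() (like dplyr::filter()) drops them
keep_eq <- which(matching == 0)
```

With these vectors `matching` is 1, 0, NA, NA. `keep_in` selects elements 2 to 4, while `keep_eq` contains only element 2, so a filter such as `matching == 0` would silently miss the two unmatched IDs unless the `NA` values are replaced first.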
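# Answer 2
**Score**: 1
After your `full_join`, you need to handle missing values in the `edta_code` and `edta_code_df2` columns. When checking for equality with `==`, you can add `!is.na()` conditions so that, if either or both of the two columns are missing, the `matching` column gets `FALSE` (that is, 0).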
library(tidyverse)
df3 <- full_join(df1, df2, by = "subject_id") %>%
select(-ipv) %>%
mutate(matching = +(edta_code == edta_code_df2 & !is.na(edta_code) & !is.na(edta_code_df2)))
df3
subject_id edta_code edta_code_df2 matching
1 191-5467 EDTA45 EDTA45 1
2 191-6784 EDTA79 EDTA78 0
3 191-3457 EDTA20 EDTA20 1
4 191-0987 EDTA66 EDTA66 1
5 191-1245 EDTA12 EDTA14 0
6 191-1945 EDTA74 <NA> 0
7 191-1947 <NA> EDTA74 0
df3 %>%
filter(matching == 0) %>%
pull(subject_id)
[1] "191-6784" "191-1245" "191-1945" "191-1947"
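For readers who prefer not to load the tidyverse, the same idea can be sketched in base R. This is a minimal sketch under stated assumptions, not part of the original answer: `joined` and `mismatched_ids` are names chosen here, the data frames are rebuilt from the question's data without the `ipv` column, and `merge(all = TRUE)` stands in for `full_join()`.

```R
# Data from the question, without the ipv column (rebuilt here for self-containment)
df1 <- data.frame(
  subject_id = c("191-5467", "191-6784", "191-3457", "191-0987", "191-1245", "191-1945"),
  edta_code  = c("EDTA45", "EDTA79", "EDTA20", "EDTA66", "EDTA12", "EDTA74"),
  stringsAsFactors = FALSE
)
df2 <- data.frame(
  subject_id    = c("191-5467", "191-6784", "191-3457", "191-0987", "191-1245", "191-1947"),
  edta_code_df2 = c("EDTA45", "EDTA78", "EDTA20", "EDTA66", "EDTA14", "EDTA74"),
  stringsAsFactors = FALSE
)

# all = TRUE keeps IDs that appear in only one data frame (a full join)
joined <- merge(df1, df2, by = "subject_id", all = TRUE)

# `%in% TRUE` turns NA comparisons into FALSE, so missing codes count as mismatches
joined$matching <- as.integer((joined$edta_code == joined$edta_code_df2) %in% TRUE)

# Subject IDs whose codes do not match, or that exist in only one data frame
mismatched_ids <- joined$subject_id[joined$matching == 0]
```

Note that `merge()` sorts rows by the key, so the IDs come back in sorted order rather than in the order shown above. Adding the new column to `joined` itself also guarantees that the lengths agree (7 rows on both sides), which is the condition the error in the question complains about.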