Test if strings from two datasets match by subject id
Question
I am trying to do a simple match of two columns, but I need to match by subject id, and the output should be the subject ids that do not match across the two datasets.
Data:
```R
df1 <- structure(list(
  subject_id = c("191-5467", "191-6784", "191-3457", "191-0987", "191-1245", "191-1945"),
  edta_code = c("EDTA45", "EDTA79", "EDTA20", "EDTA66", "EDTA12", "EDTA74"),
  ipv = c("0", "3", "2", "2", "1", "2")),
  class = "data.frame", row.names = c(NA, -6L))

df2 <- structure(list(
  subject_id = c("191-5467", "191-6784", "191-3457", "191-0987",
                 "191-1245", "191-1947"),
  edta_code_df2 = c("EDTA45", "EDTA78", "EDTA20", "EDTA66", "EDTA14", "EDTA74")),
  class = "data.frame", row.names = c(NA, -6L))
```
What I have tried:
```R
df3 <- df1 %>% select(subject_id, edta_code)
df3 <- full_join(df3, df2, by = "subject_id")
df$all_matching <- df3$edta_code == df3$edta_code_df2
```
The code above doesn't work; it gives this error:
```
Error in $<-.data.frame(*tmp*, all_matching, value = c(TRUE, FALSE,  :
  replacement has 7 rows, data has 5
```
Desired output:
[![enter image description here][1]][1]
where 0 is no and 1 is yes.
After that, I pull the subject ids that have a value of 0.
|subject_id|
|--------|
|191-6784|
|191-1245|
|191-1947|
|191-1945|
Note: although the last two subject_ids have the correct codes, the subject_ids themselves are incorrect.
[1]: https://i.stack.imgur.com/NFeTV.png
# Answer 1
**Score**: 1
```R
library(tidyverse)

df1 %>%
  select(-ipv) %>%
  full_join(df2) %>%
  mutate(matching = if_else(edta_code == edta_code_df2, 1, 0)) %>%
  replace_na(list(matching = 0))
#> Joining with `by = join_by(subject_id)`
#> # A tibble: 7 × 4
#>   subject_id edta_code edta_code_df2 matching
#>   <chr>      <chr>     <chr>            <dbl>
#> 1 191-5467   EDTA45    EDTA45               1
#> 2 191-6784   EDTA79    EDTA78               0
#> 3 191-3457   EDTA20    EDTA20               1
#> 4 191-0987   EDTA66    EDTA66               1
#> 5 191-1245   EDTA12    EDTA14               0
#> 6 191-1945   EDTA74    NA                   0
#> 7 191-1947   NA        EDTA74               0

# Pull the IDs
df1 %>%
  select(-ipv) %>%
  full_join(df2) %>%
  mutate(matching = if_else(edta_code == edta_code_df2, 1, 0)) %>%
  filter(!matching %in% c(1))
#> # A tibble: 4 × 4
#>   subject_id edta_code edta_code_df2 matching
#>   <chr>      <chr>     <chr>            <dbl>
#> 1 191-6784   EDTA79    EDTA78               0
#> 2 191-1245   EDTA12    EDTA14               0
#> 3 191-1945   EDTA74    NA                  NA
#> 4 191-1947   NA        EDTA74              NA
```
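The last two rows of the second result show `NA` rather than 0 because that pipeline has no `replace_na()` step, yet the filter still keeps them. A minimal base R sketch of why, using made-up vectors (`codes_df1`, `codes_df2` are illustrative names, not from the answer) and `ifelse()` as a stand-in for `if_else()`:

```R
# Illustrative vectors: one matching pair, then one NA on each side
codes_df1 <- c("EDTA45", "EDTA74", NA)
codes_df2 <- c("EDTA45", NA, "EDTA74")

# == propagates missing values instead of returning FALSE
eq <- codes_df1 == codes_df2
eq
#> [1] TRUE   NA   NA

# ifelse() passes the NA through, like if_else() in the answer
matching <- ifelse(eq, 1, 0)

# %in% never returns NA, so negating it keeps the NA rows
keep <- !matching %in% c(1)
keep
#> [1] FALSE  TRUE  TRUE
```

This is why `filter(!matching %in% c(1))` returns the rows with missing codes, whereas `filter(matching != 1)` would drop them, since `filter()` discards rows where the condition evaluates to `NA`.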
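# Answer 2
**Score**: 1
After your `full_join`, you need to deal with missing values in the `edta_code` or `edta_code_df2` columns. When checking for equality with `==`, you can add `!is.na()` conditions so that, if either or both of the two columns are missing, the `matching` column gets FALSE (0) instead of `NA`.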
```R
library(tidyverse)

df3 <- full_join(df1, df2, by = "subject_id") %>%
  select(-ipv) %>%
  mutate(matching = +(edta_code == edta_code_df2 & !is.na(edta_code) & !is.na(edta_code_df2)))

df3
#>   subject_id edta_code edta_code_df2 matching
#> 1   191-5467    EDTA45        EDTA45        1
#> 2   191-6784    EDTA79        EDTA78        0
#> 3   191-3457    EDTA20        EDTA20        1
#> 4   191-0987    EDTA66        EDTA66        1
#> 5   191-1245    EDTA12        EDTA14        0
#> 6   191-1945    EDTA74          <NA>        0
#> 7   191-1947      <NA>        EDTA74        0

df3 %>%
  filter(matching == 0) %>%
  pull(subject_id)
#> [1] "191-6784" "191-1245" "191-1945" "191-1947"
```
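For readers without the tidyverse, the same idea can be sketched in base R. This is an assumption-laden variant rather than part of the answer: it swaps `full_join()` for `merge(..., all = TRUE)`, rebuilds the question's data with `data.frame()` (leaving out the unused `ipv` column), and sorts the ids at the end so the result does not depend on row order.

```R
# Question data, rebuilt with data.frame(); ipv is omitted because it is unused
df1 <- data.frame(
  subject_id = c("191-5467", "191-6784", "191-3457", "191-0987", "191-1245", "191-1945"),
  edta_code  = c("EDTA45", "EDTA79", "EDTA20", "EDTA66", "EDTA12", "EDTA74"),
  stringsAsFactors = FALSE
)
df2 <- data.frame(
  subject_id    = c("191-5467", "191-6784", "191-3457", "191-0987", "191-1245", "191-1947"),
  edta_code_df2 = c("EDTA45", "EDTA78", "EDTA20", "EDTA66", "EDTA14", "EDTA74"),
  stringsAsFactors = FALSE
)

# all = TRUE keeps ids that appear in only one data frame (a full outer join)
merged <- merge(df1, df2, by = "subject_id", all = TRUE)

# A missing code on either side counts as a mismatch instead of propagating NA
merged$matching <- as.integer(
  !is.na(merged$edta_code) &
    !is.na(merged$edta_code_df2) &
    merged$edta_code == merged$edta_code_df2
)

# Sort so the output order is independent of how merge() arranges rows
mismatched_ids <- sort(merged$subject_id[merged$matching == 0])
mismatched_ids
#> [1] "191-1245" "191-1945" "191-1947" "191-6784"
```

Putting the `!is.na()` checks first works because `FALSE & NA` evaluates to `FALSE` in R, so the comparison never leaks an `NA` into `matching`.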