Test if strings from two datasets match by subject id


# Question

I am trying to do a simple match of two columns, but I need to match by subject id, and the output should be the subject ids that do not match between the two datasets.

Data:

    df1 <- structure(list(
      subject_id = c("191-5467", "191-6784", "191-3457", "191-0987", "191-1245", "191-1945"),
      edta_code = c("EDTA45", "EDTA79", "EDTA20", "EDTA66", "EDTA12", "EDTA74"),
      ipv = c("0", "3", "2", "2", "1", "2")),
      class = "data.frame", row.names = c(NA, -6L))

    df2 <- structure(list(
      subject_id = c("191-5467", "191-6784", "191-3457", "191-0987",
                     "191-1245", "191-1947"),
      edta_code_df2 = c("EDTA45", "EDTA78", "EDTA20", "EDTA66", "EDTA14", "EDTA74")),
      class = "data.frame", row.names = c(NA, -6L))

What I have tried:

    df3 <- df1 %>% select(subject_id, edta_code)

    df3 <- full_join(df3, df2, by = "subject_id")
    df$all_matching <- df3$edta_code == df3$edta_code_df2

The code above doesn't work; it gives this error:

    Error in $<-.data.frame(*tmp*, all_matching, value = c(TRUE, FALSE, :
    replacement has 7 rows, data has 5

Desired output:

[![enter image description here][1]][1]

where 0 is no and 1 is yes.
After that, I pull the subject ids with 0.

|subject_id|
|--------|
|191-6784|
|191-1245|
|191-1947|
|191-1945|

Note: although the last two subject_ids have the correct codes, the subject_ids themselves are incorrect.

  [1]: https://i.stack.imgur.com/NFeTV.png


# Answer 1
**Score**: 1

```R
library(tidyverse)

df1 %>%
  select(-ipv) %>%
  full_join(df2) %>%
  # 1 when the codes agree, 0 when they differ, NA when either code is missing
  mutate(matching = if_else(edta_code == edta_code_df2, 1, 0)) %>%
  # treat subjects found in only one dataset as non-matching
  replace_na(list(matching = 0))
```

```
Joining with `by = join_by(subject_id)`
# A tibble: 7 × 4
  subject_id edta_code edta_code_df2 matching
  <chr>      <chr>     <chr>            <dbl>
1 191-5467   EDTA45    EDTA45               1
2 191-6784   EDTA79    EDTA78               0
3 191-3457   EDTA20    EDTA20               1
4 191-0987   EDTA66    EDTA66               1
5 191-1245   EDTA12    EDTA14               0
6 191-1945   EDTA74    NA                   0
7 191-1947   NA        EDTA74               0
```

Pull the IDs:

```R
df1 %>%
  select(-ipv) %>%
  full_join(df2) %>%
  mutate(matching = if_else(edta_code == edta_code_df2, 1, 0)) %>%
  # keep every row that is not a confirmed match, including NA
  filter(!matching %in% c(1))
```

```
# A tibble: 4 × 4
  subject_id edta_code edta_code_df2 matching
  <chr>      <chr>     <chr>            <dbl>
1 191-6784   EDTA79    EDTA78               0
2 191-1245   EDTA12    EDTA14               0
3 191-1945   EDTA74    NA                  NA
4 191-1947   NA        EDTA74              NA
```

This second pipeline skips `replace_na()`, so `matching` stays `NA` for subjects present in only one dataset. Those rows are still kept because `NA %in% c(1)` is `FALSE`.
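
As a possible follow-up to the pipeline above, here is a minimal sketch that goes from the join straight to a character vector of subject ids. It assumes `dplyr::coalesce()` is an acceptable way to turn the `NA` comparison results into `FALSE`. The name `mismatched_ids` is only illustrative and does not come from the original answer.

```R
library(dplyr)

# Data copied from the question
df1 <- data.frame(
  subject_id = c("191-5467", "191-6784", "191-3457", "191-0987", "191-1245", "191-1945"),
  edta_code  = c("EDTA45", "EDTA79", "EDTA20", "EDTA66", "EDTA12", "EDTA74"),
  ipv        = c("0", "3", "2", "2", "1", "2"),
  stringsAsFactors = FALSE
)
df2 <- data.frame(
  subject_id    = c("191-5467", "191-6784", "191-3457", "191-0987", "191-1245", "191-1947"),
  edta_code_df2 = c("EDTA45", "EDTA78", "EDTA20", "EDTA66", "EDTA14", "EDTA74"),
  stringsAsFactors = FALSE
)

mismatched_ids <- df1 %>%
  select(subject_id, edta_code) %>%
  full_join(df2, by = "subject_id") %>%
  # edta_code == edta_code_df2 is NA when either side is missing;
  # coalesce() replaces that NA with FALSE, i.e. "no match"
  mutate(matching = as.integer(coalesce(edta_code == edta_code_df2, FALSE))) %>%
  filter(matching == 0) %>%
  pull(subject_id)

mismatched_ids
```

With the question's data this should return the same four subject ids listed in the desired output. Passing `by = "subject_id"` explicitly also avoids the "Joining with" message.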

# Answer 2
**Score**: 1

After your `full_join`, you will need to deal with missing values in either the `edta_code` or the `edta_code_df2` column. When checking for equality with `==`, you can add `!is.na()` conditions so that if either or both of those columns is missing, you get `FALSE` (or 0) in the `matching` column.

    library(tidyverse)

    df3 <- full_join(df1, df2, by = "subject_id") %>%
      select(-ipv) %>%
      # unary + converts TRUE/FALSE to 1/0
      mutate(matching = +(edta_code == edta_code_df2 & !is.na(edta_code) & !is.na(edta_code_df2)))

    df3

      subject_id edta_code edta_code_df2 matching
    1   191-5467    EDTA45        EDTA45        1
    2   191-6784    EDTA79        EDTA78        0
    3   191-3457    EDTA20        EDTA20        1
    4   191-0987    EDTA66        EDTA66        1
    5   191-1245    EDTA12        EDTA14        0
    6   191-1945    EDTA74          <NA>        0
    7   191-1947      <NA>        EDTA74        0

    df3 %>%
      filter(matching == 0) %>%
      pull(subject_id)

    [1] "191-6784" "191-1245" "191-1945" "191-1947"
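
To illustrate the `NA` handling described above without the tidyverse, here is a rough base R sketch. It assumes `merge(..., all = TRUE)` is an acceptable stand-in for `full_join()`. The names `merged`, `same`, and `bad_ids` are made up for this example. It also assigns the new column to the same data frame that was joined. The error in the question appears to come from assigning a 7-element vector to a different, 5-row object named `df` rather than to `df3`.

```R
# Data copied from the question
df1 <- data.frame(
  subject_id = c("191-5467", "191-6784", "191-3457", "191-0987", "191-1245", "191-1945"),
  edta_code  = c("EDTA45", "EDTA79", "EDTA20", "EDTA66", "EDTA12", "EDTA74"),
  ipv        = c("0", "3", "2", "2", "1", "2"),
  stringsAsFactors = FALSE
)
df2 <- data.frame(
  subject_id    = c("191-5467", "191-6784", "191-3457", "191-0987", "191-1245", "191-1947"),
  edta_code_df2 = c("EDTA45", "EDTA78", "EDTA20", "EDTA66", "EDTA14", "EDTA74"),
  stringsAsFactors = FALSE
)

# How NA behaves in comparisons
"EDTA74" == NA   # NA, not FALSE
NA & FALSE       # FALSE, which is why the extra !is.na() checks work

# Outer join in base R; all = TRUE keeps subjects found in only one dataset
merged <- merge(df1[, c("subject_id", "edta_code")], df2,
                by = "subject_id", all = TRUE)

# Element-wise comparison; NA wherever one of the codes is missing
same <- merged$edta_code == merged$edta_code_df2

# Count a row as matching only when the comparison is a definite TRUE
merged$matching <- as.integer(!is.na(same) & same)

bad_ids <- merged$subject_id[merged$matching == 0]
bad_ids
```

Note that `merge()` sorts by the key column by default, so the ids may come back in a different order than with `full_join()`.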

huangapple
  • Posted on June 8, 2023 at 17:53:40
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76430642.html