test if strings from two datasets match using subject id columns from twi datasets by subject id

huangapple go评论87阅读模式
英文:

test if strings from two datasets match using subject id columns from twi datasets by subject id

问题

以下是您要翻译的内容:

  1. 我正在尝试对两列进行简单匹配,但我需要按subject id进行匹配,并且输出应该是两个数据集中不匹配的subject id
  2. 数据:

df1 <- structure(list(
subject_id = c("191-5467", "191-6784", "191-3457", "191-0987", "191-1245", "191-1945"),
edta_code = c("EDTA45", "EDTA79", "EDTA20", "EDTA66", "EDTA12", "EDTA74"),
ipv = c("0", "3", "2", "2", "1", "2")),
class = "data.frame", row.names = c(NA, -6L))

df2 <- structure(list(
subject_id = c("191-5467", "191-6784", "191-3457", "191-0987",
"191-1245", "191-1947"),
edta_code_df2 = c("EDTA45", "EDTA78", "EDTA20", "EDTA66", "EDTA14", "EDTA74")), class = "data.frame", row.names = c(NA, -6L))

  1. 我尝试过的内容:

df3 <- df1 %>% select(subject_id, edta_code)

df3 <- full_join(df3, df2, by = "subject_id")
df$all_matching <- df3$edta_code == df3$edta_code_df2

  1. 上面的代码不起作用,它会产生错误:

Error in $<-.data.frame(*tmp*, all_matching, value = c(TRUE, FALSE, :
replacement has 7 rows, data has 5

  1. 期望的输出:
  2. [![enter image description here][1]][1]
  3. 其中0表示否,1表示是。
  4. 然后在那之后,我提取了subject id的值为0的记录。
  5. |subject_id|
  6. |--------|
  7. |191-6784|
  8. |191-1245|
  9. |191-1947|
  10. |191-1945|
  11. 注意:尽管最后两个subject_id具有正确的代码,但subject_id是不正确的。
  12. [1]: https://i.stack.imgur.com/NFeTV.png
  13. <details>
  14. <summary>英文:</summary>
  15. I am trying to do a simple matching of two columns, but I need it to match by subject id and the output should be the subject id&#39;s which do not match in the two datasets.
  16. data:

df1 <- structure (list(
subject_id = c("191-5467", "191-6784", "191-3457", "191-0987", "191-1245", "191-1945"),
edta_code = c("EDTA45", "EDTA79", "EDTA20", "EDTA66", "EDTA12", "EDTA74"),
ipv = c("0", "3", "2", "2", "1", "2")),
class = "data.frame", row.names = c(NA, -6L))

df2 <- structure (list(
subject_id = c("191-5467", "191-6784", "191-3457", "191-0987",
"191-1245", "191-1947"),
edta_code_df2 = c("EDTA45", "EDTA78", "EDTA20", "EDTA66", "EDTA14", "EDTA74")), class = "data.frame", row.names = c(NA, -6L))

  1. what I have tried

df3 <- df1 %>% select(subject_id, edta_code)

df3 <- full_join(df3, df2, by = "subject_id")
df$all_matching <- df3$edta_code == df3$edta_code_df2

  1. The above code doesn&#39;t work, it gives error:

Error in $&lt;-.data.frame(*tmp*, all_matching, value = c(TRUE, FALSE, :
replacement has 7 rows, data has 5

  1. desired output:
  2. [![enter image description here][1]][1]
  3. where 0 is no and 1 is yes.
  4. Then after that I pull the subject id&#39;s with 0.
  5. |subject_id|
  6. |--------|
  7. |191-6784|
  8. |191-1245|
  9. |191-1947|
  10. |191-1945|
  11. Note: although the last two subject_ids have the correct codes the subject_ids are incorrect.
  12. [1]: https://i.stack.imgur.com/NFeTV.png
  13. </details>
  14. # 答案1
  15. **得分**: 1
  16. ```R
  17. library(tidyverse)
  18. df1 %>%
  19. select(-ipv) %>%
  20. full_join(df2) %>%
  21. mutate(matching = if_else(edta_code == edta_code_df2, 1, 0)) %>%
  22. replace_na(list(matching = 0))
  1. Joining with `by = join_by(subject_id)`
  2. # A tibble: 7 × 4
  3. subject_id edta_code edta_code_df2 matching
  4. <chr> <chr> <chr> <dbl>
  5. 1 191-5467 EDTA45 EDTA45 1
  6. 2 191-6784 EDTA79 EDTA78 0
  7. 3 191-3457 EDTA20 EDTA20 1
  8. 4 191-0987 EDTA66 EDTA66 1
  9. 5 191-1245 EDTA12 EDTA14 0
  10. 6 191-1945 EDTA74 NA 0
  11. 7 191-1947 NA EDTA74 0
  1. Pull the IDs
  2. df1 %>%
  3. select(-ipv) %>%
  4. full_join(df2) %>%
  5. mutate(matching = if_else(edta_code == edta_code_df2, 1, 0)) %>%
  6. filter(!matching %in% c(1))
  7. # A tibble: 4 × 4
  8. subject_id edta_code edta_code_df2 matching
  9. <chr> <chr> <chr> <dbl>
  10. 1 191-6784 EDTA79 EDTA78 0
  11. 2 191-1245 EDTA12 EDTA14 0
  12. 3 191-1945 EDTA74 NA NA
  13. 4 191-1947 NA EDTA74 NA
英文:
  1. library(tidyverse)
  2. df1 %&gt;%
  3. select(-ipv) %&gt;%
  4. full_join(df2) %&gt;%
  5. mutate(matching = if_else(edta_code == edta_code_df2, 1, 0)) %&gt;%
  6. replace_na(list(matching = 0))
  7. Joining with `by = join_by(subject_id)`
  8. # A tibble: 7 &#215; 4
  9. subject_id edta_code edta_code_df2 matching
  10. &lt;chr&gt; &lt;chr&gt; &lt;chr&gt; &lt;dbl&gt;
  11. 1 191-5467 EDTA45 EDTA45 1
  12. 2 191-6784 EDTA79 EDTA78 0
  13. 3 191-3457 EDTA20 EDTA20 1
  14. 4 191-0987 EDTA66 EDTA66 1
  15. 5 191-1245 EDTA12 EDTA14 0
  16. 6 191-1945 EDTA74 NA 0
  17. 7 191-1947 NA EDTA74 0

Pull the IDs

  1. df1 %&gt;%
  2. select(-ipv) %&gt;%
  3. full_join(df2) %&gt;%
  4. mutate(matching = if_else(edta_code == edta_code_df2, 1, 0)) %&gt;%
  5. filter(!matching %in% c(1))
  6. # A tibble: 4 &#215; 4
  7. subject_id edta_code edta_code_df2 matching
  8. &lt;chr&gt; &lt;chr&gt; &lt;chr&gt; &lt;dbl&gt;
  9. 1 191-6784 EDTA79 EDTA78 0
  10. 2 191-1245 EDTA12 EDTA14 0
  11. 3 191-1945 EDTA74 NA NA
  12. 4 191-1947 NA EDTA74 NA

答案2

得分: 1

在你的full_join之后,你需要处理edta_codeedta_code_df2列中的缺失值。在使用==进行相等性检查时,你可以添加!is.na()条件,这样如果这两列中的任何一列或两列都缺失,matching列将会得到FALSE或0。

  1. library(tidyverse)
  2. df3 <- full_join(df1, df2, by = "subject_id") %>%
  3. select(-ipv) %>%
  4. mutate(matching = +(edta_code == edta_code_df2 & !is.na(edta_code) & !is.na(edta_code_df2)))
  5. df3
  6. subject_id edta_code edta_code_df2 matching
  7. 1 191-5467 EDTA45 EDTA45 1
  8. 2 191-6784 EDTA79 EDTA78 0
  9. 3 191-3457 EDTA20 EDTA20 1
  10. 4 191-0987 EDTA66 EDTA66 1
  11. 5 191-1245 EDTA12 EDTA14 0
  12. 6 191-1945 EDTA74 <NA> 0
  13. 7 191-1947 <NA> EDTA74 0
  14. df3 %>%
  15. filter(matching == 0) %>%
  16. pull(subject_id)
  17. [1] "191-6784" "191-1245" "191-1945" "191-1947"
英文:

After your full_join, you will need to deal with missing in either edta_code or edta_code_df2 columns. When checking for equality with == you can add !is.na() so that if either or both of those two columns is missing you will get FALSE or 0 for the matching column.

  1. library(tidyverse)
  2. df3 &lt;- full_join(df1, df2, by = &quot;subject_id&quot;) %&gt;%
  3. select(-ipv) %&gt;%
  4. mutate(matching = +(edta_code == edta_code_df2 &amp; !is.na(edta_code) &amp; !is.na(edta_code_df2)))
  5. df3
  6. subject_id edta_code edta_code_df2 matching
  7. 1 191-5467 EDTA45 EDTA45 1
  8. 2 191-6784 EDTA79 EDTA78 0
  9. 3 191-3457 EDTA20 EDTA20 1
  10. 4 191-0987 EDTA66 EDTA66 1
  11. 5 191-1245 EDTA12 EDTA14 0
  12. 6 191-1945 EDTA74 &lt;NA&gt; 0
  13. 7 191-1947 &lt;NA&gt; EDTA74 0
  14. df3 %&gt;%
  15. filter(matching == 0) %&gt;%
  16. pull(subject_id)
  17. [1] &quot;191-6784&quot; &quot;191-1245&quot; &quot;191-1945&quot; &quot;191-1947&quot;

huangapple
  • 本文由 发表于 2023年6月8日 17:53:40
  • 转载请务必保留本文链接:https://go.coder-hub.com/76430642.html
匿名

发表评论

匿名网友

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen:

确定