June 8, 2023, 17:53:40

Test if strings from two datasets match using the subject id columns from the two datasets

Question

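I am trying to do a simple matching of two columns, but I need to match by subject id, and the output should be the subject ids that do not match between the two datasets.
Data: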

df1 <- structure(list(
subject_id = c("191-5467", "191-6784", "191-3457", "191-0987", "191-1245", "191-1945"),
edta_code = c("EDTA45", "EDTA79", "EDTA20", "EDTA66", "EDTA12", "EDTA74"),
ipv = c("0", "3", "2", "2", "1", "2")),
class = "data.frame", row.names = c(NA, -6L))

df2 <- structure(list(
subject_id = c("191-5467", "191-6784", "191-3457", "191-0987",
"191-1245", "191-1947"),
edta_code_df2 = c("EDTA45", "EDTA78", "EDTA20", "EDTA66", "EDTA14", "EDTA74")), class = "data.frame", row.names = c(NA, -6L))


What I have tried:

df3 <- df1 %>% select(subject_id, edta_code)

df3 <- full_join(df3, df2, by = "subject_id")
df$all_matching <- df3$edta_code == df3$edta_code_df2


The above code doesn't work; it gives this error:

Error in $<-.data.frame(*tmp*, all_matching, value = c(TRUE, FALSE, :
replacement has 7 rows, data has 5


Desired output:
[![enter image description here][1]][1]
where 0 is no and 1 is yes.
After that, I pull the subject ids that have a 0.
|subject_id|
|--------|
|191-6784|
|191-1245|
|191-1947|
|191-1945|
Note: although the last two subject_ids have the correct codes, the subject_ids themselves are incorrect.
[1]: https://i.stack.imgur.com/NFeTV.png
# Answer 1
**Score**: 1
```R
library(tidyverse)

df1 %>%
  select(-ipv) %>%
  full_join(df2) %>%
  mutate(matching = if_else(edta_code == edta_code_df2, 1, 0)) %>%
  replace_na(list(matching = 0))
```

```
Joining with `by = join_by(subject_id)`
# A tibble: 7 × 4
  subject_id edta_code edta_code_df2 matching
  <chr>      <chr>     <chr>            <dbl>
1 191-5467   EDTA45    EDTA45               1
2 191-6784   EDTA79    EDTA78               0
3 191-3457   EDTA20    EDTA20               1
4 191-0987   EDTA66    EDTA66               1
5 191-1245   EDTA12    EDTA14               0
6 191-1945   EDTA74    NA                   0
7 191-1947   NA        EDTA74               0
```

Pull the IDs:

```R
df1 %>%
  select(-ipv) %>%
  full_join(df2) %>%
  mutate(matching = if_else(edta_code == edta_code_df2, 1, 0)) %>%
  filter(!matching %in% c(1))
```

```
# A tibble: 4 × 4
  subject_id edta_code edta_code_df2 matching
  <chr>      <chr>     <chr>            <dbl>
1 191-6784   EDTA79    EDTA78               0
2 191-1245   EDTA12    EDTA14               0
3 191-1945   EDTA74    NA                  NA
4 191-1947   NA        EDTA74              NA
```


# Answer 2
**Score**: 1

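After your full_join, you need to deal with missing values in the edta_code or edta_code_df2 columns. When checking for equality with ==, you can add !is.na() conditions so that if either or both of those columns are missing, the matching column gets FALSE (0) instead of NA.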
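To see why the extra check matters, here is a minimal base R sketch. The vectors `a` and `b` are made up for illustration and are not the question's data; they only mimic the three cases a full join can produce (both codes present, right side missing, left side missing).

```R
# Hypothetical vectors standing in for the two code columns after a full join
a <- c("EDTA45", "EDTA74", NA)
b <- c("EDTA45", NA, "EDTA74")

# Comparing with NA yields NA rather than FALSE
raw_comparison <- a == b

# Adding the !is.na() guards turns every NA comparison into FALSE;
# the unary + converts the logical vector to 0/1 integers
matching <- +(a == b & !is.na(a) & !is.na(b))

print(raw_comparison)
print(matching)
```

Only the first pair counts as a match; the two pairs with a missing side are reported as 0 rather than NA.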

library(tidyverse)
df3 <- full_join(df1, df2, by = "subject_id") %>%
  select(-ipv) %>%
  mutate(matching = +(edta_code == edta_code_df2 & !is.na(edta_code) & !is.na(edta_code_df2)))
df3
  subject_id edta_code edta_code_df2 matching
1   191-5467    EDTA45        EDTA45        1
2   191-6784    EDTA79        EDTA78        0
3   191-3457    EDTA20        EDTA20        1
4   191-0987    EDTA66        EDTA66        1
5   191-1245    EDTA12        EDTA14        0
6   191-1945    EDTA74          <NA>        0
7   191-1947      <NA>        EDTA74        0
df3 %>%
  filter(matching == 0) %>%
  pull(subject_id)
[1] "191-6784" "191-1245" "191-1945" "191-1947"
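
If you would rather not load the tidyverse, the same idea can be sketched in base R. This is an alternative under the assumption that `merge(..., all = TRUE)` is an acceptable stand-in for `full_join`; it is not part of the answer above. The names `merged`, `same_code` and `mismatched_ids` are chosen here for illustration.

```R
# Data copied from the question
df1 <- data.frame(
  subject_id = c("191-5467", "191-6784", "191-3457", "191-0987", "191-1245", "191-1945"),
  edta_code  = c("EDTA45", "EDTA79", "EDTA20", "EDTA66", "EDTA12", "EDTA74"),
  ipv        = c("0", "3", "2", "2", "1", "2"),
  stringsAsFactors = FALSE
)
df2 <- data.frame(
  subject_id    = c("191-5467", "191-6784", "191-3457", "191-0987", "191-1245", "191-1947"),
  edta_code_df2 = c("EDTA45", "EDTA78", "EDTA20", "EDTA66", "EDTA14", "EDTA74"),
  stringsAsFactors = FALSE
)

# Full outer join on subject_id; ids present in only one data frame
# get NA in the other data frame's code column
merged <- merge(df1[c("subject_id", "edta_code")], df2,
                by = "subject_id", all = TRUE)

# Element-wise comparison; NA wherever one side is missing
same_code <- merged$edta_code == merged$edta_code_df2

# Treat both FALSE and NA as a mismatch
mismatched_ids <- merged$subject_id[is.na(same_code) | !same_code]
print(mismatched_ids)
```

Note that `merge` sorts rows by the key by default, so the ids may come back in a different order than in the dplyr version.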
