I'm trying to join two datasets based on the values of two variables. Both datasets have the same variable names and number of columns but may have a different number of rows. I want to join them on a grouping variable ("SampleID") and a variable that contains text strings ("PrimConstruct"). I want to use fuzzy matching to account for slight misspellings and differences in capitalization in "PrimConstruct". There's also the chance that one dataset has a row with a text value that the other dataset does not, in which case I would like it to be left as a separate row.

    df1 <- data.frame(SampleID_a = c("abc0101", "abc0101", "bcd0201", 
                      "bcd0201"), PrimConstruct_a = c("cohesion", "cognition", 
                      "cohesion", "cognition")) 
    df2 <- data.frame(SampleID_b = c("abc0101", "abc0101", "bcd0201", "bcd0201", 
                      "bcd0201"), PrimConstruct_b = c("cohesion", "cognition", 
                      "commitment", "Cohesion", "cognitiion")) 
    # df2 has misspelling, different capitalization, 
    # and entry with no close match

I would like the fuzzy matching to only occur within the same SampleID, so "cohesion" for abc0101 will not be matched with "cohesion" for bcd0201. I am eventually hoping to have a dataset that keeps both PrimConstruct columns and all the values, but the matched values are in the same row:

    desireddf <- data.frame(SampleID_a = c("abc0101", "abc0101", "bcd0201", "bcd0201", "bcd0201"),
                            PrimConstruct_a = c("cohesion", "cognition", "cohesion", "cognition", NA),
                            PrimConstruct_b = c("cohesion", "cognition", "Cohesion", "cognitiion", "commitment"))

I have tried stringdist_join but it ends up duplicating and adding a bunch of rows somehow.

    joined <- stringdist_join(df1, df2,
                             by = c("PrimConstruct_a" = "PrimConstruct_b",
                                    "SampleID_a" = "SampleID_b"),
                             mode = "full",
                             method = "jw",
                             max_dist = 2,
                             ignore_case = T)

I'm a little confused by the different matching methods, but I don't expect differences in text strings to be much more than minor spelling mistakes and differences in capitalization, which I believe should be covered by ignore_case.


String-distance matching across multiple columns, with a different match function per column (in your case one equi join and one fuzzy join), can be done with the `fuzzyjoin` package like this:

    library(dplyr)  # attaches the %>% pipe used below

    # dummy data ... I introduced a non-matching case
    df1 <- data.frame(SampleID_a = c("abc0101", "abc0101", "bcd0201", "bcd0201", "proof"), 
                      PrimConstruct_a = c("cohesion", "cognition", "cohesion", "cognition", "00000")) 
    df2 <- data.frame(SampleID_b = c("abc0101", "abc0101", "bcd0201", "bcd0201", "bcd0201"), 
                      PrimConstruct_b = c("cohesion", "cognition", "commitment", "Cohesion", "cognitiion")) 

    # build row identification
    dplyr::mutate(df1, rn = dplyr::row_number()) %>%
        # perform join (a left join, which is what the output below shows)
        fuzzyjoin::fuzzy_left_join(df2,
                                   # define join columns
                                   by = c("SampleID_a" = "SampleID_b",
                                          "PrimConstruct_a" = "PrimConstruct_b"),
                                   # list of match functions (the first one tests exact equality)
                                   match_fun = list(`==`, 
                                                    # returns a logical vector: TRUE where the Levenshtein distance is below 2 (at most 1)
                                                    function(x, y) stringdist::stringdist(x, y, method = "lv") < 2))

      SampleID_a PrimConstruct_a rn SampleID_b PrimConstruct_b
    1    abc0101        cohesion  1    abc0101        cohesion
    2    abc0101       cognition  2    abc0101       cognition
    3    bcd0201        cohesion  3    bcd0201        Cohesion
    4    bcd0201       cognition  4    bcd0201      cognitiion
    5      proof           00000  5       <NA>            <NA>

As you might have noticed, the match can return more than one row per df1 row, since several candidates may satisfy the condition. So you can now work on the data group-wise, which is why I introduced "rn", as you might want only the closest match. To show you how to get there, here is an alternative approach that does not use `fuzzyjoin` but the `stringdist` package (which `fuzzyjoin` is based on), as it might help you understand the workings better:

    # build a new row-number variable to identify each df1 row uniquely
    res <- dplyr::mutate(df1, rn = dplyr::row_number()) %>%
        # left join by the column that matches exactly (all from df1 and all matching from df2)
        dplyr::left_join(df2, by = c("SampleID_a" = "SampleID_b")) %>%
        # string distance calculation; set the result to 0 if NA (no left join match) for the next step
        dplyr::mutate(dist = dplyr::coalesce(stringdist::stringdist(PrimConstruct_a,
                                                                    PrimConstruct_b), 0)) %>%
        # group by the df1 row id
        dplyr::group_by(rn) %>%
        # get the first row ordered by the calculated dist column 
        # this is why dist was set to 0 if NA before
        # with_ties = FALSE returns only one match even if two have the same distance
        dplyr::slice_min(order_by = dist, n = 1, with_ties = FALSE) %>% 
        # release the grouping to prevent unwanted behaviour downstream
        dplyr::ungroup() %>%
        # set dist back to NA where there was no match (you might select or perform more calculations here)
        dplyr::mutate(dist = ifelse(is.na(PrimConstruct_b), NA, dist))

    # A tibble: 5 × 5
      SampleID_a PrimConstruct_a    rn PrimConstruct_b  dist
      <chr>      <chr>           <int> <chr>           <dbl>
    1 abc0101    cohesion            1 cohesion            0
    2 abc0101    cognition           2 cognition           0
    3 bcd0201    cohesion            3 Cohesion            1
    4 bcd0201    cognition           4 cognitiion          1
    5 proof      00000               5 NA                 NA

Depending on your use case there are some possible tweaks:
1) You could use tolower() or toupper() (the result is the same) in the matching function of the fuzzyjoin, or apply it to PrimConstruct_a and PrimConstruct_b before the join or distance calculation. This removes all string distance caused by upper vs. lower case and leaves only letter differences to be counted.

2) Read up on the different string distance functions that the `stringdist` package, and therefore `fuzzyjoin`, implements. There are some differences which might be helpful depending on the use case.
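
To make both tweaks concrete, here is a small sketch using only `stringdist` on strings from the example data. The helper name `ci_close` and its cutoff of 1 are my own choices for illustration, not something from either package. It also shows why `method = "jw"` with `max_dist = 2` matches everything: Jaro-Winkler distances are scaled to lie between 0 and 1.

```r
library(stringdist)

a <- c("cohesion", "cognition")
b <- c("Cohesion", "cognitiion")

# Levenshtein counts single-character edits, so the capital "C" costs 1
lv_raw <- stringdist(a, b, method = "lv")

# lowercasing first removes the case difference and leaves only the real typo
lv_lower <- stringdist(tolower(a), tolower(b), method = "lv")

# Jaro-Winkler never exceeds 1, so a cutoff of 2 accepts every pair,
# even clearly different words such as these
jw_any <- stringdist("cohesion", "commitment", method = "jw")

# hypothetical case-insensitive match function, usable as the second
# element of match_fun in fuzzyjoin::fuzzy_left_join()
ci_close <- function(x, y) {
  stringdist(tolower(x), tolower(y), method = "lv") <= 1
}
```

With `method = "jw"` you would therefore want a cutoff well below 1, while for the edit-based methods such as `"lv"` and `"osa"` the cutoff is a count of edits.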


You are looking for something like a full join, though it is not that easy. One way to solve it is to assign the result of the previous step to a new variable (I called it "res"), identify the unused cases from df2, and union them back onto the result like this:

    mis <- df2 %>% 
        # an anti join from df2 to the result identifies the unmatched cases
        dplyr::anti_join(res, by = c("SampleID_b" = "SampleID_a", 
                                     "PrimConstruct_b")) %>% 
        # give the data the same columns (and types) as res so the two can be unioned/bound
        dplyr::transmute(SampleID_a = SampleID_b, 
                         PrimConstruct_a = NA_character_, 
                         rn = NA_integer_, 
                         PrimConstruct_b,
                         dist = NA_real_)

      SampleID_a PrimConstruct_a rn PrimConstruct_b dist
    1    bcd0201              NA NA      commitment   NA

    dplyr::union(res, mis)
    # A tibble: 6 x 5
      SampleID_a PrimConstruct_a    rn PrimConstruct_b  dist
      <chr>      <chr>           <int> <chr>           <dbl>
    1 abc0101    cohesion            1 cohesion            0
    2 abc0101    cognition           2 cognition           0
    3 bcd0201    cohesion            3 Cohesion            1
    4 bcd0201    cognition           4 cognitiion          1
    5 proof      00000               5 NA                 NA
    6 bcd0201    NA                 NA commitment         NA
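
If you need this more than once, the steps above can be collected into one function. This is only a sketch under a few assumptions of mine: the name `match_within_sample` and the `max_dist` argument are hypothetical, the comparison is lowercased, and a cutoff is added. The cutoff matters because `slice_min` simply takes the smallest distance within a SampleID, so without it a df1 row always gets some partner as long as its SampleID exists in df2, however different the text is.

```r
library(dplyr)

df1 <- data.frame(SampleID_a = c("abc0101", "abc0101", "bcd0201", "bcd0201", "proof"),
                  PrimConstruct_a = c("cohesion", "cognition", "cohesion", "cognition", "00000"))
df2 <- data.frame(SampleID_b = c("abc0101", "abc0101", "bcd0201", "bcd0201", "bcd0201"),
                  PrimConstruct_b = c("cohesion", "cognition", "commitment", "Cohesion", "cognitiion"))

# hypothetical helper: closest case-insensitive match per df1 row within the
# same SampleID, rejected if further away than max_dist edits
match_within_sample <- function(df1, df2, max_dist = 1) {
  matched <- df1 %>%
    mutate(rn = row_number()) %>%
    # pair every df1 row with all df2 rows of the same SampleID
    left_join(df2, by = c("SampleID_a" = "SampleID_b")) %>%
    # Levenshtein distance on lowercased text; Inf marks "no candidate"
    mutate(dist = coalesce(stringdist::stringdist(tolower(PrimConstruct_a),
                                                  tolower(PrimConstruct_b),
                                                  method = "lv"), Inf)) %>%
    group_by(rn) %>%
    slice_min(order_by = dist, n = 1, with_ties = FALSE) %>%
    ungroup() %>%
    # drop partners beyond the cutoff
    mutate(PrimConstruct_b = if_else(dist <= max_dist, PrimConstruct_b, NA_character_),
           dist = if_else(dist <= max_dist, dist, NA_real_))

  # df2 rows that were not used as a partner become rows of their own
  unmatched <- df2 %>%
    anti_join(filter(matched, !is.na(PrimConstruct_b)),
              by = c("SampleID_b" = "SampleID_a",
                     "PrimConstruct_b" = "PrimConstruct_b")) %>%
    transmute(SampleID_a = SampleID_b,
              PrimConstruct_a = NA_character_,
              rn = NA_integer_,
              PrimConstruct_b,
              dist = NA_real_)

  bind_rows(matched, unmatched)
}

out <- match_within_sample(df1, df2)
```

Note that two df1 rows can still end up with the same df2 partner; if that must not happen you would need an extra de-duplication step on the df2 side.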


