Joining dataframes on text strings using fuzzy string matching (stringdist_join())

Question

I'm trying to join two datasets based on the values of two variables. Both datasets have the same variable names/number of columns but may have a different number of rows. I want to join them based on a grouping variable ("SampleID") and a variable that contains text strings ("PrimConstruct"). I want to use fuzzy matching to account for slight misspellings and differences in capitalization in "PrimConstruct". There's also the chance that one dataset has a row with a text value that the other dataset does not, in which case I would like it to leave that as a separate row.

df1 <- data.frame(SampleID_a = c("abc0101", "abc0101", "bcd0201", 
                  "bcd0201"), PrimConstruct_a = c("cohesion", "cognition", 
                  "cohesion", "cognition")) 
df2 <- data.frame(SampleID_b = c("abc0101", "abc0101", "bcd0201", "bcd0201", 
                  "bcd0201"), PrimConstruct_b = c("cohesion", "cognition", 
                  "commitment", "Cohesion", "cognitiion")) 
# df2 has misspelling, different capitalization, 
# and entry with no close match

I would like the fuzzy matching to only occur within the same SampleID, so "cohesion" for abc0101 will not be matched with "cohesion" for bcd0201. I am eventually hoping to have a dataset that keeps both PrimConstruct columns and all the values, but the matched values are in the same row:

desireddf <- data.frame(SampleID_a = c("abc0101", "abc0101", "bcd0201", "bcd0201", "bcd0201"),
                        PrimConstruct_a = c("cohesion", "cognition", "cohesion", "cognition", NA),
                        PrimConstruct_b = c("cohesion", "cognition", "Cohesion", "cognitiion", "commitment"))

I have tried stringdist_join but it ends up duplicating and adding a bunch of rows somehow.

joined <- stringdist_join(df1,
                         df2,
                         by = c("PrimConstruct_a" = "PrimConstruct_b",
                                "SampleID_a" = "SampleID_b"),
                         mode = "full",
                         method = "jw",
                         max_dist = 2,
                         ignore_case = T)

I'm a little confused by the different matching methods, but I don't expect differences in text strings to be much more than minor spelling mistakes and differences in capitalization, which I believe should be covered by ignore_case.

Answer 1

Score: 1

You might use {powerjoin}:

df1 <- data.frame(SampleID_a = c("abc0101", "abc0101", "bcd0201", 
                                 "bcd0201"), PrimConstruct_a = c("cohesion", "cognition", 
                                                             "cohesion", "cognition")) 
df2 <- data.frame(SampleID_b = c("abc0101", "abc0101", "bcd0201", "bcd0201", 
                                 "bcd0201"), PrimConstruct_b = c("cohesion", "cognition", 
                                                             "commitment", "Cohesion", "cognitiion")) 

powerjoin::power_full_join(df1, df2, by = c(SampleID_a = "SampleID_b", ~ stringdist::stringdist(.x$PrimConstruct_a, .y$PrimConstruct_b) < 2))
#>   SampleID_a PrimConstruct_a PrimConstruct_b
#> 1    abc0101        cohesion        cohesion
#> 2    abc0101       cognition       cognition
#> 3    bcd0201        cohesion        Cohesion
#> 4    bcd0201       cognition      cognitiion
#> 5    bcd0201            <NA>      commitment

<sup>Created on 2023-03-16 with reprex v2.0.2</sup>

Note that I renamed the second col of df2 to PrimConstruct_b
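
The call above treats a capitalization difference as one edit, so a value that differs in both case and spelling could fall outside the `< 2` cutoff. As a hedged variation (the `tolower()` calls and the `<= 1` cutoff are my assumptions, not part of the answer), the same formula can lower-case both sides before measuring the distance:

```r
# Data from the question
df1 <- data.frame(
  SampleID_a = c("abc0101", "abc0101", "bcd0201", "bcd0201"),
  PrimConstruct_a = c("cohesion", "cognition", "cohesion", "cognition")
)
df2 <- data.frame(
  SampleID_b = c("abc0101", "abc0101", "bcd0201", "bcd0201", "bcd0201"),
  PrimConstruct_b = c("cohesion", "cognition", "commitment", "Cohesion", "cognitiion")
)

# Same join as above, but case is ignored and at most one edit is allowed (assumed cutoff)
res_ci <- powerjoin::power_full_join(
  df1, df2,
  by = c(
    SampleID_a = "SampleID_b",
    ~ stringdist::stringdist(tolower(.x$PrimConstruct_a),
                             tolower(.y$PrimConstruct_b)) <= 1
  )
)
res_ci
```

On this data the matched pairs should be the same as in the output above; the difference only shows up once a value has both a case change and a typo.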

Answer 2

Score: 0

<details>
<summary>English:</summary>

A join on multiple columns with a different match function per column (in your case one equi-join and one fuzzy join) can be done like this with the `fuzzyjoin` package:

    library(dplyr)  # attaches the %>% pipe used below

    # dummy data ... I introduced a non-matching case
    df1 <- data.frame(SampleID_a = c("abc0101", "abc0101", "bcd0201", "bcd0201", "proof"), 
                      PrimConstruct_a = c("cohesion", "cognition", "cohesion", "cognition", "00000")) 
    df2 <- data.frame(SampleID_b = c("abc0101", "abc0101", "bcd0201", "bcd0201", "bcd0201"), 
                      PrimConstruct_b = c("cohesion", "cognition", "commitment", "Cohesion", "cognitiion")) 

    # build row identification
    dplyr::mutate(df1, rn = dplyr::row_number()) %>%
        # perform join
        fuzzyjoin::fuzzy_left_join(df2,
                                   # define join columns
                                   by = c("SampleID_a" = "SampleID_b",
                                          "PrimConstruct_a" = "PrimConstruct_b"),
                                   # list of match functions, one per column pair (the first is an exact match)
                                   match_fun = list(`==`, 
                                                    # returns a logical vector: TRUE where the Levenshtein distance is below 2 (at most one edit)
                                                    function(x, y) 
                                                        stringdist::stringdist(x, y, method = "lv") < 2)
         )

      SampleID_a PrimConstruct_a rn SampleID_b PrimConstruct_b
    1    abc0101        cohesion  1    abc0101        cohesion
    2    abc0101       cognition  2    abc0101       cognition
    3    bcd0201        cohesion  3    bcd0201        Cohesion
    4    bcd0201       cognition  4    bcd0201      cognitiion
    5      proof           00000  5       <NA>            <NA>
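
If one row per match (rather than only the closest match) is acceptable, the same column-wise match functions can also be passed to `fuzzyjoin::fuzzy_full_join()`, which keeps unmatched rows from both sides. A minimal sketch using the data from the question; the helper name `close_enough`, the `tolower()` calls and the cutoff of one edit are my assumptions:

```r
# Data from the question
df1 <- data.frame(
  SampleID_a = c("abc0101", "abc0101", "bcd0201", "bcd0201"),
  PrimConstruct_a = c("cohesion", "cognition", "cohesion", "cognition")
)
df2 <- data.frame(
  SampleID_b = c("abc0101", "abc0101", "bcd0201", "bcd0201", "bcd0201"),
  PrimConstruct_b = c("cohesion", "cognition", "commitment", "Cohesion", "cognitiion")
)

# Hypothetical helper: TRUE where two strings are at most one
# Levenshtein edit apart, ignoring case
close_enough <- function(x, y) {
  stringdist::stringdist(tolower(x), tolower(y), method = "lv") <= 1
}

# Exact match on SampleID, fuzzy match on PrimConstruct, full join
full <- fuzzyjoin::fuzzy_full_join(
  df1, df2,
  by = c("SampleID_a" = "SampleID_b",
         "PrimConstruct_a" = "PrimConstruct_b"),
  match_fun = list(`==`, close_enough)
)
full
```

Because the SampleID columns are compared with `==`, a "cohesion" in abc0101 is never paired with a "Cohesion" in bcd0201.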

As you might have noticed, the match can return more than one row per `df1` row, since several candidates may satisfy the condition. You can therefore process the result group-wise, which is why I introduced "rn", as you might want only the closest match. To show how to get there, here is an alternative approach that does not use `fuzzyjoin` but the `stringdist` package (which `fuzzyjoin` is based on), as it might help you understand the mechanics better:

    # add a row number that uniquely identifies each df1 row
    res <- dplyr::mutate(df1, rn = dplyr::row_number()) %>%
        # left join on the column that must match exactly (all rows of df1 and all matching rows of df2)
        dplyr::left_join(df2, by = c("SampleID_a" = "SampleID_b")) %>%
        # compute the string distance and set it to 0 where it is NA (no left join match) for the next step
        dplyr::mutate(dist = dplyr::coalesce(stringdist::stringdist(PrimConstruct_a,
                                                                    PrimConstruct_b), 0)) %>%
        # group by the df1 row id
        dplyr::group_by(rn) %>%
        # keep the first row per group, ordered by the calculated dist column
        # this is why dist was set to 0 if NA before
        # with_ties = FALSE returns only one match even if two have the same distance
        dplyr::slice_min(order_by = dist, n = 1, with_ties = FALSE) %>% 
        # release the grouping to prevent unwanted behaviour downstream
        dplyr::ungroup() %>%
        # set dist back to NA where there was no match (you might instead select columns or do more calculations here)
        dplyr::mutate(dist = ifelse(is.na(PrimConstruct_b), NA, dist))

    res
    # A tibble: 5 × 5
      SampleID_a PrimConstruct_a    rn PrimConstruct_b  dist
      <chr>      <chr>           <int> <chr>           <dbl>
    1 abc0101    cohesion            1 cohesion            0
    2 abc0101    cognition           2 cognition           0
    3 bcd0201    cohesion            3 Cohesion            1
    4 bcd0201    cognition           4 cognitiion          1
    5 proof      00000               5 NA                 NA
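
One thing to keep in mind with the closest-match step: it pairs every `df1` row with its nearest `df2` row in the same SampleID, however far apart the two strings are. A small sketch of adding a cutoff afterwards; the sample id "xyz0301", the column names and the cutoff of one edit are hypothetical:

```r
library(dplyr)

# Hypothetical group in which the only candidate is not a plausible match
left  <- data.frame(id = "xyz0301", a = "cohesion")
right <- data.frame(id = "xyz0301", b = "commitment")

paired <- left %>%
  left_join(right, by = "id") %>%
  # distance between the two strings (default method of stringdist)
  mutate(dist = stringdist::stringdist(a, b)) %>%
  # blank out the partner when it is more than one edit away (assumed cutoff)
  mutate(b = ifelse(dist > 1, NA_character_, b))

paired
```

Without the last `mutate()`, "cohesion" would stay paired with "commitment" simply because it is the only candidate in its group.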

Depending on your use case there are some possible tweaks:
 
1) You could use tolower() or toupper() (the result is the same) inside the matching function of the fuzzy join, or on PrimConstruct_a and PrimConstruct_b before the join or distance calculation. This removes all string distance caused by upper versus lower case and leaves only letter differences to be counted.

2) Read up on the different string distance methods that the `stringdist` package, and therefore `fuzzyjoin`, implements. They differ in ways that might be helpful depending on the use case.
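
Both tweaks can be illustrated on the strings from the question. The sketch below compares the Levenshtein distance with and without lower-casing and also computes the `"jw"` distance used in the question; the variable names are mine:

```r
a <- c("cohesion", "cognition", "cohesion")
b <- c("Cohesion", "cognitiion", "commitment")

# Levenshtein distance counts a case difference as one edit
lv_raw <- stringdist::stringdist(a, b, method = "lv")

# Lower-casing first leaves only genuine spelling differences
lv_lower <- stringdist::stringdist(tolower(a), tolower(b), method = "lv")

# "jw" distances lie between 0 (identical) and 1 (completely different)
jw <- stringdist::stringdist(tolower(a), tolower(b), method = "jw")

data.frame(a, b, lv_raw, lv_lower, jw)
```

Since `"jw"` distances never exceed 1, a cutoff such as `max_dist = 2` accepts every pair of strings, which is one plausible reason the `stringdist_join()` call in the question returned so many rows. With an edit-based method such as `"lv"`, a cutoff of 1 or 2 means one or two edits.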

**EDIT**

You are looking for something like a full join, though that is not quite as easy. One way to solve it is to take the result of the previous step (stored as "res"), identify the unused rows of df2, and union them back onto the result like this:

    mis <- df2 %>% 
        # an anti join from df2 against the result identifies the unmatched rows of df2
        dplyr::anti_join(res, by = c("SampleID_b" = "SampleID_a", 
                                     "PrimConstruct_b")) %>% 
        # give the data the same columns as res so the two can be unioned/bound together
        dplyr::transmute(SampleID_a = SampleID_b, 
                         PrimConstruct_a = NA, 
                         rn = NA, 
                         PrimConstruct_b,
                         dist = NA)

    mis
      SampleID_a PrimConstruct_a rn PrimConstruct_b dist
    1    bcd0201              NA NA      commitment   NA


    dplyr::union(res, mis)
    # A tibble: 6 x 5
      SampleID_a PrimConstruct_a    rn PrimConstruct_b  dist
      <chr>      <chr>           <int> <chr>           <dbl>
    1 abc0101    cohesion            1 cohesion            0
    2 abc0101    cognition           2 cognition           0
    3 bcd0201    cohesion            3 Cohesion            1
    4 bcd0201    cognition           4 cognitiion          1
    5 proof      00000               5 NA                 NA
    6 bcd0201    NA                 NA commitment         NA
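
The steps above can be collected into one pipeline. The following is a sketch under my own assumptions rather than the answer's exact method: it uses the question's data (without the extra "proof" row), ignores case via `tolower()`, applies an assumed cutoff of one edit, and uses `bind_rows()` instead of `union()`:

```r
library(dplyr)

df1 <- data.frame(
  SampleID_a = c("abc0101", "abc0101", "bcd0201", "bcd0201"),
  PrimConstruct_a = c("cohesion", "cognition", "cohesion", "cognition")
)
df2 <- data.frame(
  SampleID_b = c("abc0101", "abc0101", "bcd0201", "bcd0201", "bcd0201"),
  PrimConstruct_b = c("cohesion", "cognition", "commitment", "Cohesion", "cognitiion")
)

max_dist <- 1  # assumed cutoff: at most one edit after lower-casing

# Closest df2 partner per df1 row, within the same SampleID and the cutoff
matched <- df1 %>%
  mutate(rn = row_number()) %>%
  inner_join(df2, by = c("SampleID_a" = "SampleID_b")) %>%
  mutate(dist = stringdist::stringdist(tolower(PrimConstruct_a),
                                       tolower(PrimConstruct_b),
                                       method = "lv")) %>%
  filter(dist <= max_dist) %>%
  group_by(rn) %>%
  slice_min(order_by = dist, n = 1, with_ties = FALSE) %>%
  ungroup() %>%
  select(-rn)

# Rows of either input that found no partner
only_left <- anti_join(df1, matched, by = c("SampleID_a", "PrimConstruct_a"))
only_right <- df2 %>%
  anti_join(matched, by = c("SampleID_b" = "SampleID_a", "PrimConstruct_b")) %>%
  rename(SampleID_a = SampleID_b)

# bind_rows() fills the columns missing from the unmatched rows with NA
result <- bind_rows(matched, only_left, only_right)
result
```

Recent dplyr versions may warn about a many-to-many relationship in the `inner_join()`; that is expected here, because every row of a SampleID is deliberately compared with every candidate in the same SampleID.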

</details>



huangapple
  • Published on 2023-03-07 06:29:08
  • When reposting, please keep the link to this article: https://go.coder-hub.com/75656430.html