Joining dataframes on text strings using fuzzy string matching (stringdist_join())
Question
I'm trying to join two datasets based on the values of two variables. Both datasets have the same variable names/number of columns but may have a different number of rows. I want to join them based on a grouping variable ("SampleID") and a variable that contains text strings ("PrimConstruct"). I want to use fuzzy matching to account for slight misspellings and differences in capitalization in "PrimConstruct". There's also the chance that one dataset has a row with a text value that the other dataset does not, in which case I would like it to leave that as a separate row.
df1 <- data.frame(SampleID_a = c("abc0101", "abc0101", "bcd0201",
"bcd0201"), PrimConstruct_a = c("cohesion", "cognition",
"cohesion", "cognition"))
df2 <- data.frame(SampleID_b = c("abc0101", "abc0101", "bcd0201", "bcd0201",
"bcd0201"), PrimConstruct_b = c("cohesion", "cognition",
"commitment", "Cohesion", "cognitiion"))
# df2 has misspelling, different capitalization,
# and entry with no close match
I would like the fuzzy matching to only occur within the same SampleID, so "cohesion" for abc0101 will not be matched with "cohesion" for bcd0201. I am eventually hoping to have a dataset that keeps both PrimConstruct columns and all the values, but the matched values are in the same row:
desireddf <- data.frame(SampleID_a = c("abc0101", "abc0101", "bcd0201", "bcd0201", "bcd0201"),
PrimConstruct_a = c("cohesion", "cognition", "cohesion", "cognition", NA),
PrimConstruct_b = c("cohesion", "cognition", "Cohesion", "cognitiion", "commitment"))
I have tried stringdist_join, but it ends up duplicating and adding a bunch of rows somehow.
joined <- stringdist_join(df1,
df2,
by = c("PrimConstruct_a" = "PrimConstruct_b",
"SampleID_a" = "SampleID_b"),
mode = "full",
method = "jw",
max_dist = 2,
ignore_case = T)
I'm a little confused by the different matching methods, but I don't expect differences in text strings to be much more than minor spelling mistakes and differences in capitalization, which I believe should be covered by ignore_case.
Answer 1
Score: 1
You might use {powerjoin}:
df1 <- data.frame(SampleID_a = c("abc0101", "abc0101", "bcd0201",
"bcd0201"), PrimConstruct_a = c("cohesion", "cognition",
"cohesion", "cognition"))
df2 <- data.frame(SampleID_b = c("abc0101", "abc0101", "bcd0201", "bcd0201",
"bcd0201"), PrimConstruct_b = c("cohesion", "cognition",
"commitment", "Cohesion", "cognitiion"))
powerjoin::power_full_join(df1, df2, by = c(SampleID_a = "SampleID_b", ~ stringdist::stringdist(.x$PrimConstruct_a, .y$PrimConstruct_b) < 2))
#> SampleID_a PrimConstruct_a PrimConstruct_b
#> 1 abc0101 cohesion cohesion
#> 2 abc0101 cognition cognition
#> 3 bcd0201 cohesion Cohesion
#> 4 bcd0201 cognition cognitiion
#> 5 bcd0201 <NA> commitment
<sup>Created on 2023-03-16 with reprex v2.0.2</sup>
Note that I renamed the second column of df2 to PrimConstruct_b.
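To see which pairs a condition like `stringdist(...) < 2` would accept, it can help to look at the pairwise distances for one SampleID. The following is only an illustrative sketch using the sample values for bcd0201; it relies on `stringdist::stringdistmatrix()` with its default method ("osa"), and the variable names `a`, `b`, and `d` are made up for this example.

```r
library(stringdist)

# PrimConstruct values for SampleID "bcd0201" in df1 (a) and df2 (b)
a <- c("cohesion", "cognition")
b <- c("commitment", "Cohesion", "cognitiion")

# Pairwise distances with the default method ("osa");
# rows correspond to a, columns to b
d <- stringdistmatrix(a, b, useNames = "strings")
d

# TRUE marks the pairs that a `< 2` condition would keep
d < 2
```

Only the "cohesion"/"Cohesion" and "cognition"/"cognitiion" pairs fall below the threshold here, so "commitment" stays unmatched and ends up in its own row in the full join.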
Answer 2
Score: 0
<details>
<summary>English:</summary>
A string-distance join with multiple column matches and distinct match functions (in your case one equi-join and one fuzzy join) can be done like this with the `fuzzyjoin` package:
# dummy data ... I introduced a non matching case
df1 <- data.frame(SampleID_a = c("abc0101", "abc0101", "bcd0201", "bcd0201", "proof"),
PrimConstruct_a = c("cohesion", "cognition", "cohesion", "cognition", "00000"))
df2 <- data.frame(SampleID_b = c("abc0101", "abc0101", "bcd0201", "bcd0201", "bcd0201"),
PrimConstruct_b = c("cohesion", "cognition", "commitment", "Cohesion", "cognitiion"))
# load dplyr so that the %>% pipe used below is available
library(dplyr)
# build row identification
dplyr::mutate(df1, rn = dplyr::row_number()) %>%
# perform join
fuzzyjoin::fuzzy_left_join(df2,
# define join columns
by=c("SampleID_a" = "SampleID_b",
"PrimConstruct_a" = "PrimConstruct_b"),
# list of match functions (first should be clear)
match_fun = list(`==`,
# function returning a logical vector: TRUE where the Levenshtein distance is below 2 (i.e. at most 1)
function(x,y)
stringdist::stringdist(x, y, method="lv") < 2)
)
SampleID_a PrimConstruct_a rn SampleID_b PrimConstruct_b
1 abc0101 cohesion 1 abc0101 cohesion
2 abc0101 cognition 2 abc0101 cognition
3 bcd0201 cohesion 3 bcd0201 Cohesion
4 bcd0201 cognition 4 bcd0201 cognitiion
5 proof 00000 5 <NA> <NA>
As you might have noticed, the match can return more than one row per input row, since several candidates may satisfy the condition. So you could now work on the data group-wise, which is why I introduced "rn", as you might want only the closest match. To show you how to get there, I will leave an optional approach that does not use `fuzzyjoin` but the `stringdist` package (which `fuzzyjoin` builds on), as it might help you understand the workings better:
# build new variable row number to identify each df1 row uniquely
res <- dplyr::mutate(df1, rn = dplyr::row_number()) %>%
# left join by the column that matches exactly (all rows from df1 and all matching rows from df2)
dplyr::left_join(df2, by = c("SampleID_a" = "SampleID_b")) %>%
# compute the string distance and set it to 0 if NA (no left join match) for the next step
dplyr::mutate(dist = dplyr::coalesce(stringdist::stringdist(PrimConstruct_a,
PrimConstruct_b), 0)) %>%
# build grouping by input df1 rows id
dplyr::group_by(rn) %>%
# get first row ordered by calculated dist column
# this is why dist was set 0 if NA before
# with_ties false will return only one match even if there are two with the same distance
dplyr::slice_min(order_by = dist, n = 1, with_ties = FALSE) %>%
# release grouping to prevent unwanted behaviour down stream
dplyr::ungroup() %>%
# I set dist back to NA where there was no match (at this point you might already select columns or perform more calculations)
dplyr::mutate(dist = ifelse(is.na(PrimConstruct_b), NA, dist))
res
# A tibble: 5 × 5
SampleID_a PrimConstruct_a rn PrimConstruct_b dist
<chr> <chr> <int> <chr> <dbl>
1 abc0101 cohesion 1 cohesion 0
2 abc0101 cognition 2 cognition 0
3 bcd0201 cohesion 3 Cohesion 1
4 bcd0201 cognition 4 cognitiion 1
5 proof 00000 5 NA NA
Depending on your use case there are some possible tweaks:
1) You could use tolower() or toupper() (the result is the same) in the matching function of the fuzzyjoin, or apply it to PrimConstruct_a and PrimConstruct_b prior to the join or distance calculation. This removes all string distance caused by upper vs lower case and leaves only actual letter differences to be counted.
2) Read up on the different string distance functions that the `stringdist` package, and therefore `fuzzyjoin`, implements. There are some differences that might be helpful depending on the use case.
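Both tweaks can be illustrated with a few calls to `stringdist::stringdist()`. This is a minimal sketch rather than a recommendation for a particular method; the variable names are chosen for this example only. One point worth noting is that the Jaro-Winkler distance (`method = "jw"`) is scaled between 0 and 1, so a threshold such as `max_dist = 2` accepts every pair, which would be consistent with the extra rows described in the question.

```r
library(stringdist)

# Tweak 1: case folding removes the distance caused by capitalization alone
raw    <- stringdist("cohesion", "Cohesion", method = "lv")
folded <- stringdist(tolower("cohesion"), tolower("Cohesion"), method = "lv")

# Tweak 2: methods use different scales.
# Levenshtein ("lv") counts edits; Jaro-Winkler ("jw") lies between 0 and 1.
lv <- stringdist("cohesion", "commitment", method = "lv")
jw <- stringdist("cohesion", "commitment", method = "jw")

c(raw = raw, folded = folded, lv = lv, jw = jw)
```

With an edit-count method such as "lv" or "osa", a threshold of 1 or 2 means "at most that many edits", whereas with "jw" a threshold would need to be a fraction well below 1 to exclude anything.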
**EDIT**
You are looking for something like a full join, though that is not quite as easy. One way to solve it is to assign the data from the previous step to a new variable (I called it "res"), identify the unused cases from df2, and union them back to the result like this:
mis <- df2 %>%
# an anti join from df2 against the result identifies the unmatched cases
dplyr::anti_join(res, by = c("SampleID_b" = "SampleID_a",
"PrimConstruct_b")) %>%
# format the data to have the same columns as res to be able to union/bind them
dplyr::transmute(SampleID_a = SampleID_b,
PrimConstruct_a = NA,
rn = NA,
PrimConstruct_b,
dist = NA)
mis
SampleID_a PrimConstruct_a rn PrimConstruct_b dist
1 bcd0201 NA NA commitment NA
dplyr::union(res, mis)
# A tibble: 6 x 5
SampleID_a PrimConstruct_a rn PrimConstruct_b dist
<chr> <chr> <int> <chr> <dbl>
1 abc0101 cohesion 1 cohesion 0
2 abc0101 cognition 2 cognition 0
3 bcd0201 cohesion 3 Cohesion 1
4 bcd0201 cognition 4 cognitiion 1
5 proof 00000 5 NA NA
6 bcd0201 NA NA commitment NA
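The closest-match step above picks the nearest candidate regardless of how far away it is. A variant with an explicit distance cut-off can be sketched in base R plus `stringdist`. Everything here is an assumption made for illustration: the threshold `max_dist`, the helper `key()`, and the choice of case-insensitive Levenshtein distance. It uses the question's original df1 and df2 and does not reduce multiple in-tolerance candidates to a single closest one.

```r
library(stringdist)

df1 <- data.frame(SampleID_a = c("abc0101", "abc0101", "bcd0201", "bcd0201"),
                  PrimConstruct_a = c("cohesion", "cognition", "cohesion", "cognition"))
df2 <- data.frame(SampleID_b = c("abc0101", "abc0101", "bcd0201", "bcd0201", "bcd0201"),
                  PrimConstruct_b = c("cohesion", "cognition", "commitment",
                                      "Cohesion", "cognitiion"))

max_dist <- 1  # assumed tolerance: at most one edit after case folding

# All candidate pairs sharing a SampleID (exact match on the grouping column)
pairs <- merge(df1, df2, by.x = "SampleID_a", by.y = "SampleID_b")

# Case-insensitive Levenshtein distance for each candidate pair
pairs$dist <- stringdist(tolower(pairs$PrimConstruct_a),
                         tolower(pairs$PrimConstruct_b), method = "lv")
matched <- pairs[pairs$dist <= max_dist, ]

# Hypothetical helper: composite key used to find rows that were never matched
key <- function(id, txt) paste(id, txt, sep = "\r")
only1 <- df1[!key(df1$SampleID_a, df1$PrimConstruct_a) %in%
               key(matched$SampleID_a, matched$PrimConstruct_a), ]
only2 <- df2[!key(df2$SampleID_b, df2$PrimConstruct_b) %in%
               key(matched$SampleID_a, matched$PrimConstruct_b), ]

# Matched pairs, then leftovers from df1 and df2 padded with NA
result <- rbind(
  matched[, c("SampleID_a", "PrimConstruct_a", "PrimConstruct_b")],
  data.frame(SampleID_a = only1$SampleID_a,
             PrimConstruct_a = only1$PrimConstruct_a,
             PrimConstruct_b = rep(NA_character_, nrow(only1))),
  data.frame(SampleID_a = only2$SampleID_b,
             PrimConstruct_a = rep(NA_character_, nrow(only2)),
             PrimConstruct_b = only2$PrimConstruct_b)
)
result
```

For the sample data this keeps the four close pairs and adds "commitment" as a row of its own. If several df2 values can fall within the tolerance for one df1 row, a closest-match step like the `slice_min()` call above would still be needed.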
</details>