英文:
Apply function to dataframe-column (dplyr)
问题
Here is the translation of the code parts you provided:
我有一个包含一个自由选择的“代码”列的数据框架(例如“abcde”,“blablabla”)。由于数据来自在线调查,有时用户会输错他们应该输入的代码。例如,他们写成了“bacde”而不是“abcde”。
我编写了一个名为`correctCodes()`的函数,它返回“纠正后”的代码。它使用Levenshtein距离来计算两个字符串之间的相似度。只有当至少有两个“参考代码”并且相似度超过某个阈值(.8)时,才会纠正代码。
函数计算两个字符串之间的相似度的部分如下:
```r
# 计算两个字符串之间的相似度的函数
mylevsim = function (str1, str2) {
return(1 - (RecordLinkage::levenshteinDist(str1, str2)/max(nchar(str1), nchar(str2))))
}
返回纠正代码的函数如下:
# 返回纠正后的代码的函数
correctCodes <- function(wrongCode) {
wrongCode <- trimws(toupper(wrongCode))
counts <- df %>%
mutate(code = trimws(toupper(code))) %>%
group_by(code) %>%
summarise(n = n(),
Var2 = code,
code = NULL) %>%
filter(!duplicated(code))
expand.grid(trimws(toupper(df$code)), trimws(toupper(df$code)), stringsAsFactors = FALSE) %>%
mutate(similarity = mylevsim(Var1, Var2)) %>%
arrange(Var1, desc(similarity)) %>%
left_join(counts, by = "Var2") %>%
mutate(both = paste(Var1, Var2)) %>%
filter(!duplicated(both),
Var1 != Var2,
Var1 == wrongCode,
n > 2,
similarity > .8) %>%
filter(row_number()==1) %>%
pull(Var2)
}
现在,如果我像这样调用函数,它可以正常工作(函数返回“ABCDE”):
correctCodes("abccde") # 返回 ABCDE
correctCodes("bbcde") # 返回 ABCDE
correctCodes("efghj") # 返回 EFGHI
然而,如果我想在dplyr::mutate()
中使用它,它完全不起作用:
df %>%
mutate(code_corrected = correctCodes(code))
请注意,要使此代码在dplyr::mutate()
中正常工作,您可能需要将函数correctCodes()
放在您的数据框外部,以确保它可以访问数据框df
。
英文:
I have a dataframe containing a column with some freely chosen "codes" (e.g. "abcde", "blablabla"). Because the data come from an online survey sometimes the users mistyped the code they were supposed to enter. For example they wrote "bacde" instead of "abcde".
library(RecordLinkage)
library(tidyverse)
# Sample data
df <- data.frame(code = c(rep("abcde", 20), "bbcde", "abccde", rep("efghi", 20), "efigh", "efghj"))
I wrote a function correctCodes()
that returns the "corrected" code. It uses levenshtein-distance to calculate the similarity between two strings. A code will only be corrected if there are at least two "reference codes" and if similarity exceeds a certain threshold (.8)
# function that calculates similarity between two strings
mylevsim = function (str1, str2) {
return(1 - (RecordLinkage::levenshteinDist(str1, str2)/max(nchar(str1), nchar(str2))))
}
# Function that returns the corrected code
correctCodes <- function(wrongCode) {
wrongCode <- trimws(toupper(wrongCode))
counts <- df %>%
mutate(code = trimws(toupper(code))) %>%
group_by(code) %>%
summarise(n = n(),
Var2 = code,
code = NULL) %>%
filter(!duplicated(code))
expand.grid(trimws(toupper(df$code)), trimws(toupper(df$code)), stringsAsFactors = FALSE) %>%
mutate(similarity = mylevsim(Var1, Var2)) %>%
arrange(Var1, desc(similarity)) %>%
left_join(counts, by = "Var2") %>%
mutate(both = paste(Var1, Var2)) %>%
filter(!duplicated(both),
Var1 != Var2,
Var1 == wrongCode,
n > 2,
similarity > .8) %>%
filter(row_number()==1) %>%
pull(Var2)
}
Now, this works fine if I call the function like this (the function returns "ABCDE"):
correctCodes("abccde") # returns ABCDE
correctCodes("bbcde") # returns ABCDE
correctCodes("efghj") # returns EFGHI
However, if I want to use it within dplyr::mutate()
it doesn't work at all:
df %>%
mutate(code_corrected = correctCodes(code))
答案1
得分: 1
以下是您要翻译的内容:
mutate
将整个列作为向量传递给函数,但该函数目前只能接受单个字符串。
correctCodes(df$code)
#> character(0)
在mutate
之前使用rowwise()
将起作用:
df %>%
rowwise() %>%
mutate(code_corrected = correctCodes(code))
但还有两个问题...
- 传递正确的代码不起作用,因为
Var1 != Var2
将其作为过滤选项删除:
correctCodes("abcde")
#> character(0)
- 传递完全错误的字符串不起作用,因为这个字符串的得分为0.667,因此不满足0.8的阈值:
correctCodes("efigh")
#> character(0)
如果构建一个检查"代码已经正确"的功能,那么这将绕过问题#2。问题#3可能需要不同的解决方案(或选择得分最高的匹配项?)
编辑 - 可能的编辑函数/调用作为解决问题的一种方式:
correctCodes <- function(wrongCode) {
wrongCode <- trimws(toupper(wrongCode))
counts <- df %>%
mutate(code = trimws(toupper(code))) %>%
count(code) %>%
filter(n > 2) |>
rename(Var2 = code)
right_codes <- counts |>
pull(Var2)
if (wrongCode %in% right_codes) return(wrongCode)
expand.grid(trimws(toupper(df$code)), trimws(toupper(df$code)), stringsAsFactors = FALSE) %>%
mutate(similarity = mylevsim(Var1, Var2)) %>%
arrange(Var1, desc(similarity)) %>%
inner_join(counts, by = "Var2") %>%
mutate(both = paste(Var1, Var2)) %>%
filter(!duplicated(both),
Var1 != Var2,
Var1 == wrongCode) %>%
arrange(desc(similarity)) %>%
filter(row_number()==1) %>%
pull(Var2)
}
df %>%
rowwise() %>%
mutate(code_corrected = correctCodes(code))
#> # A tibble: 44 × 2
#> # Rowwise:
#> code code_corrected
#> <chr> <chr>
#> 1 abcde ABCDE
#> 2 abcde ABCDE
#> 3 abcde ABCDE
#> 4 abcde ABCDE
#> 5 abcde ABCDE
#> 6 abcde ABCDE
#> 7 abcde ABCDE
#> 8 abcde ABCDE
#> 9 abcde ABCDE
#> 10 abcde ABCDE
#> # ℹ 34 more rows
希望这有助于您理解这段代码和问题的解决方法。
英文:
There are at least three things stopping your code from working correctly at the moment:
mutate
passes the whole column as a vector to the function, which can currently only take a single string.
correctCodes(df$code)
#> character(0)
Using rowwise()
ahead of mutate
would work:
df %>%
rowwise() %>%
mutate(code_corrected = correctCodes(code))
But there are two further problems...
- Passing a correct code doesn't work as
Var1 != Var2
drops that as an option in your filtering:
correctCodes("abcde")
#> character(0)
- Passing a super-wrong string doesn't work, as this one gives a score of 0.667 so doesn't pass the 0.8 threshold:
correctCodes("efigh")
#> character(0)
If you build in a check for "code already correct" then this would bypass problem #2. Problem #3 may require a different solution though (or choose the highest scoring match?)
Edit - possible edited function/call as one way of solving problem:
correctCodes <- function(wrongCode) {
wrongCode <- trimws(toupper(wrongCode))
counts <- df %>%
mutate(code = trimws(toupper(code))) %>%
count(code) %>%
filter(n > 2) |>
rename(Var2 = code)
right_codes <- counts |>
pull(Var2)
if (wrongCode %in% right_codes) return(wrongCode)
expand.grid(trimws(toupper(df$code)), trimws(toupper(df$code)), stringsAsFactors = FALSE) %>%
mutate(similarity = mylevsim(Var1, Var2)) %>%
arrange(Var1, desc(similarity)) %>%
inner_join(counts, by = "Var2") %>%
mutate(both = paste(Var1, Var2)) %>%
filter(!duplicated(both),
Var1 != Var2,
Var1 == wrongCode) %>%
arrange(desc(similarity)) %>%
filter(row_number()==1) %>%
pull(Var2)
}
df %>%
rowwise() %>%
mutate(code_corrected = correctCodes(code))
#> # A tibble: 44 × 2
#> # Rowwise:
#> code code_corrected
#> <chr> <chr>
#> 1 abcde ABCDE
#> 2 abcde ABCDE
#> 3 abcde ABCDE
#> 4 abcde ABCDE
#> 5 abcde ABCDE
#> 6 abcde ABCDE
#> 7 abcde ABCDE
#> 8 abcde ABCDE
#> 9 abcde ABCDE
#> 10 abcde ABCDE
#> # ℹ 34 more rows
答案2
得分: 1
需要对这个函数进行向量化:
vectorized_correctCodes <- Vectorize(correctCodes)
df %>%
mutate(code_corrected = vectorized_correctCodes(code))
code code_corrected
1 abcde
2 abcde
3 abcde
4 abcde
5 abcde
6 abcde
7 abcde
8 abcde
9 abcde
10 abcde
11 abcde
12 abcde
13 abcde
14 abcde
15 abcde
16 abcde
17 abcde
18 abcde
19 abcde
20 abcde
21 bbcde ABCDE
22 abccde ABCDE
23 efghi
24 efghi
25 efghi
26 efghi
27 efghi
28 efghi
29 efghi
30 efghi
31 efghi
32 efghi
33 efghi
34 efghi
35 efghi
36 efghi
37 efghi
38 efghi
39 efghi
40 efghi
41 efghi
42 efghi
43 efigh
44 efghj EFGHI
英文:
You need to vectorize the function:
vectorized_correctCodes <- Vectorize(correctCodes)
df %>%
mutate(code_corrected = vectorized_correctCodes(code))
code code_corrected
1 abcde
2 abcde
3 abcde
4 abcde
5 abcde
6 abcde
7 abcde
8 abcde
9 abcde
10 abcde
11 abcde
12 abcde
13 abcde
14 abcde
15 abcde
16 abcde
17 abcde
18 abcde
19 abcde
20 abcde
21 bbcde ABCDE
22 abccde ABCDE
23 efghi
24 efghi
25 efghi
26 efghi
27 efghi
28 efghi
29 efghi
30 efghi
31 efghi
32 efghi
33 efghi
34 efghi
35 efghi
36 efghi
37 efghi
38 efghi
39 efghi
40 efghi
41 efghi
42 efghi
43 efigh
44 efghj EFGHI
答案3
得分: 0
I would use a different approach. As you solely depend on user input and you can assume that more people enter it correctly than wrong. I translated that in any entry > 2 we trust (feel free to change) and that means that the other ones we try to fix.
我会采用不同的方法。因为你完全依赖用户输入,可以假设输入正确的人比错误的人多。我将这个翻译为任何条目 > 2,我们信任(随时可以更改),这意味着其他条目我们尝试修复。
For that we split the data, then do a cartesian join and get all combinations we need to calculate the distance for. Then we check for 0.8 similarity and up and pick the best match if more than one was found. Then join your data back. for illustration I just added the column trusted and the similarity. So those you can fix your codes for.
为此,我们拆分数据,然后进行笛卡尔连接,并获取我们需要计算距离的所有组合。然后我们检查相似度是否大于等于0.8,并在找到多个匹配项时选择最佳匹配。然后将数据重新连接。为了说明问题,我只添加了一个名为trusted和相似度的列。所以你可以根据这些来修复你的代码。
code trusted similarity
1: ABCDE NA
2: ABCDE NA
3: ABCDE NA
4: ABCDE NA
5: ABCDE NA
6: ABCDE NA
7: ABCDE NA
8: ABCDE NA
9: ABCDE NA
10: ABCDE NA
11: ABCDE NA
12: ABCDE NA
13: ABCDE NA
14: ABCDE NA
15: ABCDE NA
16: ABCDE NA
17: ABCDE NA
18: ABCDE NA
19: ABCDE NA
20: ABCDE NA
21: BBCDE ABCDE 0.80000
22: ABCCDE ABCDE 0.83333
23: EFGHI NA
24: EFGHI NA
25: EFGHI NA
26: EFGHI NA
27: EFGHI NA
28: EFGHI NA
29: EFGHI NA
30: EFGHI NA
31: EFGHI NA
32: EFGHI NA
33: EFGHI NA
34: EFGHI NA
35: EFGHI NA
36: EFGHI NA
37: EFGHI NA
38: EFGHI NA
39: EFGHI NA
40: EFGHI NA
41: EFGHI NA
42: EFGHI NA
43: EFIGH NA
44: EFGHJ EFGHI 0.80000
英文:
I would use a different approach. As you solely depend on user input and you can assume that more people enter it correctly than wrong. I translated that in any entry > 2 we trust (feel free to change) and that means that the other ones we try to fix.
For that we split the data, then do a cartesian join and get all combinations we need to calculate the distance for. Then we check for 0.8 similarity and up and pick the best match if more than one was found. Then join your data back. for illustration I just added the column trusted and the similarity. So those you can fix your codes for.
library(data.table)
library(stringdist)
setDT(df)
# standardize your codes
df[, code := trimws(toupper(code))]
# split your data into trusted codes and those you want to "verify / fix"
trusted <- df[, .N, code][N > 2, code]
verify <- df[!code %in% trusted]
# do a cartesian join to get all combinations you want to calculate the distance
check <- verify[, .(trusted = unlist(trusted)), by = verify]
# calculate similarity, I used stringdist package here
check <- check[, similarity := stringsim(code, trusted, "lv")][similarity >= 0.8][, .SD[which.max(similarity)], code]
# join back, from there you can change them if you like
check[df, on = c("code")]
# code trusted similarity
# 1: ABCDE <NA> NA
# 2: ABCDE <NA> NA
# 3: ABCDE <NA> NA
# 4: ABCDE <NA> NA
# 5: ABCDE <NA> NA
# 6: ABCDE <NA> NA
# 7: ABCDE <NA> NA
# 8: ABCDE <NA> NA
# 9: ABCDE <NA> NA
# 10: ABCDE <NA> NA
# 11: ABCDE <NA> NA
# 12: ABCDE <NA> NA
# 13: ABCDE <NA> NA
# 14: ABCDE <NA> NA
# 15: ABCDE <NA> NA
# 16: ABCDE <NA> NA
# 17: ABCDE <NA> NA
# 18: ABCDE <NA> NA
# 19: ABCDE <NA> NA
# 20: ABCDE <NA> NA
# 21: BBCDE ABCDE 0.80000
# 22: ABCCDE ABCDE 0.83333
# 23: EFGHI <NA> NA
# 24: EFGHI <NA> NA
# 25: EFGHI <NA> NA
# 26: EFGHI <NA> NA
# 27: EFGHI <NA> NA
# 28: EFGHI <NA> NA
# 29: EFGHI <NA> NA
# 30: EFGHI <NA> NA
# 31: EFGHI <NA> NA
# 32: EFGHI <NA> NA
# 33: EFGHI <NA> NA
# 34: EFGHI <NA> NA
# 35: EFGHI <NA> NA
# 36: EFGHI <NA> NA
# 37: EFGHI <NA> NA
# 38: EFGHI <NA> NA
# 39: EFGHI <NA> NA
# 40: EFGHI <NA> NA
# 41: EFGHI <NA> NA
# 42: EFGHI <NA> NA
# 43: EFIGH <NA> NA
# 44: EFGHJ EFGHI 0.80000
通过集体智慧和协作来改善编程学习和解决问题的方式。致力于成为全球开发者共同参与的知识库,让每个人都能够通过互相帮助和分享经验来进步。
评论