2023年4月17日 19:05:04go评论94阅读模式

英文:

Apply function to dataframe-column (dplyr)

问题

Here is the translation of the code parts you provided:

我有一个包含一个自由选择的“代码”列的数据框架（例如“abcde”，“blablabla”）。由于数据来自在线调查，有时用户会输错他们应该输入的代码。例如，他们写成了“bacde”而不是“abcde”。
我编写了一个名为`correctCodes()`的函数，它返回“纠正后”的代码。它使用Levenshtein距离来计算两个字符串之间的相似度。只有当至少有两个“参考代码”并且相似度超过某个阈值（.8）时，才会纠正代码。
函数计算两个字符串之间的相似度的部分如下：
```r
# 计算两个字符串之间的相似度的函数
mylevsim = function (str1, str2) {
  return(1 - (RecordLinkage::levenshteinDist(str1, str2)/max(nchar(str1), nchar(str2))))
}

返回纠正代码的函数如下：

# 返回纠正后的代码的函数
correctCodes <- function(wrongCode) {
  wrongCode <- trimws(toupper(wrongCode))
  
  counts <- df %>%
    mutate(code = trimws(toupper(code))) %>%
    group_by(code) %>%
    summarise(n = n(),
              Var2 = code,
              code = NULL) %>%
    filter(!duplicated(code))
  
  expand.grid(trimws(toupper(df$code)), trimws(toupper(df$code)), stringsAsFactors = FALSE) %>%
    mutate(similarity = mylevsim(Var1, Var2)) %>%
    arrange(Var1, desc(similarity)) %>%
    left_join(counts, by = "Var2") %>%
    mutate(both = paste(Var1, Var2)) %>%
    filter(!duplicated(both),
           Var1 != Var2,
           Var1 == wrongCode,
           n > 2,
           similarity > .8) %>%
    filter(row_number()==1) %>%
    pull(Var2)
}

现在，如果我像这样调用函数，它可以正常工作（函数返回“ABCDE”）：

correctCodes("abccde")  # 返回 ABCDE
correctCodes("bbcde")   # 返回 ABCDE
correctCodes("efghj")   # 返回 EFGHI

然而，如果我想在dplyr::mutate()中使用它，它完全不起作用：

df %>%
  mutate(code_corrected = correctCodes(code))

请注意，要使此代码在dplyr::mutate()中正常工作，您可能需要将函数correctCodes()放在您的数据框外部，以确保它可以访问数据框df。

英文:

I have a dataframe containing a column with some freely chosen "codes" (e.g. "abcde", "blablabla"). Because the data come from an online survey sometimes the users mistyped the code they were supposed to enter. For example they wrote "bacde" instead of "abcde".

library(RecordLinkage)
library(tidyverse)
# Sample data
df &lt;- data.frame(code = c(rep(&quot;abcde&quot;, 20), &quot;bbcde&quot;, &quot;abccde&quot;, rep(&quot;efghi&quot;, 20), &quot;efigh&quot;, &quot;efghj&quot;))

I wrote a function correctCodes()that returns the "corrected" code. It uses levenshtein-distance to calculate the similarity between two strings. A code will only be corrected if there are at least two "reference codes" and if similarity exceeds a certain threshold (.8)

# function that calculates similarity between two strings
mylevsim = function (str1, str2) {
  return(1 - (RecordLinkage::levenshteinDist(str1, str2)/max(nchar(str1), nchar(str2))))
}
# Function that returns the corrected code
correctCodes &lt;- function(wrongCode) {
  wrongCode &lt;- trimws(toupper(wrongCode))
  
  counts &lt;- df %&gt;% 
    mutate(code = trimws(toupper(code))) %&gt;%
    group_by(code) %&gt;%
    summarise(n = n(),
              Var2 = code,
              code = NULL) %&gt;%
    filter(!duplicated(code))
  
expand.grid(trimws(toupper(df$code)), trimws(toupper(df$code)), stringsAsFactors = FALSE) %&gt;%
    mutate(similarity = mylevsim(Var1, Var2)) %&gt;%
    arrange(Var1, desc(similarity)) %&gt;%
    left_join(counts, by = &quot;Var2&quot;) %&gt;%
    mutate(both = paste(Var1, Var2)) %&gt;%
    filter(!duplicated(both),
           Var1 != Var2,
           Var1 == wrongCode,
           n &gt; 2,
           similarity &gt; .8) %&gt;%
     filter(row_number()==1) %&gt;%
     pull(Var2)
}

Now, this works fine if I call the function like this (the function returns "ABCDE"):

correctCodes(&quot;abccde&quot;)  # returns ABCDE
correctCodes(&quot;bbcde&quot;)   # returns ABCDE
correctCodes(&quot;efghj&quot;)   # returns EFGHI

However, if I want to use it within dplyr::mutate() it doesn't work at all:

df %&gt;%
  mutate(code_corrected = correctCodes(code))

答案1

得分: 1

以下是您要翻译的内容：

mutate将整个列作为向量传递给函数，但该函数目前只能接受单个字符串。

correctCodes(df$code)
#&gt; character(0)

在mutate之前使用rowwise()将起作用：

df %&gt;%
  rowwise() %&gt;%
  mutate(code_corrected = correctCodes(code))

但还有两个问题...

传递正确的代码不起作用，因为Var1 != Var2将其作为过滤选项删除：

correctCodes(&quot;abcde&quot;)
#&gt; character(0)

传递完全错误的字符串不起作用，因为这个字符串的得分为0.667，因此不满足0.8的阈值：

correctCodes(&quot;efigh&quot;)
#&gt; character(0)

如果构建一个检查"代码已经正确"的功能，那么这将绕过问题＃2。问题＃3可能需要不同的解决方案（或选择得分最高的匹配项？）

编辑 - 可能的编辑函数/调用作为解决问题的一种方式：

correctCodes &lt;- function(wrongCode) {
  
  wrongCode &lt;- trimws(toupper(wrongCode))
  
  counts &lt;- df %&gt;% 
    mutate(code = trimws(toupper(code))) %&gt;%
    count(code) %&gt;%
    filter(n &gt; 2) |&gt; 
    rename(Var2 = code)
  
  right_codes &lt;- counts |&gt; 
    pull(Var2)
  if (wrongCode %in% right_codes) return(wrongCode)
  
  expand.grid(trimws(toupper(df$code)), trimws(toupper(df$code)), stringsAsFactors = FALSE) %&gt;%
    mutate(similarity = mylevsim(Var1, Var2)) %&gt;%
    arrange(Var1, desc(similarity)) %&gt;%
    inner_join(counts, by = &quot;Var2&quot;) %&gt;%
    mutate(both = paste(Var1, Var2)) %&gt;%
    filter(!duplicated(both),
           Var1 != Var2,
           Var1 == wrongCode) %&gt;%
    arrange(desc(similarity)) %&gt;%
    filter(row_number()==1) %&gt;%
    pull(Var2)
} 
df %&gt;%
  rowwise() %&gt;%
  mutate(code_corrected = correctCodes(code))
#&gt; # A tibble: 44 &#215; 2
#&gt; # Rowwise: 
#&gt;    code  code_corrected
#&gt;    &lt;chr&gt; &lt;chr&gt;         
#&gt;  1 abcde ABCDE         
#&gt;  2 abcde ABCDE         
#&gt;  3 abcde ABCDE         
#&gt;  4 abcde ABCDE         
#&gt;  5 abcde ABCDE         
#&gt;  6 abcde ABCDE         
#&gt;  7 abcde ABCDE         
#&gt;  8 abcde ABCDE         
#&gt;  9 abcde ABCDE         
#&gt; 10 abcde ABCDE         
#&gt; # ℹ 34 more rows

希望这有助于您理解这段代码和问题的解决方法。

英文:

There are at least three things stopping your code from working correctly at the moment:

mutate passes the whole column as a vector to the function, which can currently only take a single string.

correctCodes(df$code)
#&gt; character(0)

Using rowwise() ahead of mutate would work:

df %&gt;%
  rowwise() %&gt;%
  mutate(code_corrected = correctCodes(code))

But there are two further problems...

Passing a correct code doesn't work as Var1 != Var2 drops that as an option in your filtering:

correctCodes(&quot;abcde&quot;)
#&gt; character(0)

Passing a super-wrong string doesn't work, as this one gives a score of 0.667 so doesn't pass the 0.8 threshold:

correctCodes(&quot;efigh&quot;)
#&gt; character(0)

If you build in a check for "code already correct" then this would bypass problem #2. Problem #3 may require a different solution though (or choose the highest scoring match?)

Edit - possible edited function/call as one way of solving problem:

correctCodes &lt;- function(wrongCode) {
  
  wrongCode &lt;- trimws(toupper(wrongCode))
  
  counts &lt;- df %&gt;% 
    mutate(code = trimws(toupper(code))) %&gt;%
    count(code) %&gt;%
    filter(n &gt; 2) |&gt; 
    rename(Var2 = code)
  
  right_codes &lt;- counts |&gt; 
    pull(Var2)
  if (wrongCode %in% right_codes) return(wrongCode)
  
  expand.grid(trimws(toupper(df$code)), trimws(toupper(df$code)), stringsAsFactors = FALSE) %&gt;%
    mutate(similarity = mylevsim(Var1, Var2)) %&gt;%
    arrange(Var1, desc(similarity)) %&gt;%
    inner_join(counts, by = &quot;Var2&quot;) %&gt;%
    mutate(both = paste(Var1, Var2)) %&gt;%
    filter(!duplicated(both),
           Var1 != Var2,
           Var1 == wrongCode) %&gt;%
    arrange(desc(similarity)) %&gt;%
    filter(row_number()==1) %&gt;%
    pull(Var2)
} 
df %&gt;%
  rowwise() %&gt;%
  mutate(code_corrected = correctCodes(code))
#&gt; # A tibble: 44 &#215; 2
#&gt; # Rowwise: 
#&gt;    code  code_corrected
#&gt;    &lt;chr&gt; &lt;chr&gt;         
#&gt;  1 abcde ABCDE         
#&gt;  2 abcde ABCDE         
#&gt;  3 abcde ABCDE         
#&gt;  4 abcde ABCDE         
#&gt;  5 abcde ABCDE         
#&gt;  6 abcde ABCDE         
#&gt;  7 abcde ABCDE         
#&gt;  8 abcde ABCDE         
#&gt;  9 abcde ABCDE         
#&gt; 10 abcde ABCDE         
#&gt; # ℹ 34 more rows

答案2

得分: 1

需要对这个函数进行向量化：
vectorized_correctCodes <- Vectorize(correctCodes)
df %>%
  mutate(code_corrected = vectorized_correctCodes(code))
     code code_corrected
1   abcde               
2   abcde               
3   abcde               
4   abcde               
5   abcde               
6   abcde               
7   abcde               
8   abcde               
9   abcde               
10  abcde               
11  abcde               
12  abcde               
13  abcde               
14  abcde               
15  abcde               
16  abcde               
17  abcde               
18  abcde               
19  abcde               
20  abcde               
21  bbcde          ABCDE
22 abccde          ABCDE
23  efghi               
24  efghi               
25  efghi               
26  efghi               
27  efghi               
28  efghi               
29  efghi               
30  efghi               
31  efghi               
32  efghi               
33  efghi               
34  efghi               
35  efghi               
36  efghi               
37  efghi               
38  efghi               
39  efghi               
40  efghi               
41  efghi               
42  efghi               
43  efigh               
44  efghj          EFGHI

英文:

You need to vectorize the function:

vectorized_correctCodes &lt;- Vectorize(correctCodes)
df %&gt;%
  mutate(code_corrected = vectorized_correctCodes(code))
     code code_corrected
1   abcde               
2   abcde               
3   abcde               
4   abcde               
5   abcde               
6   abcde               
7   abcde               
8   abcde               
9   abcde               
10  abcde               
11  abcde               
12  abcde               
13  abcde               
14  abcde               
15  abcde               
16  abcde               
17  abcde               
18  abcde               
19  abcde               
20  abcde               
21  bbcde          ABCDE
22 abccde          ABCDE
23  efghi               
24  efghi               
25  efghi               
26  efghi               
27  efghi               
28  efghi               
29  efghi               
30  efghi               
31  efghi               
32  efghi               
33  efghi               
34  efghi               
35  efghi               
36  efghi               
37  efghi               
38  efghi               
39  efghi               
40  efghi               
41  efghi               
42  efghi               
43  efigh               
44  efghj          EFGHI

答案3

得分: 0

I would use a different approach. As you solely depend on user input and you can assume that more people enter it correctly than wrong. I translated that in any entry > 2 we trust (feel free to change) and that means that the other ones we try to fix.

我会采用不同的方法。因为你完全依赖用户输入，可以假设输入正确的人比错误的人多。我将这个翻译为任何条目 > 2，我们信任（随时可以更改），这意味着其他条目我们尝试修复。

For that we split the data, then do a cartesian join and get all combinations we need to calculate the distance for. Then we check for 0.8 similarity and up and pick the best match if more than one was found. Then join your data back. for illustration I just added the column trusted and the similarity. So those you can fix your codes for.

为此，我们拆分数据，然后进行笛卡尔连接，并获取我们需要计算距离的所有组合。然后我们检查相似度是否大于等于0.8，并在找到多个匹配项时选择最佳匹配。然后将数据重新连接。为了说明问题，我只添加了一个名为trusted和相似度的列。所以你可以根据这些来修复你的代码。

code trusted similarity

1: ABCDE NA

2: ABCDE NA

3: ABCDE NA

4: ABCDE NA

5: ABCDE NA

6: ABCDE NA

7: ABCDE NA

8: ABCDE NA

9: ABCDE NA

10: ABCDE NA

11: ABCDE NA

12: ABCDE NA

13: ABCDE NA

14: ABCDE NA

15: ABCDE NA

16: ABCDE NA

17: ABCDE NA

18: ABCDE NA

19: ABCDE NA

20: ABCDE NA

21: BBCDE ABCDE 0.80000

22: ABCCDE ABCDE 0.83333

23: EFGHI NA

24: EFGHI NA

25: EFGHI NA

26: EFGHI NA

27: EFGHI NA

28: EFGHI NA

29: EFGHI NA

30: EFGHI NA

31: EFGHI NA

32: EFGHI NA

33: EFGHI NA

34: EFGHI NA

35: EFGHI NA

36: EFGHI NA

37: EFGHI NA

38: EFGHI NA

39: EFGHI NA

40: EFGHI NA

41: EFGHI NA

42: EFGHI NA

43: EFIGH NA

44: EFGHJ EFGHI 0.80000

英文:

library(data.table)
library(stringdist)
setDT(df)
# standardize your codes
df[, code := trimws(toupper(code))]
# split your data into trusted codes and those you want to &quot;verify / fix&quot;
trusted &lt;- df[, .N, code][N &gt; 2, code]
verify &lt;- df[!code %in% trusted]
# do a cartesian join to get all combinations you want to calculate the distance
check &lt;- verify[, .(trusted = unlist(trusted)), by = verify]
# calculate similarity, I used stringdist package here 
check &lt;- check[, similarity := stringsim(code, trusted, &quot;lv&quot;)][similarity &gt;= 0.8][, .SD[which.max(similarity)], code]
# join back, from there you can change them if you like
check[df, on = c(&quot;code&quot;)]
# code trusted similarity
# 1:  ABCDE    &lt;NA&gt;         NA
# 2:  ABCDE    &lt;NA&gt;         NA
# 3:  ABCDE    &lt;NA&gt;         NA
# 4:  ABCDE    &lt;NA&gt;         NA
# 5:  ABCDE    &lt;NA&gt;         NA
# 6:  ABCDE    &lt;NA&gt;         NA
# 7:  ABCDE    &lt;NA&gt;         NA
# 8:  ABCDE    &lt;NA&gt;         NA
# 9:  ABCDE    &lt;NA&gt;         NA
# 10:  ABCDE    &lt;NA&gt;         NA
# 11:  ABCDE    &lt;NA&gt;         NA
# 12:  ABCDE    &lt;NA&gt;         NA
# 13:  ABCDE    &lt;NA&gt;         NA
# 14:  ABCDE    &lt;NA&gt;         NA
# 15:  ABCDE    &lt;NA&gt;         NA
# 16:  ABCDE    &lt;NA&gt;         NA
# 17:  ABCDE    &lt;NA&gt;         NA
# 18:  ABCDE    &lt;NA&gt;         NA
# 19:  ABCDE    &lt;NA&gt;         NA
# 20:  ABCDE    &lt;NA&gt;         NA
# 21:  BBCDE   ABCDE    0.80000
# 22: ABCCDE   ABCDE    0.83333
# 23:  EFGHI    &lt;NA&gt;         NA
# 24:  EFGHI    &lt;NA&gt;         NA
# 25:  EFGHI    &lt;NA&gt;         NA
# 26:  EFGHI    &lt;NA&gt;         NA
# 27:  EFGHI    &lt;NA&gt;         NA
# 28:  EFGHI    &lt;NA&gt;         NA
# 29:  EFGHI    &lt;NA&gt;         NA
# 30:  EFGHI    &lt;NA&gt;         NA
# 31:  EFGHI    &lt;NA&gt;         NA
# 32:  EFGHI    &lt;NA&gt;         NA
# 33:  EFGHI    &lt;NA&gt;         NA
# 34:  EFGHI    &lt;NA&gt;         NA
# 35:  EFGHI    &lt;NA&gt;         NA
# 36:  EFGHI    &lt;NA&gt;         NA
# 37:  EFGHI    &lt;NA&gt;         NA
# 38:  EFGHI    &lt;NA&gt;         NA
# 39:  EFGHI    &lt;NA&gt;         NA
# 40:  EFGHI    &lt;NA&gt;         NA
# 41:  EFGHI    &lt;NA&gt;         NA
# 42:  EFGHI    &lt;NA&gt;         NA
# 43:  EFIGH    &lt;NA&gt;         NA
# 44:  EFGHJ   EFGHI    0.80000

通过集体智慧和协作来改善编程学习和解决问题的方式。致力于成为全球开发者共同参与的知识库，让每个人都能够通过互相帮助和分享经验来进步。

问题

答案1