将函数应用于数据框列(dplyr)

huangapple go评论94阅读模式
英文:

Apply function to dataframe-column (dplyr)

问题

Here is the translation of the code parts you provided:

  1. 我有一个包含一个自由选择的“代码”列的数据框架(例如“abcde”,“blablabla”)。由于数据来自在线调查,有时用户会输错他们应该输入的代码。例如,他们写成了“bacde”而不是“abcde”。
  2. 我编写了一个名为`correctCodes()`的函数,它返回“纠正后”的代码。它使用Levenshtein距离来计算两个字符串之间的相似度。只有当至少有两个“参考代码”并且相似度超过某个阈值(.8)时,才会纠正代码。
  3. 函数计算两个字符串之间的相似度的部分如下:
  4. ```r
  5. # 计算两个字符串之间的相似度的函数
  6. mylevsim = function (str1, str2) {
  7. return(1 - (RecordLinkage::levenshteinDist(str1, str2)/max(nchar(str1), nchar(str2))))
  8. }

返回纠正代码的函数如下:

  1. # 返回纠正后的代码的函数
  2. correctCodes <- function(wrongCode) {
  3. wrongCode <- trimws(toupper(wrongCode))
  4. counts <- df %>%
  5. mutate(code = trimws(toupper(code))) %>%
  6. group_by(code) %>%
  7. summarise(n = n(),
  8. Var2 = code,
  9. code = NULL) %>%
  10. filter(!duplicated(code))
  11. expand.grid(trimws(toupper(df$code)), trimws(toupper(df$code)), stringsAsFactors = FALSE) %>%
  12. mutate(similarity = mylevsim(Var1, Var2)) %>%
  13. arrange(Var1, desc(similarity)) %>%
  14. left_join(counts, by = "Var2") %>%
  15. mutate(both = paste(Var1, Var2)) %>%
  16. filter(!duplicated(both),
  17. Var1 != Var2,
  18. Var1 == wrongCode,
  19. n > 2,
  20. similarity > .8) %>%
  21. filter(row_number()==1) %>%
  22. pull(Var2)
  23. }

现在,如果我像这样调用函数,它可以正常工作(函数返回“ABCDE”):

  1. correctCodes("abccde") # 返回 ABCDE
  2. correctCodes("bbcde") # 返回 ABCDE
  3. correctCodes("efghj") # 返回 EFGHI

然而,如果我想在dplyr::mutate()中使用它,它完全不起作用:

  1. df %>%
  2. mutate(code_corrected = correctCodes(code))

请注意,要使此代码在dplyr::mutate()中正常工作,您可能需要将函数correctCodes()放在您的数据框外部,以确保它可以访问数据框df

英文:

I have a dataframe containing a column with some freely chosen "codes" (e.g. "abcde", "blablabla"). Because the data come from an online survey sometimes the users mistyped the code they were supposed to enter. For example they wrote "bacde" instead of "abcde".

  1. library(RecordLinkage)
  2. library(tidyverse)
  3. # Sample data
  4. df &lt;- data.frame(code = c(rep(&quot;abcde&quot;, 20), &quot;bbcde&quot;, &quot;abccde&quot;, rep(&quot;efghi&quot;, 20), &quot;efigh&quot;, &quot;efghj&quot;))

I wrote a function correctCodes()that returns the "corrected" code. It uses levenshtein-distance to calculate the similarity between two strings. A code will only be corrected if there are at least two "reference codes" and if similarity exceeds a certain threshold (.8)

  1. # function that calculates similarity between two strings
  2. mylevsim = function (str1, str2) {
  3. return(1 - (RecordLinkage::levenshteinDist(str1, str2)/max(nchar(str1), nchar(str2))))
  4. }
  5. # Function that returns the corrected code
  6. correctCodes &lt;- function(wrongCode) {
  7. wrongCode &lt;- trimws(toupper(wrongCode))
  8. counts &lt;- df %&gt;%
  9. mutate(code = trimws(toupper(code))) %&gt;%
  10. group_by(code) %&gt;%
  11. summarise(n = n(),
  12. Var2 = code,
  13. code = NULL) %&gt;%
  14. filter(!duplicated(code))
  15. expand.grid(trimws(toupper(df$code)), trimws(toupper(df$code)), stringsAsFactors = FALSE) %&gt;%
  16. mutate(similarity = mylevsim(Var1, Var2)) %&gt;%
  17. arrange(Var1, desc(similarity)) %&gt;%
  18. left_join(counts, by = &quot;Var2&quot;) %&gt;%
  19. mutate(both = paste(Var1, Var2)) %&gt;%
  20. filter(!duplicated(both),
  21. Var1 != Var2,
  22. Var1 == wrongCode,
  23. n &gt; 2,
  24. similarity &gt; .8) %&gt;%
  25. filter(row_number()==1) %&gt;%
  26. pull(Var2)
  27. }

Now, this works fine if I call the function like this (the function returns "ABCDE"):

  1. correctCodes(&quot;abccde&quot;) # returns ABCDE
  2. correctCodes(&quot;bbcde&quot;) # returns ABCDE
  3. correctCodes(&quot;efghj&quot;) # returns EFGHI

However, if I want to use it within dplyr::mutate() it doesn't work at all:

  1. df %&gt;%
  2. mutate(code_corrected = correctCodes(code))

答案1

得分: 1

以下是您要翻译的内容:

  1. mutate将整个列作为向量传递给函数,但该函数目前只能接受单个字符串。
  1. correctCodes(df$code)
  2. #&gt; character(0)

mutate之前使用rowwise()将起作用:

  1. df %&gt;%
  2. rowwise() %&gt;%
  3. mutate(code_corrected = correctCodes(code))

但还有两个问题...

  1. 传递正确的代码不起作用,因为Var1 != Var2将其作为过滤选项删除:
  1. correctCodes(&quot;abcde&quot;)
  2. #&gt; character(0)
  1. 传递完全错误的字符串不起作用,因为这个字符串的得分为0.667,因此不满足0.8的阈值:
  1. correctCodes(&quot;efigh&quot;)
  2. #&gt; character(0)

如果构建一个检查"代码已经正确"的功能,那么这将绕过问题#2。问题#3可能需要不同的解决方案(或选择得分最高的匹配项?)

编辑 - 可能的编辑函数/调用作为解决问题的一种方式:

  1. correctCodes &lt;- function(wrongCode) {
  2. wrongCode &lt;- trimws(toupper(wrongCode))
  3. counts &lt;- df %&gt;%
  4. mutate(code = trimws(toupper(code))) %&gt;%
  5. count(code) %&gt;%
  6. filter(n &gt; 2) |&gt;
  7. rename(Var2 = code)
  8. right_codes &lt;- counts |&gt;
  9. pull(Var2)
  10. if (wrongCode %in% right_codes) return(wrongCode)
  11. expand.grid(trimws(toupper(df$code)), trimws(toupper(df$code)), stringsAsFactors = FALSE) %&gt;%
  12. mutate(similarity = mylevsim(Var1, Var2)) %&gt;%
  13. arrange(Var1, desc(similarity)) %&gt;%
  14. inner_join(counts, by = &quot;Var2&quot;) %&gt;%
  15. mutate(both = paste(Var1, Var2)) %&gt;%
  16. filter(!duplicated(both),
  17. Var1 != Var2,
  18. Var1 == wrongCode) %&gt;%
  19. arrange(desc(similarity)) %&gt;%
  20. filter(row_number()==1) %&gt;%
  21. pull(Var2)
  22. }
  23. df %&gt;%
  24. rowwise() %&gt;%
  25. mutate(code_corrected = correctCodes(code))
  26. #&gt; # A tibble: 44 &#215; 2
  27. #&gt; # Rowwise:
  28. #&gt; code code_corrected
  29. #&gt; &lt;chr&gt; &lt;chr&gt;
  30. #&gt; 1 abcde ABCDE
  31. #&gt; 2 abcde ABCDE
  32. #&gt; 3 abcde ABCDE
  33. #&gt; 4 abcde ABCDE
  34. #&gt; 5 abcde ABCDE
  35. #&gt; 6 abcde ABCDE
  36. #&gt; 7 abcde ABCDE
  37. #&gt; 8 abcde ABCDE
  38. #&gt; 9 abcde ABCDE
  39. #&gt; 10 abcde ABCDE
  40. #&gt; # ℹ 34 more rows

希望这有助于您理解这段代码和问题的解决方法。

英文:

There are at least three things stopping your code from working correctly at the moment:

  1. mutate passes the whole column as a vector to the function, which can currently only take a single string.
  1. correctCodes(df$code)
  2. #&gt; character(0)

Using rowwise() ahead of mutate would work:

  1. df %&gt;%
  2. rowwise() %&gt;%
  3. mutate(code_corrected = correctCodes(code))

But there are two further problems...

  1. Passing a correct code doesn't work as Var1 != Var2 drops that as an option in your filtering:
  1. correctCodes(&quot;abcde&quot;)
  2. #&gt; character(0)
  1. Passing a super-wrong string doesn't work, as this one gives a score of 0.667 so doesn't pass the 0.8 threshold:
  1. correctCodes(&quot;efigh&quot;)
  2. #&gt; character(0)

If you build in a check for "code already correct" then this would bypass problem #2. Problem #3 may require a different solution though (or choose the highest scoring match?)

Edit - possible edited function/call as one way of solving problem:

  1. correctCodes &lt;- function(wrongCode) {
  2. wrongCode &lt;- trimws(toupper(wrongCode))
  3. counts &lt;- df %&gt;%
  4. mutate(code = trimws(toupper(code))) %&gt;%
  5. count(code) %&gt;%
  6. filter(n &gt; 2) |&gt;
  7. rename(Var2 = code)
  8. right_codes &lt;- counts |&gt;
  9. pull(Var2)
  10. if (wrongCode %in% right_codes) return(wrongCode)
  11. expand.grid(trimws(toupper(df$code)), trimws(toupper(df$code)), stringsAsFactors = FALSE) %&gt;%
  12. mutate(similarity = mylevsim(Var1, Var2)) %&gt;%
  13. arrange(Var1, desc(similarity)) %&gt;%
  14. inner_join(counts, by = &quot;Var2&quot;) %&gt;%
  15. mutate(both = paste(Var1, Var2)) %&gt;%
  16. filter(!duplicated(both),
  17. Var1 != Var2,
  18. Var1 == wrongCode) %&gt;%
  19. arrange(desc(similarity)) %&gt;%
  20. filter(row_number()==1) %&gt;%
  21. pull(Var2)
  22. }
  23. df %&gt;%
  24. rowwise() %&gt;%
  25. mutate(code_corrected = correctCodes(code))
  26. #&gt; # A tibble: 44 &#215; 2
  27. #&gt; # Rowwise:
  28. #&gt; code code_corrected
  29. #&gt; &lt;chr&gt; &lt;chr&gt;
  30. #&gt; 1 abcde ABCDE
  31. #&gt; 2 abcde ABCDE
  32. #&gt; 3 abcde ABCDE
  33. #&gt; 4 abcde ABCDE
  34. #&gt; 5 abcde ABCDE
  35. #&gt; 6 abcde ABCDE
  36. #&gt; 7 abcde ABCDE
  37. #&gt; 8 abcde ABCDE
  38. #&gt; 9 abcde ABCDE
  39. #&gt; 10 abcde ABCDE
  40. #&gt; # ℹ 34 more rows

答案2

得分: 1

  1. 需要对这个函数进行向量化:
  2. vectorized_correctCodes <- Vectorize(correctCodes)
  3. df %>%
  4. mutate(code_corrected = vectorized_correctCodes(code))
  5. code code_corrected
  6. 1 abcde
  7. 2 abcde
  8. 3 abcde
  9. 4 abcde
  10. 5 abcde
  11. 6 abcde
  12. 7 abcde
  13. 8 abcde
  14. 9 abcde
  15. 10 abcde
  16. 11 abcde
  17. 12 abcde
  18. 13 abcde
  19. 14 abcde
  20. 15 abcde
  21. 16 abcde
  22. 17 abcde
  23. 18 abcde
  24. 19 abcde
  25. 20 abcde
  26. 21 bbcde ABCDE
  27. 22 abccde ABCDE
  28. 23 efghi
  29. 24 efghi
  30. 25 efghi
  31. 26 efghi
  32. 27 efghi
  33. 28 efghi
  34. 29 efghi
  35. 30 efghi
  36. 31 efghi
  37. 32 efghi
  38. 33 efghi
  39. 34 efghi
  40. 35 efghi
  41. 36 efghi
  42. 37 efghi
  43. 38 efghi
  44. 39 efghi
  45. 40 efghi
  46. 41 efghi
  47. 42 efghi
  48. 43 efigh
  49. 44 efghj EFGHI
英文:

You need to vectorize the function:

  1. vectorized_correctCodes &lt;- Vectorize(correctCodes)
  2. df %&gt;%
  3. mutate(code_corrected = vectorized_correctCodes(code))
  4. code code_corrected
  5. 1 abcde
  6. 2 abcde
  7. 3 abcde
  8. 4 abcde
  9. 5 abcde
  10. 6 abcde
  11. 7 abcde
  12. 8 abcde
  13. 9 abcde
  14. 10 abcde
  15. 11 abcde
  16. 12 abcde
  17. 13 abcde
  18. 14 abcde
  19. 15 abcde
  20. 16 abcde
  21. 17 abcde
  22. 18 abcde
  23. 19 abcde
  24. 20 abcde
  25. 21 bbcde ABCDE
  26. 22 abccde ABCDE
  27. 23 efghi
  28. 24 efghi
  29. 25 efghi
  30. 26 efghi
  31. 27 efghi
  32. 28 efghi
  33. 29 efghi
  34. 30 efghi
  35. 31 efghi
  36. 32 efghi
  37. 33 efghi
  38. 34 efghi
  39. 35 efghi
  40. 36 efghi
  41. 37 efghi
  42. 38 efghi
  43. 39 efghi
  44. 40 efghi
  45. 41 efghi
  46. 42 efghi
  47. 43 efigh
  48. 44 efghj EFGHI

答案3

得分: 0

I would use a different approach. As you solely depend on user input and you can assume that more people enter it correctly than wrong. I translated that in any entry > 2 we trust (feel free to change) and that means that the other ones we try to fix.

我会采用不同的方法。因为你完全依赖用户输入,可以假设输入正确的人比错误的人多。我将这个翻译为任何条目 > 2,我们信任(随时可以更改),这意味着其他条目我们尝试修复。

For that we split the data, then do a cartesian join and get all combinations we need to calculate the distance for. Then we check for 0.8 similarity and up and pick the best match if more than one was found. Then join your data back. for illustration I just added the column trusted and the similarity. So those you can fix your codes for.

为此,我们拆分数据,然后进行笛卡尔连接,并获取我们需要计算距离的所有组合。然后我们检查相似度是否大于等于0.8,并在找到多个匹配项时选择最佳匹配。然后将数据重新连接。为了说明问题,我只添加了一个名为trusted和相似度的列。所以你可以根据这些来修复你的代码。

code trusted similarity

1: ABCDE NA

2: ABCDE NA

3: ABCDE NA

4: ABCDE NA

5: ABCDE NA

6: ABCDE NA

7: ABCDE NA

8: ABCDE NA

9: ABCDE NA

10: ABCDE NA

11: ABCDE NA

12: ABCDE NA

13: ABCDE NA

14: ABCDE NA

15: ABCDE NA

16: ABCDE NA

17: ABCDE NA

18: ABCDE NA

19: ABCDE NA

20: ABCDE NA

21: BBCDE ABCDE 0.80000

22: ABCCDE ABCDE 0.83333

23: EFGHI NA

24: EFGHI NA

25: EFGHI NA

26: EFGHI NA

27: EFGHI NA

28: EFGHI NA

29: EFGHI NA

30: EFGHI NA

31: EFGHI NA

32: EFGHI NA

33: EFGHI NA

34: EFGHI NA

35: EFGHI NA

36: EFGHI NA

37: EFGHI NA

38: EFGHI NA

39: EFGHI NA

40: EFGHI NA

41: EFGHI NA

42: EFGHI NA

43: EFIGH NA

44: EFGHJ EFGHI 0.80000

英文:

I would use a different approach. As you solely depend on user input and you can assume that more people enter it correctly than wrong. I translated that in any entry > 2 we trust (feel free to change) and that means that the other ones we try to fix.

For that we split the data, then do a cartesian join and get all combinations we need to calculate the distance for. Then we check for 0.8 similarity and up and pick the best match if more than one was found. Then join your data back. for illustration I just added the column trusted and the similarity. So those you can fix your codes for.

  1. library(data.table)
  2. library(stringdist)
  3. setDT(df)
  4. # standardize your codes
  5. df[, code := trimws(toupper(code))]
  6. # split your data into trusted codes and those you want to &quot;verify / fix&quot;
  7. trusted &lt;- df[, .N, code][N &gt; 2, code]
  8. verify &lt;- df[!code %in% trusted]
  9. # do a cartesian join to get all combinations you want to calculate the distance
  10. check &lt;- verify[, .(trusted = unlist(trusted)), by = verify]
  11. # calculate similarity, I used stringdist package here
  12. check &lt;- check[, similarity := stringsim(code, trusted, &quot;lv&quot;)][similarity &gt;= 0.8][, .SD[which.max(similarity)], code]
  13. # join back, from there you can change them if you like
  14. check[df, on = c(&quot;code&quot;)]
  15. # code trusted similarity
  16. # 1: ABCDE &lt;NA&gt; NA
  17. # 2: ABCDE &lt;NA&gt; NA
  18. # 3: ABCDE &lt;NA&gt; NA
  19. # 4: ABCDE &lt;NA&gt; NA
  20. # 5: ABCDE &lt;NA&gt; NA
  21. # 6: ABCDE &lt;NA&gt; NA
  22. # 7: ABCDE &lt;NA&gt; NA
  23. # 8: ABCDE &lt;NA&gt; NA
  24. # 9: ABCDE &lt;NA&gt; NA
  25. # 10: ABCDE &lt;NA&gt; NA
  26. # 11: ABCDE &lt;NA&gt; NA
  27. # 12: ABCDE &lt;NA&gt; NA
  28. # 13: ABCDE &lt;NA&gt; NA
  29. # 14: ABCDE &lt;NA&gt; NA
  30. # 15: ABCDE &lt;NA&gt; NA
  31. # 16: ABCDE &lt;NA&gt; NA
  32. # 17: ABCDE &lt;NA&gt; NA
  33. # 18: ABCDE &lt;NA&gt; NA
  34. # 19: ABCDE &lt;NA&gt; NA
  35. # 20: ABCDE &lt;NA&gt; NA
  36. # 21: BBCDE ABCDE 0.80000
  37. # 22: ABCCDE ABCDE 0.83333
  38. # 23: EFGHI &lt;NA&gt; NA
  39. # 24: EFGHI &lt;NA&gt; NA
  40. # 25: EFGHI &lt;NA&gt; NA
  41. # 26: EFGHI &lt;NA&gt; NA
  42. # 27: EFGHI &lt;NA&gt; NA
  43. # 28: EFGHI &lt;NA&gt; NA
  44. # 29: EFGHI &lt;NA&gt; NA
  45. # 30: EFGHI &lt;NA&gt; NA
  46. # 31: EFGHI &lt;NA&gt; NA
  47. # 32: EFGHI &lt;NA&gt; NA
  48. # 33: EFGHI &lt;NA&gt; NA
  49. # 34: EFGHI &lt;NA&gt; NA
  50. # 35: EFGHI &lt;NA&gt; NA
  51. # 36: EFGHI &lt;NA&gt; NA
  52. # 37: EFGHI &lt;NA&gt; NA
  53. # 38: EFGHI &lt;NA&gt; NA
  54. # 39: EFGHI &lt;NA&gt; NA
  55. # 40: EFGHI &lt;NA&gt; NA
  56. # 41: EFGHI &lt;NA&gt; NA
  57. # 42: EFGHI &lt;NA&gt; NA
  58. # 43: EFIGH &lt;NA&gt; NA
  59. # 44: EFGHJ EFGHI 0.80000

huangapple
  • 本文由 发表于 2023年4月17日 19:05:04
  • 转载请务必保留本文链接:https://go.coder-hub.com/76034467.html
匿名

发表评论

匿名网友

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen:

确定