正则表达式搜索跨越两个表格

huangapple go评论97阅读模式
英文:

Regex search across two tables

问题

以下是已翻译的代码部分:

  1. 我有具有name列和class列的以下数据框:
  2. df=data.frame(name=c("name1","name2","name3","name4"), class=c("classA","classA","classB","classC"))
  3. 我有以下包含名称组合的数据框:
  4. df2=data.frame(names=c("name1;name3","name5,name6","name2 name 8","name4"))
  5. 我需要通过执行正则表达式搜索来确定df2中名称的分类,并标记它们属于哪个类别(classes),并创建一个标志(class_flag):
  6. df2=data.frame(names=c("name1;name3","name5,name6","name2 name 8","name4"),class_flag=c("Y","N","Y","Y"),classes=c("classA,classB", NA, "classA","classC"))
  7. 我现在有一个繁琐的过程要做,我将展示出来,但我在思考是否有更好的方法来做,而且我需要为每个类别都这样做,我只是在下面的一个类别中展示出来:
  8. class_A_search=paste(paste0("\\b",toupper(df$name),collapse = "|"))
  9. df2%>%mutate(class_flag=ifelse(str_detect(toupper(names),class_A_search),"Y","N"), class_A=ifelse(str_detect(toupper(names),class_A_search),"classA",NA))
  10. 这是数据集的简化版本,df2100万行,名称列不仅限于这里显示的名称。
英文:

I have the following df with a name column and class column:

  1. df=data.frame(name=c("name1","name2","name3","name4"), class=c("classA","classA","classB","classC"))

and I have the following dataframe where it contains a combinations of names:

  1. df2=data.frame(names=c("name1;name3","name5,name6","name2 name 8","name4"))

I need to determine the classification of the names in df2 by performing a regex search, and labeling which class they belong to (classes) and also create a flag (class_flag):

  1. df2=data.frame(names=c("name1;name3","name5,name6","name2 name 8","name4"),class_flag=c("Y","N","Y","Y"),classes=c("classA,classB", NA, "classA","classC"))

I have a cumbersome process to do this now which I will show but I was thinking there is a better way to do it, and I would have to do it for every class, I'm just showing it for one of the classes below:

  1. class_A_search=paste(paste0("\\b",toupper(df$name),collapse = "|"))
  2. df2%>%mutate(class_flag=ifelse(str_detect(toupper(names),class_A_search),"Y","N"), class_A=ifelse(str_detect(toupper(names),class_A_search),"classA",NA))

This is a simplified version of the dataset, and df2 has 1 million rows and the names column isn't limited to the ones shown here.

答案1

得分: 2

以下是翻译好的部分:

使用 regmatches 的基本R选项可能会有所帮助

  1. transform(
  2. transform(
  3. df2,
  4. classes = sapply(
  5. regmatches(names, gregexpr(paste0(df$name, collapse = "|"), names)),
  6. \(x) {
  7. with(
  8. df,
  9. replace(
  10. d <- paste0(unique(class[match(x, name)]), collapse = ","),
  11. !nzchar(d),
  12. NA
  13. )
  14. )
  15. }
  16. )
  17. ),
  18. class_flag = c("Y", "N")[1 + is.na(classes)]
  19. )

结果如下

  1. names classes class_flag
  2. 1 name1;name3 classA,classB Y
  3. 2 name5,name6 <NA> N
  4. 3 name2 name 8 classA Y
  5. 4 name4 classC Y

使用 dplyr,我们可以尝试

  1. df2 %>%
  2. mutate(name = regmatches(names, gregexpr("name\\d", names))) %>%
  3. unnest(name) %>%
  4. left_join(df) %>%
  5. summarise(
  6. classes = str_c(unique(class), collapse = ";"),
  7. class_flag = c("Y", "N")[1 + (sum(is.na(class)) == n())],
  8. .by = names
  9. )

结果如下

  1. # A tibble: 4 × 3
  2. names classes class_flag
  3. 1 name1;name3 classA;classB Y
  4. 2 name5,name6 NA N
  5. 3 name2 name 8 classA Y
  6. 4 name4 classC Y
英文:

Probably a base R option with regmatches could help

  1. transform(
  2. transform(
  3. df2,
  4. classes = sapply(
  5. regmatches(names, gregexpr(paste0(df$name, collapse = &quot;|&quot;), names)),
  6. \(x) {
  7. with(
  8. df,
  9. replace(
  10. d &lt;- paste0(unique(class[match(x, name)]), collapse = &quot;,&quot;),
  11. !nzchar(d),
  12. NA
  13. )
  14. )
  15. }
  16. )
  17. ),
  18. class_flag = c(&quot;Y&quot;, &quot;N&quot;)[1 + is.na(classes)]
  19. )

which gives

  1. names classes class_flag
  2. 1 name1;name3 classA,classB Y
  3. 2 name5,name6 &lt;NA&gt; N
  4. 3 name2 name 8 classA Y
  5. 4 name4 classC Y

Using dplyr, we can try

  1. df2 %&gt;%
  2. mutate(name = regmatches(names, gregexpr(&quot;name\\d&quot;, names))) %&gt;%
  3. unnest(name) %&gt;%
  4. left_join(df) %&gt;%
  5. summarise(
  6. classes = str_c(unique(class), collapse = &quot;;&quot;),
  7. class_flag = c(&quot;Y&quot;, &quot;N&quot;)[1 + (sum(is.na(class)) == n())],
  8. .by = names
  9. )

which gives

  1. # A tibble: 4 &#215; 3
  2. names classes class_flag
  3. &lt;chr&gt; &lt;chr&gt; &lt;chr&gt;
  4. 1 name1;name3 classA;classB Y
  5. 2 name5,name6 NA N
  6. 3 name2 name 8 classA Y
  7. 4 name4 classC Y

答案2

得分: 1

以下是翻译的代码部分:

  1. library(tidyverse)
  2. lookup <- deframe(df)
  3. df2 %>%
  4. mutate(class = str_extract_all(names, str_c("\\b", df$name, "\\b", collapse = "|")),
  5. class = map(class, ~ set_names(unname(lookup[.x]))),
  6. class_flag = ifelse(lengths(class), "Y", "N"),
  7. unnest_class = class) %>%
  8. unnest_wider(unnest_class)

请注意,这是R语言中的代码,用于使用tidyverse包进行数据处理。如果您需要更多帮助,请告诉我。

英文:

Here is a tidyverse option:

  1. library(tidyverse)
  2. lookup &lt;- deframe(df)
  3. df2 |&gt;
  4. mutate(class = str_extract_all(names, str_c(&quot;\\b&quot;, df$name, &quot;\\b&quot;, collapse = &quot;|&quot;)),
  5. class = map(class, ~ set_names(unname(lookup[.x]))),
  6. class_flag = ifelse(lengths(class), &quot;Y&quot;, &quot;N&quot;),
  7. unnest_class = class) |&gt;
  8. unnest_wider(unnest_class)

How it works

  1. str_extract_all returns a list-column with each list element corresponding to a row in the data frame. Each list element is a vector of extracted regular expression matches.
  2. Since class is a list-column we use map to iterate over it and look up the extracted value in the named vector created by deframe(df). set_names is used to create a named vector in each list element for un-nesting wider in step #4 (these become the column names).
  3. class is still a list-column. I left it that way because there is a lot of functionality in R in dealing with lists rather than collapsing it to a string.
  4. Create class_flag by using lengths, which returns 0 (equivalent of FALSE) if the list element is empty (e.g., character(0)).
  5. Make a copy of the list-column class named unnest_class, which we unnest into columns.

If you really need class as a character column, you can pipe this output to mutate(class = map_chr(class, str_flatten_comma)).

Output

  1. names class class_flag classA classB classC
  2. &lt;chr&gt; &lt;list&gt; &lt;chr&gt; &lt;chr&gt; &lt;chr&gt; &lt;chr&gt;
  3. 1 name1;name3 &lt;chr [2]&gt; Y classA classB NA
  4. 2 name5,name6 &lt;chr [0]&gt; N NA NA NA
  5. 3 name2 name 8 &lt;chr [1]&gt; Y classA NA NA
  6. 4 name4 &lt;chr [1]&gt; Y NA NA classC

Benchmark

Increasing the number of rows in df2 to 100,000 to get a better sense of performance.

If you do not need the columns classA, classB, etc. and remove the unnest_wider pipe, this answer is more comparable to the one posted by @ThomasIsCoding. In the case without unnest_wider I found this answer to be faster, but I kept it in the benchmark because it looks like you want those columns:

  1. set.seed(1)
  2. df2 &lt;- df2[sample(1:nrow(df2), 1E5, replace = T),, drop = F]
  3. (bench &lt;- microbenchmark::microbenchmark(
  4. stringr = {lookup &lt;- deframe(df)
  5. df2 |&gt;
  6. mutate(class = str_extract_all(names, str_c(&quot;\\b&quot;, df$name, &quot;\\b&quot;, collapse = &quot;|&quot;)),
  7. class = map(class, ~ set_names(unname(lookup[.x]))),
  8. class_flag = ifelse(lengths(class), &quot;Y&quot;, &quot;N&quot;),
  9. unnest_class = class) |&gt;
  10. unnest_wider(unnest_class)},
  11. baseR = {transform(
  12. transform(
  13. df2,
  14. classes = sapply(
  15. regmatches(names, gregexpr(paste0(df$name, collapse = &quot;|&quot;), names)),
  16. \(x) {
  17. with(
  18. df,
  19. replace(
  20. d &lt;- paste0(unique(class[match(x, name)]), collapse = &quot;,&quot;),
  21. !nzchar(d),
  22. NA
  23. )
  24. )
  25. }
  26. )
  27. ),
  28. class_flag = c(&quot;Y&quot;, &quot;N&quot;)[1 + is.na(classes)]
  29. )},
  30. times = 20L,
  31. unit = &quot;seconds&quot;
  32. ))
  33. Unit: seconds
  34. expr min lq mean median uq max neval cld
  35. stringr 3.759750 4.249305 4.461382 4.527822 4.732876 4.953810 20 a
  36. baseR 2.736081 2.835327 3.019493 3.044236 3.137328 3.427364 20 b
  37. ggplot2::autoplot(bench)

正则表达式搜索跨越两个表格

答案3

得分: 0

以下是您要翻译的内容:

Other ways include:

  1. library(tidyverse)
  2. pat &lt;- str_c(df$name, collapse = &quot;|&quot;)
  3. df2 %&gt;%
  4. mutate(classes = map_chr(str_extract_all(names, pat), toString) %&gt;%
  5. str_replace_all(names, deframe(df))%&gt;%
  6. na_if(&quot;&quot;),
  7. class_flag = c(&quot;Y&quot;, &quot;N&quot;)[1+is.na(classes)])
  8. #&gt; names classes class_flag
  9. #&gt; 1 name1;name3 name1, name3 Y
  10. #&gt; 2 name5,name6 &lt;NA&gt; N
  11. #&gt; 3 name2 name 8 name2 Y
  12. #&gt; 4 name4 classC Y
  13. df2 %&gt;%
  14. mutate(classes =na_if(str_remove_all(names, sprintf(&quot;\\b((?!%s)\\W?)+&quot;, pat)), &#39;&#39;),
  15. class_flag = c(&quot;Y&quot;, &quot;N&quot;)[1+is.na(classes)])
  16. #&gt; names classes class_flag
  17. #&gt; 1 name1;name3 name1;name3 Y
  18. #&gt; 2 name5,name6 &lt;NA&gt; N
  19. #&gt; 3 name2 name 8 name2 Y
  20. #&gt; 4 name4 name4 Y

<sup>Created on 2023-06-28 with reprex v2.0.2</sup>

英文:

Other ways include:

  1. library(tidyverse)
  2. pat &lt;- str_c(df$name, collapse = &quot;|&quot;)
  3. df2 %&gt;%
  4. mutate(classes = map_chr(str_extract_all(names, pat), toString) %&gt;%
  5. str_replace_all(names, deframe(df))%&gt;%
  6. na_if(&quot;&quot;),
  7. class_flag = c(&quot;Y&quot;, &quot;N&quot;)[1+is.na(classes)])
  8. #&gt; names classes class_flag
  9. #&gt; 1 name1;name3 name1, name3 Y
  10. #&gt; 2 name5,name6 &lt;NA&gt; N
  11. #&gt; 3 name2 name 8 name2 Y
  12. #&gt; 4 name4 classC Y
  13. df2 %&gt;%
  14. mutate(classes =na_if(str_remove_all(names, sprintf(&quot;\\b((?!%s)\\W?)+&quot;, pat)), &#39;&#39;),
  15. class_flag = c(&quot;Y&quot;, &quot;N&quot;)[1+is.na(classes)])
  16. #&gt; names classes class_flag
  17. #&gt; 1 name1;name3 name1;name3 Y
  18. #&gt; 2 name5,name6 &lt;NA&gt; N
  19. #&gt; 3 name2 name 8 name2 Y
  20. #&gt; 4 name4 name4 Y

<sup>Created on 2023-06-28 with reprex v2.0.2</sup>

huangapple
  • 本文由 发表于 2023年6月22日 01:21:14
  • 转载请务必保留本文链接:https://go.coder-hub.com/76525742.html
匿名

发表评论

匿名网友

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen:

确定