英文:
Regex search across two tables
问题
以下是已翻译的代码部分:
我有具有name列和class列的以下数据框:
df=data.frame(name=c("name1","name2","name3","name4"), class=c("classA","classA","classB","classC"))
我有以下包含名称组合的数据框:
df2=data.frame(names=c("name1;name3","name5,name6","name2 name 8","name4"))
我需要通过执行正则表达式搜索来确定df2中名称的分类,并标记它们属于哪个类别(classes),并创建一个标志(class_flag):
df2=data.frame(names=c("name1;name3","name5,name6","name2 name 8","name4"),class_flag=c("Y","N","Y","Y"),classes=c("classA,classB", NA, "classA","classC"))
我现在有一个繁琐的过程要做,我将展示出来,但我在思考是否有更好的方法来做,而且我需要为每个类别都这样做,我只是在下面的一个类别中展示出来:
class_A_search=paste(paste0("\\b",toupper(df$name),collapse = "|"))
df2%>%mutate(class_flag=ifelse(str_detect(toupper(names),class_A_search),"Y","N"), class_A=ifelse(str_detect(toupper(names),class_A_search),"classA",NA))
这是数据集的简化版本,df2有100万行,名称列不仅限于这里显示的名称。
英文:
I have the following df with a name column and class column:
df=data.frame(name=c("name1","name2","name3","name4"), class=c("classA","classA","classB","classC"))
and I have the following dataframe where it contains a combinations of names:
df2=data.frame(names=c("name1;name3","name5,name6","name2 name 8","name4"))
I need to determine the classification of the names in df2 by performing a regex search, and labeling which class they belong to (classes) and also create a flag (class_flag):
df2=data.frame(names=c("name1;name3","name5,name6","name2 name 8","name4"),class_flag=c("Y","N","Y","Y"),classes=c("classA,classB", NA, "classA","classC"))
I have a cumbersome process to do this now which I will show but I was thinking there is a better way to do it, and I would have to do it for every class, I'm just showing it for one of the classes below:
class_A_search=paste(paste0("\\b",toupper(df$name),collapse = "|"))
df2%>%mutate(class_flag=ifelse(str_detect(toupper(names),class_A_search),"Y","N"), class_A=ifelse(str_detect(toupper(names),class_A_search),"classA",NA))
This is a simplified version of the dataset, and df2 has 1 million rows and the names column isn't limited to the ones shown here.
答案1
得分: 2
以下是翻译好的部分:
使用 regmatches
的基本R选项可能会有所帮助
transform(
transform(
df2,
classes = sapply(
regmatches(names, gregexpr(paste0(df$name, collapse = "|"), names)),
\(x) {
with(
df,
replace(
d <- paste0(unique(class[match(x, name)]), collapse = ","),
!nzchar(d),
NA
)
)
}
)
),
class_flag = c("Y", "N")[1 + is.na(classes)]
)
结果如下
names classes class_flag
1 name1;name3 classA,classB Y
2 name5,name6 <NA> N
3 name2 name 8 classA Y
4 name4 classC Y
使用 dplyr
,我们可以尝试
df2 %>%
mutate(name = regmatches(names, gregexpr("name\\d", names))) %>%
unnest(name) %>%
left_join(df) %>%
summarise(
classes = str_c(unique(class), collapse = ";"),
class_flag = c("Y", "N")[1 + (sum(is.na(class)) == n())],
.by = names
)
结果如下
# A tibble: 4 × 3
names classes class_flag
1 name1;name3 classA;classB Y
2 name5,name6 NA N
3 name2 name 8 classA Y
4 name4 classC Y
英文:
Probably a base R option with regmatches
could help
transform(
transform(
df2,
classes = sapply(
regmatches(names, gregexpr(paste0(df$name, collapse = "|"), names)),
\(x) {
with(
df,
replace(
d <- paste0(unique(class[match(x, name)]), collapse = ","),
!nzchar(d),
NA
)
)
}
)
),
class_flag = c("Y", "N")[1 + is.na(classes)]
)
which gives
names classes class_flag
1 name1;name3 classA,classB Y
2 name5,name6 <NA> N
3 name2 name 8 classA Y
4 name4 classC Y
Using dplyr
, we can try
df2 %>%
mutate(name = regmatches(names, gregexpr("name\\d", names))) %>%
unnest(name) %>%
left_join(df) %>%
summarise(
classes = str_c(unique(class), collapse = ";"),
class_flag = c("Y", "N")[1 + (sum(is.na(class)) == n())],
.by = names
)
which gives
# A tibble: 4 × 3
names classes class_flag
<chr> <chr> <chr>
1 name1;name3 classA;classB Y
2 name5,name6 NA N
3 name2 name 8 classA Y
4 name4 classC Y
答案2
得分: 1
以下是翻译的代码部分:
library(tidyverse)
lookup <- deframe(df)
df2 %>%
mutate(class = str_extract_all(names, str_c("\\b", df$name, "\\b", collapse = "|")),
class = map(class, ~ set_names(unname(lookup[.x]))),
class_flag = ifelse(lengths(class), "Y", "N"),
unnest_class = class) %>%
unnest_wider(unnest_class)
请注意,这是R语言中的代码,用于使用tidyverse包进行数据处理。如果您需要更多帮助,请告诉我。
英文:
Here is a tidyverse
option:
library(tidyverse)
lookup <- deframe(df)
df2 |>
mutate(class = str_extract_all(names, str_c("\\b", df$name, "\\b", collapse = "|")),
class = map(class, ~ set_names(unname(lookup[.x]))),
class_flag = ifelse(lengths(class), "Y", "N"),
unnest_class = class) |>
unnest_wider(unnest_class)
How it works
str_extract_all
returns a list-column with each list element corresponding to a row in the data frame. Each list element is a vector of extracted regular expression matches.- Since
class
is a list-column we usemap
to iterate over it and look up the extracted value in the named vector created bydeframe(df)
.set_names
is used to create a named vector in each list element for un-nesting wider in step #4 (these become the column names). class
is still a list-column. I left it that way because there is a lot of functionality in R in dealing with lists rather than collapsing it to a string.- Create
class_flag
by usinglengths
, which returns 0 (equivalent ofFALSE
) if the list element is empty (e.g.,character(0)
). - Make a copy of the list-column
class
namedunnest_class
, which we unnest into columns.
If you really need class
as a character column, you can pipe this output to mutate(class = map_chr(class, str_flatten_comma))
.
Output
names class class_flag classA classB classC
<chr> <list> <chr> <chr> <chr> <chr>
1 name1;name3 <chr [2]> Y classA classB NA
2 name5,name6 <chr [0]> N NA NA NA
3 name2 name 8 <chr [1]> Y classA NA NA
4 name4 <chr [1]> Y NA NA classC
Benchmark
Increasing the number of rows in df2
to 100,000 to get a better sense of performance.
If you do not need the columns classA
, classB
, etc. and remove the unnest_wider
pipe, this answer is more comparable to the one posted by @ThomasIsCoding. In the case without unnest_wider
I found this answer to be faster, but I kept it in the benchmark because it looks like you want those columns:
set.seed(1)
df2 <- df2[sample(1:nrow(df2), 1E5, replace = T),, drop = F]
(bench <- microbenchmark::microbenchmark(
stringr = {lookup <- deframe(df)
df2 |>
mutate(class = str_extract_all(names, str_c("\\b", df$name, "\\b", collapse = "|")),
class = map(class, ~ set_names(unname(lookup[.x]))),
class_flag = ifelse(lengths(class), "Y", "N"),
unnest_class = class) |>
unnest_wider(unnest_class)},
baseR = {transform(
transform(
df2,
classes = sapply(
regmatches(names, gregexpr(paste0(df$name, collapse = "|"), names)),
\(x) {
with(
df,
replace(
d <- paste0(unique(class[match(x, name)]), collapse = ","),
!nzchar(d),
NA
)
)
}
)
),
class_flag = c("Y", "N")[1 + is.na(classes)]
)},
times = 20L,
unit = "seconds"
))
Unit: seconds
expr min lq mean median uq max neval cld
stringr 3.759750 4.249305 4.461382 4.527822 4.732876 4.953810 20 a
baseR 2.736081 2.835327 3.019493 3.044236 3.137328 3.427364 20 b
ggplot2::autoplot(bench)
答案3
得分: 0
以下是您要翻译的内容:
Other ways include:
library(tidyverse)
pat <- str_c(df$name, collapse = "|")
df2 %>%
mutate(classes = map_chr(str_extract_all(names, pat), toString) %>%
str_replace_all(names, deframe(df))%>%
na_if(""),
class_flag = c("Y", "N")[1+is.na(classes)])
#> names classes class_flag
#> 1 name1;name3 name1, name3 Y
#> 2 name5,name6 <NA> N
#> 3 name2 name 8 name2 Y
#> 4 name4 classC Y
df2 %>%
mutate(classes =na_if(str_remove_all(names, sprintf("\\b((?!%s)\\W?)+", pat)), ''),
class_flag = c("Y", "N")[1+is.na(classes)])
#> names classes class_flag
#> 1 name1;name3 name1;name3 Y
#> 2 name5,name6 <NA> N
#> 3 name2 name 8 name2 Y
#> 4 name4 name4 Y
<sup>Created on 2023-06-28 with reprex v2.0.2</sup>
英文:
Other ways include:
library(tidyverse)
pat <- str_c(df$name, collapse = "|")
df2 %>%
mutate(classes = map_chr(str_extract_all(names, pat), toString) %>%
str_replace_all(names, deframe(df))%>%
na_if(""),
class_flag = c("Y", "N")[1+is.na(classes)])
#> names classes class_flag
#> 1 name1;name3 name1, name3 Y
#> 2 name5,name6 <NA> N
#> 3 name2 name 8 name2 Y
#> 4 name4 classC Y
df2 %>%
mutate(classes =na_if(str_remove_all(names, sprintf("\\b((?!%s)\\W?)+", pat)), ''),
class_flag = c("Y", "N")[1+is.na(classes)])
#> names classes class_flag
#> 1 name1;name3 name1;name3 Y
#> 2 name5,name6 <NA> N
#> 3 name2 name 8 name2 Y
#> 4 name4 name4 Y
<sup>Created on 2023-06-28 with reprex v2.0.2</sup>
通过集体智慧和协作来改善编程学习和解决问题的方式。致力于成为全球开发者共同参与的知识库,让每个人都能够通过互相帮助和分享经验来进步。
评论