How to filter an R dataset by multiple partial-match strings, similar to the SQL % wildcard?

Question

I have a dataset with a field of interest and a list of strings (several hundred of them).

What I want to do is, for each row of the data, check whether the field contains any of the partial strings.

Essentially, I want to replicate the SQL % wildcard. So, if for example a value is "Game123" and one of my strings is "Ga" I want that to be a match. (But I don't want "OGame" to match "Ga").

I'm hoping to write some statement like this:

df %>%
  filter(My_Field contains any one of List_Of_Strings)

How do I fill in that filter statement?

I tried to use the %in% operator but couldn't make it work. I know how to use substrings to check against a single string, but I have a long list of them and need to check all of them.

This post is similar to what I'm trying to do: https://stackoverflow.com/questions/46215672/r-filter-rows-based-on-multiple-partial-strings-applied-to-multiple-columns. However, my list of substrings has more than 400 entries, so I can't write them all out manually in a grepl statement (I think?).

Answer 1

Score: 1

I guess the problem you're facing is this:

You have a list of what could be called key words (what you call "a list of strings") and a vector/column with text (what you call "a field of interest") and your goal is to filter the vector/column on whether or not any of the key words is present. If that's correct the solution might be this:

Data:

a. List of key words:

keys <- c("how", "why", "what")

b. Dataframe with a vector/column of text:

df <- data.frame(
  text = c("Hi there", "How are you?", "I'm fine.", "So how's work?", "Ah kinda stressful.", "Why?", "Well you know")
)

Solution:

To filter df on keys in text, you need to convert keys into a regex alternation pattern (by collapsing the strings with |). Depending on your keys, it may be useful or even necessary to also include word boundary markers \\b (in case the keys need to match as whole words rather than inside other words). Finally, if letter case may be an issue, you can add the case-insensitive flag (?i):

library(dplyr)
library(stringr)

df %>%
  filter(str_detect(text, str_c("(?i)\\b(", str_c(keys, collapse = "|"), ")\\b")))
            text
1   How are you?
2 So how's work?
3           Why?
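
The question asks for prefix matching ("Ga" should match "Game123" but not "OGame"), which corresponds to SQL's LIKE 'Ga%'. A minimal base-R sketch of how the same alternation idea could be adapted for that case is shown below. The data and the helper escape_regex are hypothetical and not part of the original answer; the helper escapes common regex metacharacters so that keys such as "C++" are matched literally.

```r
# Illustrative data (hypothetical; not from the original post).
values   <- c("OGame", "Game123", "Duck", "Dugame", "Aldubame", "C++ notes")
prefixes <- c("Ga", "Du", "C++")

# Hypothetical helper: escape common regex metacharacters so that
# every prefix is treated as literal text inside the pattern.
escape_regex <- function(x) gsub("([.|()\\^{}+$*?]|\\[|\\])", "\\\\\\1", x)

# Anchor once with ^, then alternate over all escaped prefixes.
pattern <- paste0("^(", paste(escape_regex(prefixes), collapse = "|"), ")")

is_match <- grepl(pattern, values)
result   <- values[is_match]
print(result)
```

With dplyr loaded, the same pattern could presumably be used as df %>% filter(grepl(pattern, My_Field)). Because the pattern is built with paste, a list of several hundred strings needs no manual typing.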

Answer 2

Score: 0

Since there is no particular dataset or reproducible example, here is one way to implement it with two apply-style calls and a regex anchor. Remember that the regex operator ^ makes a pattern match only when the expression that follows it appears at the beginning of the string.

library(dplyr)

MyField <- c("OGame", "Game123", "Duck", "Dugame", "Aldubame")

df <- data.frame(MyField)

ListOfStrings <- c("^Ga", "^Du")  # Notice the use of ^ here

# TRUE if `entry` matches at least one of `patterns`
match_s <- function(patterns, entry) {
  any(vapply(patterns, grepl, logical(1), x = entry))
}

# One logical value per row (argument named in full instead of the partial `pat`)
df$match_string <- vapply(df$MyField, match_s, logical(1), patterns = ListOfStrings)

df %>% filter(match_string)
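
If the keys are plain prefixes rather than regular expressions, base R's startsWith() avoids regex syntax altogether. The sketch below is an alternative to the approach above, not the answer's own method; the names values, prefixes, hits and df_demo are illustrative.

```r
# Illustrative data mirroring the example above.
values   <- c("OGame", "Game123", "Duck", "Dugame", "Aldubame")
prefixes <- c("Ga", "Du")  # plain prefixes, no ^ needed

# startsWith() is vectorised over `values`, so each prefix yields one
# logical vector; Reduce() combines them element-wise with OR.
hits <- Reduce(`|`, lapply(prefixes, function(p) startsWith(values, p)))

df_demo <- data.frame(MyField = values)
print(df_demo[hits, , drop = FALSE])
```

This loops over the prefixes rather than over the rows, which should scale reasonably for a few hundred keys. Note that Reduce() returns NULL for an empty prefix list, so that case would need separate handling.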

Answer 3

Score: 0

Using dplyr (with the words and sentences vectors from stringr as example data) and grepl combined with \\b, so that each key only matches at the beginning of a word.

library(stringr)
library(dplyr)

set.seed(22)

tibble(sentences) %>% 
  rowwise() %>% 
  filter(any(sapply(words[sample(length(words), 10)], function(x) 
    grepl(paste0("\\b", x), sentences)))) %>% 
  ungroup()
# A tibble: 32 × 1
   sentences                                    
   <chr>                                        
 1 It's easy to tell the depth of a well.       
 2 Kick the ball straight and follow through.   
 3 A king ruled the state in the early days.    
 4 March the soldiers past the next hill.       
 5 The dune rose from the edge of the water.    
 6 The grass curled around the fence post.      
 7 Cats and Dogs each hate the other.           
 8 The harder he tried the less he got done.    
 9 He knew the skill of the great young actress.
10 The club rented the rink for the fifth night.
# … with 22 more rows
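
As written, the sample() call sits inside a rowwise() filter, so the keys appear to be redrawn for every row. A possible vectorised variant fixes the keys once and collapses them into a single pattern with a leading \\b. This is a sketch under that assumption; sentences_demo and keys below are hypothetical stand-ins for the stringr data.

```r
# Hypothetical stand-ins for stringr's `sentences` and a fixed key sample.
sentences_demo <- c(
  "The dune rose from the edge of the water.",
  "Kick the ball straight and follow through.",
  "A king ruled the state in the early days.",
  "Stop ducking the question."
)
keys <- c("wat", "king")

# A leading \\b lets each key match only at the start of a word,
# anywhere in the sentence ("water" matches "wat"; "ducking" does not match "king").
pattern <- paste0("\\b(", paste(keys, collapse = "|"), ")")

hits <- grepl(pattern, sentences_demo)
print(sentences_demo[hits])
```

A single grepl() call over the whole column avoids the per-row sapply() and should be noticeably faster on larger data.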

huangapple
  • Posted on February 9, 2023, 00:27:12
  • When reposting, please keep this link: https://go.coder-hub.com/75388817.html