February 9, 2023, 00:27:12

How to filter R dataset by multiple partial match strings, similar to SQL % wildcard?

Question

I have a dataset with a field of interest and a list of strings (several hundred of them).

What I want to do is, for each row of the data, check whether the field contains any of the partial strings.

Essentially, I want to replicate the SQL % wildcard. So, if for example a value is "Game123" and one of my strings is "Ga" I want that to be a match. (But I don't want "OGame" to match "Ga").

I'm hoping to write some statement like this:

df %>%
  filter(My_Field contains any one of List_Of_Strings)

How do I fill in that filter statement?

I tried to use the %in% operator but couldn't make it work. I know how to use substrings to check against a single string, but I have a long list of them and need to check all of them.

https://stackoverflow.com/questions/46215672/r-filter-rows-based-on-multiple-partial-strings-applied-to-multiple-columns: This post is similar to what I'm trying to do, but my list of substrings is 400 plus, so I can't write it all out manually in a grepl statement (I think?)

Answer 1

Score: 1

I guess the problem you're facing is this:

You have a list of what could be called key words (what you call "a list of strings") and a vector/column with text (what you call "a field of interest") and your goal is to filter the vector/column on whether or not any of the key words is present. If that's correct the solution might be this:

Data:

a. List of key words:

keys <- c("how", "why", "what")

b. Dataframe with a vector/column of text:

df <- data.frame(
  text = c("Hi there", "How are you?", "I'm fine.", "So how's work?", "Ah kinda stressful.", "Why?", "Well you know")
)

Solution:

To filter df on keys in text you need to convert keys into a regex alternation pattern (by collapsing the strings with |). Depending on your keys it may be useful or even necessary to also include word boundary markers \\b (so that a key matches as a whole word rather than inside other words). And finally, if lower- or upper-case may be an issue, we can use the case-insensitive flag (?i):

library(dplyr)   # filter(), %>%
library(stringr) # str_detect(), str_c()

df %>%
  filter(str_detect(text, str_c("(?i)\\b(", str_c(keys, collapse = "|"), ")\\b")))
            text
1   How are you?
2 So how's work?
3           Why?
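The pattern in this answer puts \\b on both sides of the alternation, so a key must match as a whole word. The question asks for something closer to SQL's 'Ga%', where a key only has to appear at the start of the field. A minimal sketch of that variant follows; the data frame and the names My_Field and List_Of_Strings are hypothetical and modelled on the question, and it assumes that none of the keys contain regex metacharacters.

```r
library(dplyr)
library(stringr)

# Hypothetical data modelled on the question's example
df <- data.frame(
  My_Field = c("Game123", "OGame", "Duck", "gamma"),
  stringsAsFactors = FALSE
)
List_Of_Strings <- c("Ga", "Du")

# Builds "^(Ga|Du)": ^ anchors the alternation to the start of the string,
# and there is no trailing \\b, so "Ga" can be followed by anything
pattern <- str_c("^(", str_c(List_Of_Strings, collapse = "|"), ")")

result <- df %>%
  filter(str_detect(My_Field, pattern))

result
```

Because the pattern is assembled with str_c, the same code should work unchanged for a vector of several hundred keys. Matching here is case-sensitive, so "gamma" is not returned; prefixing the pattern with (?i) as in the answer would relax that.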

Answer 2

Score: 0

Since there is no particular dataset or reproducible example, I can think of a way to implement it with two apply functions and a smart use of regex. Remember that the regex operator ^ makes the expression that follows it match only at the beginning of the string.

library(dplyr)

MyField <- c("OGame", "Game123", "Duck", "Dugame", "Aldubame")

df <- data.frame(MyField)

ListOfStrings <- c("^Ga", "^Du") # Notice the use of ^ here

# TRUE if `entry` matches at least one of the regex `patterns`
match_s <- function(patterns, entry) {
  any(vapply(patterns, grepl, logical(1), x = entry))
}

# One logical per row; the argument name `patterns` is written out in full
# instead of relying on partial matching of `pat`
df$match_string <- vapply(df$MyField, match_s, logical(1),
                          patterns = ListOfStrings, USE.NAMES = FALSE)

df %>% filter(match_string)
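If the keys are plain text prefixes rather than regular expressions, the same result can be reached without any regex, which also sidesteps escaping problems when a key contains characters such as + or a dot. The following is a sketch of that alternative using base R's startsWith(), not the answerer's method; the data and the name prefixes are hypothetical.

```r
# Hypothetical data; "C++" contains regex metacharacters on purpose
MyField <- c("OGame", "Game123", "Duck", "C++ primer", "Aldubame")
df <- data.frame(MyField, stringsAsFactors = FALSE)

prefixes <- c("Ga", "Du", "C++")

# startsWith() is vectorised over the column, so each prefix yields one
# logical vector; Reduce() combines them with element-wise OR
hits <- Reduce(`|`, lapply(prefixes, function(p) startsWith(df$MyField, p)))

matched <- df[hits, , drop = FALSE]
matched
```

This loops over the prefixes rather than over the rows, so with a few hundred prefixes it makes a few hundred vectorised passes over the column.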

Answer 3

Score: 0

Using dplyr (with stringr's built-in words and sentences datasets as example data) and grepl with a leading \\b, so that each key matches only at the beginning of a word:

library(stringr)
library(dplyr)

set.seed(22)

tibble(sentences) %>% 
  rowwise() %>% 
  filter(any(sapply(words[sample(length(words), 10)], function(x) 
    grepl(paste0("\\b", x), sentences)))) %>% 
  ungroup()
# A tibble: 32 × 1
   sentences                                    
   <chr>                                        
 1 It's easy to tell the depth of a well.       
 2 Kick the ball straight and follow through.   
 3 A king ruled the state in the early days.    
 4 March the soldiers past the next hill.       
 5 The dune rose from the edge of the water.    
 6 The grass curled around the fence post.      
 7 Cats and Dogs each hate the other.           
 8 The harder he tried the less he got done.    
 9 He knew the skill of the great young actress.
10 The club rented the rink for the fifth night.
# … with 22 more rows
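In the code above, sample() sits inside a rowwise filter(), so it appears to be re-evaluated for every row and each sentence may be tested against a different random set of words. For a fixed list of keys, such as the one in the question, a sketch along the following lines draws the keys once and folds them into a single pattern. The names keys, pattern and res are illustrative, and the rows returned will generally differ from the output shown above.

```r
library(stringr)
library(dplyr)

set.seed(22)

# Draw the ten example keys once, outside the filter
keys <- words[sample(length(words), 10)]

# "\\b(key1|key2|...)": a key must begin at a word boundary
pattern <- paste0("\\b(", paste(keys, collapse = "|"), ")")

# grepl() is vectorised over the column, so rowwise() is not needed
res <- tibble(sentences) %>%
  filter(grepl(pattern, sentences))

res
```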
