June 29, 2023, 00:44:08

Filter a dataframe based on a selection of string values found across multiple columns

Question

I have a huge database of replanting projects that use different tree species, and I want to create a new database containing only the species I am interested in. I have ~70 words (i.e. species names) that I want to select from the dataframe, across 3 different columns. I'm trying to use the 'grepl' function, but I'm lost on how to apply the same selection of key words to multiple columns. The words/species can occur in conjunction with other species not targeted by my 70 words; I'm not sure whether that is the issue.

Essentially, I am trying to build code that finds any instance of the 70 words across the dataset and selects those rows (or alternatively removes any row that does not include any of the 70), in order to avoid using command-F for 70+ words across a grand total of 16 datasets with thousands of rows.

Any help is much appreciated.

First I tried filtering the dataset with the 'grepl' function on the first column, called 'species', for the ~70 words; however, it printed rows that do not include any of the 70 words/species.
Below is a sample of the data (dput output), followed by the code I ran:

> dput(head(NCR[,c("REGION", "COMPONENT","SPECIES")]))
structure(list(REGION = c("NCR", "NCR", "NCR", "NCR", "NCR", 
"NCR"), COMPONENT = c("Urban", "Urban", "Urban", "Urban", "Urban", 
"Urban"), SPECIES = c("Narra", "Banaba, Caballero, Ilang ilang, Molave, Yellow alder,Bougainvilla,", 
"Bignay, Camachile, Nangka, Sampaloc, Santol,Narra,kalumpit,langka,lipote,guyabano,palawan cherry,banaba,mahogany,Golden\nshower,Mangqa,Bayabas,bignay,molave", 
"Sansevieria, Spider lily, Yellow morado, Zigzag, Sansevieria, Spider lily, Yellow morado, Zigzag\nSansevieria, Spider lily, Yellow morado, Zigzag", 
"Banaba, Caballero, Ilang ilang, Narra, Tuai,", "Acacia, Acapulco, Antipolo, Bagras, Balete, Bougainvilla, Dao, Fire tree, Golden shower, Ipil, Kalumpit, Kamagong, Lipote, Manila palm, Molave, Nangka, Neem tree, Supa, Tuai, Yakal,mabolo,tabebuia,langka,bitaog,narracamachile,banaba,ilang\nilang,guyabano"
)), row.names = c(NA, -6L), class = c("tbl_df", "tbl", "data.frame"
))

key_terms <- c('mangrove','magrove','avicennia','bungalon','api-api','piapi','piape','miapi','myapi','miape','Rhizophora','bakau','Bakauan', 'bakaw','bakhaw','bacau','bacaw','Sonneratia','pagatpat','pedada','Nypa','nipa','nypa','sasa','Bruguiera','pototan','busain','langarai','Camptostemon','gapas','Ceriops','baras','tungog','tangal','Excoecaria','lipata','buta','Heritiera','dungon','Aegiceras','saging','Lumnitziera','tubao','culasi','kulasi','Osbornia','tawalis','bunot','Pemphis','bantigi','Scyphiphora','nilad','Xylocarpus','tabigi','tabige','piagao','piag-ao','tubo tubo','tubo-tubo','saging-saging','moluccensis','granatum','hydrophyllaceae','adicula','octodonta','corniculatum','littoralis','agallocha','tagal','decandra','philippinensis','parviflora','fruticans','caseolaris','ovata','alba' )
new_NCR <- filter(NCR, grepl(paste(key_terms, collapse='|'), SPECIES))
new_NCR

Answer 1

Score: 0

You should be able to use dplyr::if_any() within your dplyr::filter() call here.

None of the values in key_terms appear in your sample data, so 0 rows were returned. I tweaked key_terms to include "Narra", which is found in a few rows:

key_terms <- c('mangrove', 'alba', 'Narra')
filter(NCR, if_any(REGION:SPECIES, 
                   ~grepl(paste(key_terms, collapse='|'), .x)))

Output:

#1 NCR    Urban     "Narra"
#2 NCR    Urban     "Bignay, Camachile, Nangka, Sampaloc, Santol,Narra,kalumpit,langka,lipote,guyabano,palawan cherry,banaba,mahogany,Golden\nshower,Mangqa,Bayabas,big…
#3 NCR    Urban     "Banaba, Caballero, Ilang ilang, Narra, Tuai,"
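
Since the question mentions 16 datasets and key terms in mixed case (for example 'Nypa' and 'nypa'), the same if_any() idea could be wrapped in a small helper. The following is only a sketch under a few assumptions: the helper name filter_by_terms, the toy data frame, and the shortened key_terms are made up for illustration, and matching is made case-insensitive with grepl(ignore.case = TRUE), which the original answer does not do.

```r
library(dplyr)

# Hypothetical helper (not part of the original answer): keep rows where any
# of the chosen columns contains at least one key term, ignoring case.
filter_by_terms <- function(df, terms, cols) {
  # Build the alternation pattern once instead of once per column.
  pattern <- paste(terms, collapse = "|")
  filter(df, if_any(all_of(cols), ~ grepl(pattern, .x, ignore.case = TRUE)))
}

# Made-up data that only mirrors the column names used in the question.
toy <- tibble(
  REGION    = c("NCR", "NCR", "NCR"),
  COMPONENT = c("Urban", "Mangrove", "Urban"),
  SPECIES   = c("Narra", "Molave, Banaba", "Bakauan, nipa")
)

# Shortened, illustrative term list.
key_terms <- c("mangrove", "bakau", "nipa")
cols <- c("REGION", "COMPONENT", "SPECIES")

# Rows 2 and 3 are kept: "Mangrove" matches in COMPONENT, "Bakauan" in SPECIES.
result <- filter_by_terms(toy, key_terms, cols)

# The same helper can be applied to a named list of data frames.
datasets <- list(a = toy, b = toy[1, ])
filtered <- lapply(datasets, filter_by_terms, terms = key_terms, cols = cols)
```

Note that grepl() matches substrings, so a short term such as 'buta' or 'alba' would also match inside a longer name. If that turns out to be a problem, one option is to wrap the pattern in word boundaries, e.g. paste0("\\b(", pattern, ")\\b").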
