Filter a dataframe based on a selection of string values found across multiple columns

Question

I have a huge database on replanting projects using different species of trees, and I want to create a new database containing only the species I am interested in. I have ~70 words (i.e. species) that I want to select from the dataframe, across 3 different columns. I'm trying to use the 'grepl' function, but I'm lost on how to apply the same selection of keywords to multiple columns. The words/species can occur in conjunction with other species not targeted by my 70 words; I'm not sure whether that is the issue.

Essentially, I am trying to build code that finds any instance of the 70 words across the dataset, and selects them (or alternatively removes any row that does not include any of those 70), in order to avoid using command-f for 70+ words across a grand total of 16 datasets with thousands of rows.

Any help is much appreciated.

First, I tried filtering the dataset with the 'grepl' function on the first column, called 'species', for the ~70 words; however, it printed rows that do not include any of the 70 words/species.
Below is the dput() output for a sample of the data, followed by the code I ran:

> dput(head(NCR[,c("REGION", "COMPONENT","SPECIES")]))
structure(list(REGION = c("NCR", "NCR", "NCR", "NCR", "NCR", 
"NCR"), COMPONENT = c("Urban", "Urban", "Urban", "Urban", "Urban", 
"Urban"), SPECIES = c("Narra", "Banaba, Caballero, Ilang ilang, Molave, Yellow alder,Bougainvilla,",
"Bignay, Camachile, Nangka, Sampaloc, Santol,Narra,kalumpit,langka,lipote,guyabano,palawan cherry,banaba,mahogany,Golden\nshower,Mangqa,Bayabas,bignay,molave",
"Sansevieria, Spider lily, Yellow morado, Zigzag, Sansevieria, Spider lily, Yellow morado, Zigzag\nSansevieria, Spider lily, Yellow morado, Zigzag",
"Banaba, Caballero, Ilang ilang, Narra, Tuai,", "Acacia, Acapulco, Antipolo, Bagras, Balete, Bougainvilla, Dao, Fire tree, Golden shower, Ipil, Kalumpit, Kamagong, Lipote, Manila palm, Molave, Nangka, Neem tree, Supa, Tuai, Yakal,mabolo,tabebuia,langka,bitaog,narracamachile,banaba,ilang\nilang,guyabano"
)), row.names = c(NA, -6L), class = c("tbl_df", "tbl", "data.frame"
))
key_terms <- c('mangrove','magrove','avicennia','bungalon','api-api','piapi','piape','miapi','myapi','miape','Rhizophora','bakau','Bakauan', 'bakaw','bakhaw','bacau','bacaw','Sonneratia','pagatpat','pedada','Nypa','nipa','nypa','sasa','Bruguiera','pototan','busain','langarai','Camptostemon','gapas','Ceriops','baras','tungog','tangal','Excoecaria','lipata','buta','Heritiera','dungon','Aegiceras','saging','Lumnitziera','tubao','culasi','kulasi','Osbornia','tawalis','bunot','Pemphis','bantigi','Scyphiphora','nilad','Xylocarpus','tabigi','tabige','piagao','piag-ao','tubo tubo','tubo-tubo','saging-saging','moluccensis','granatum','hydrophyllaceae','adicula','octodonta','corniculatum','littoralis','agallocha','tagal','decandra','philippinensis','parviflora','fruticans','caseolaris','ovata','alba' )
new_NCR <- filter(NCR, grepl(paste(key_terms, collapse='|'), SPECIES))
new_NCR

Answer 1

Score: 0

You should be able to use dplyr::if_any() within your dplyr::filter() call here.

None of the values in key_terms appear in your sample data, so 0 rows were returned. I tweaked key_terms to include "Narra", which is found in a few rows.

library(dplyr)

key_terms <- c('mangrove', 'alba', 'Narra')

# Keep rows where any column from REGION through SPECIES matches a key term
filter(NCR, if_any(REGION:SPECIES,
                   ~grepl(paste(key_terms, collapse='|'), .x)))

Output:

#1 NCR    Urban     "Narra"
#2 NCR    Urban     "Bignay, Camachile, Nangka, Sampaloc, Santol,Narra,kalumpit,langka,lipote,guyabano,palawan cherry,banaba,mahogany,Golden\nshower,Mangqa,Bayabas,big…
#3 NCR    Urban     "Banaba, Caballero, Ilang ilang, Narra, Tuai,"
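
One possible reason for the unexpected rows mentioned in the question is that grepl() matches substrings and is case-sensitive by default, so a short term such as 'alba' also matches inside longer words. The following is only a sketch of one way to tighten the match and reuse it across several datasets. The helper name filter_by_terms, the demo tibble, and the dataset list are all made up for illustration, and the sketch assumes the terms contain no regex metacharacters.

```r
library(dplyr)

# Hypothetical helper (not part of the original answer): keep rows where any
# of the chosen columns contains one of the terms as a whole word.
# Assumes the terms contain no regex metacharacters such as "." or "(".
filter_by_terms <- function(df, cols, terms) {
  # \\b requires a word boundary on both sides of the matched term
  pattern <- paste0("\\b(", paste(terms, collapse = "|"), ")\\b")
  filter(df, if_any(all_of(cols),
                    ~ grepl(pattern, .x, ignore.case = TRUE, perl = TRUE)))
}

# Invented demo data; "Balbas pusa" contains the substring "alba"
demo <- tibble(
  REGION    = c("NCR", "NCR", "NCR"),
  COMPONENT = c("Urban", "Mangrove", "Urban"),
  SPECIES   = c("Narra, Balbas pusa", "Molave", "bakauan, Nipa")
)
terms <- c("mangrove", "nipa", "alba")

# Rows 2 and 3 are kept; row 1 is dropped because "alba" is not a whole word there
result <- filter_by_terms(demo, c("COMPONENT", "SPECIES"), terms)

# Apply the same filter to several datasets held in a named list
datasets <- list(NCR = demo, other_region = demo)
filtered <- lapply(datasets, filter_by_terms,
                   cols = c("COMPONENT", "SPECIES"), terms = terms)
```

Whole-word matching is a trade-off: it avoids accidental substring hits, but it would also skip fused entries such as "narracamachile" in the sample data, so whether to keep the \\b anchors depends on how clean the species columns are.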

huangapple
  • Posted on June 29, 2023, 00:44:08
  • Link to this post: https://go.coder-hub.com/76575199.html