How to select the columns of a data frame based on a vector of strings, using exact matches?

Question

I have a data frame with the following column names:

NewYork_10
NewYork_20
NewYork3_10
NewYork3_20
NewYork4_10
NewYork4_20
HongKong_10
HongKong_20
SanFrancisco_10
SanFrancisco_20


And I have a vector:

list_cities <- c("NewYork", "SanFrancisco")


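I want a script that creates a new data frame containing only the columns whose text before the underscore exactly matches one of the strings in the vector.
For the example above, the new data frame would contain the following columns: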
NewYork_10
NewYork_20
SanFrancisco_10
SanFrancisco_20

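I made several attempts with regular-expression matching:

`dplyr::select(matches(list_cities))`

`dplyr::select(matches(paste0(list_cities), "_"))`

I even tried adding anchors to each element of the vector, although I'm not sure that is possible:

`dplyr::select(matches(paste0("^", list_cities, "_.*")))`

But in every case it selects all the columns that start with the given substring (for example NewYork3_10 as well as NewYork_10).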

Answer 1

Score: 1

You can try:

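df[grep("^(NewYork|SanFrancisco)_", names(df))]
# Alternative that builds the pattern from name_list:
# df[grep(paste0("^(", paste0(name_list, collapse = "|"), ")_"), names(df))]
#   NewYork_10 NewYork_20 SanFrancisco_10 SanFrancisco_20
# 1          1          1               1               1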

Or using dplyr::select:

library(tidyverse)
df %>% select(matches("^(NewYork|SanFrancisco)_"))

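Here ^ anchors the match to the start of the string, and (NewYork|SanFrancisco) matches either NewYork or SanFrancisco, which must then be followed immediately by _. That trailing underscore is what excludes names such as NewYork3_10.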
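
When the city names live in a vector, the same anchored pattern can be assembled programmatically instead of typed by hand. The following is a minimal base R sketch. It works on a plain character vector `cols` (a stand-in for `names(df)`, introduced here for illustration) and assumes the names contain no regex metacharacters such as `.` or `(`.

```r
# Column names from the question, as a plain character vector
cols <- c("NewYork_10", "NewYork_20", "NewYork3_10", "NewYork3_20",
          "NewYork4_10", "NewYork4_20", "HongKong_10", "HongKong_20",
          "SanFrancisco_10", "SanFrancisco_20")
name_list <- c("NewYork", "SanFrancisco")

# Collapse the vector into one alternation: "^(NewYork|SanFrancisco)_"
pattern <- paste0("^(", paste(name_list, collapse = "|"), ")_")

# value = TRUE returns the matching names rather than their positions
selected <- grep(pattern, cols, value = TRUE)
print(selected)
```

With a real data frame, `df[grep(pattern, names(df))]` would select the corresponding columns.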

Or using startsWith, which avoids regular expressions entirely:

df[Reduce(`|`, lapply(paste0(name_list, "_"), startsWith, x=names(df)))]

Data (taken from @benson23):

df <- data.frame(NewYork_10 = 1,
           NewYork_20 = 1,
           NewYork3_10 = 1,
           NewYork3_20 = 1,
           NewYork4_10 = 1,
           NewYork4_20 = 1,
           HongKong_10 = 1,
           HongKong_20 = 1,
           SanFrancisco_10 = 1,
           SanFrancisco_20 = 1)

name_list <- c("NewYork", "SanFrancisco")
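
Another way to get an exact match on the prefix, not taken from the answers above, is to strip the suffix from each column name and compare what remains with `%in%`. This is a sketch under the assumption that the prefix is everything before the first underscore; the data frame below is a shortened version of the sample data.

```r
df <- data.frame(NewYork_10 = 1, NewYork_20 = 1,
                 NewYork3_10 = 1, NewYork3_20 = 1,
                 HongKong_10 = 1,
                 SanFrancisco_10 = 1, SanFrancisco_20 = 1)
name_list <- c("NewYork", "SanFrancisco")

# Drop everything from the first underscore onward to get each prefix
prefix <- sub("_.*$", "", names(df))

# Keep only the columns whose prefix is exactly one of the requested names
result <- df[prefix %in% name_list]
print(names(result))
```

Because the comparison is a literal string equality, this variant does not depend on how the names would behave inside a regular expression.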

Answer 2

Score: 1

We can also use matches:

df %>%
    select(matches("^(NewYork|SanFrancisco)_"))
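
A note on grouping: in a regular expression, `|` has the lowest precedence, so a pattern written as `(NewYork)|(SanFrancisco)_.*` is read as "NewYork anywhere, or SanFrancisco followed by an underscore". The short base R sketch below compares that form with the grouped, anchored form on a few of the question's column names (the vector `cols` is introduced here for illustration).

```r
cols <- c("NewYork_10", "NewYork3_10", "HongKong_10", "SanFrancisco_10")

# Ungrouped: parsed as "NewYork" (anywhere) OR "SanFrancisco_.*"
loose <- grep("(NewYork)|(SanFrancisco)_.*", cols, value = TRUE)

# Grouped and anchored: the "_" must follow either alternative
strict <- grep("^(NewYork|SanFrancisco)_", cols, value = TRUE)

print(loose)   # includes NewYork3_10
print(strict)  # excludes NewYork3_10
```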

huangapple
  • Posted on March 7, 2023, 22:27:47
  • When reposting, please keep the link to this post: https://go.coder-hub.com/75663262.html