How to select the columns of a data frame based on a vector of strings, requiring an exact match?

Question

I have a data frame with the following column names:

    NewYork_10
    NewYork_20
    NewYork3_10
    NewYork3_20
    NewYork4_10
    NewYork4_20
    HongKong_10
    HongKong_20
    SanFrancisco_10
    SanFrancisco_20

And I have a vector:

    list_cities <- c("NewYork", "SanFrancisco")

I want a script that creates a new data frame containing only the columns whose text before the underscore exactly matches one of the strings in the vector.
For the example above, the new data frame would contain the following columns:

    NewYork_10
    NewYork_20
    SanFrancisco_10
    SanFrancisco_20

I made several attempts with regular-expression matching:

    dplyr::select(matches(list_cities))

    dplyr::select(matches(paste0(list_cities), "_"))

I even tried anchors with the vector, which I'm not sure is possible:

    dplyr::select(matches(paste0("^", list_cities, "_.*")))

But in every case it captures all the columns that start with one of the given substrings (for example, NewYork3_10 as well as NewYork_10).

Answer 1

Score: 1

You can try:

    df[grep("^(NewYork|SanFrancisco)_", names(df))]
    # Alternative that builds the pattern from name_list:
    # df[grep(paste0("^(", paste0(name_list, collapse = "|"), ")_"), names(df))]
    #  NewYork_10 NewYork_20 SanFrancisco_10 SanFrancisco_20
    #1          1          1               1               1

or using dplyr::select:

    library(tidyverse)
    df %>% select(matches("^(NewYork|SanFrancisco)_"))
    #  NewYork_10 NewYork_20 SanFrancisco_10 SanFrancisco_20
    #1          1          1               1               1

Here ^ anchors the match at the start of the string, and (NewYork|SanFrancisco) matches either NewYork or SanFrancisco, which must be followed immediately by _. Because the underscore has to come right after the city name, names such as NewYork3_10 are excluded.
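As a rough illustration of how the anchored pattern behaves, the sketch below builds it from the vector and applies it to a plain character vector with grepl(). The names col_names, pattern, keep, and selected are hypothetical, introduced only for this example; col_names stands in for names(df).

```r
# Hypothetical stand-in for names(df), using the column names from the question
col_names <- c("NewYork_10", "NewYork_20", "NewYork3_10", "NewYork3_20",
               "NewYork4_10", "NewYork4_20", "HongKong_10", "HongKong_20",
               "SanFrancisco_10", "SanFrancisco_20")
name_list <- c("NewYork", "SanFrancisco")

# Build "^(NewYork|SanFrancisco)_" from the vector instead of typing it by hand
pattern <- paste0("^(", paste(name_list, collapse = "|"), ")_")

# grepl() returns one TRUE/FALSE per column name
keep <- grepl(pattern, col_names)

# Only names where the underscore directly follows a listed city survive
selected <- col_names[keep]
print(selected)
```

One caveat under this approach: if the strings in name_list could contain regex metacharacters such as . or +, they would need escaping before being pasted into the pattern.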

Or using startsWith:

    df[Reduce(`|`, lapply(paste0(name_list, "_"), startsWith, x = names(df)))]
    #  NewYork_10 NewYork_20 SanFrancisco_10 SanFrancisco_20
    #1          1          1               1               1
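The startsWith one-liner packs three steps into a single expression. A minimal sketch that unpacks them, assuming a plain character vector col_names in place of names(df) (the intermediate names prefixes, hits, keep, and selected are hypothetical):

```r
# Hypothetical stand-in for names(df)
col_names <- c("NewYork_10", "NewYork_20", "NewYork3_10", "NewYork3_20",
               "NewYork4_10", "NewYork4_20", "HongKong_10", "HongKong_20",
               "SanFrancisco_10", "SanFrancisco_20")
name_list <- c("NewYork", "SanFrancisco")

# Step 1: append the underscore so "NewYork_" cannot match "NewYork3_10"
prefixes <- paste0(name_list, "_")

# Step 2: for each prefix, a logical vector marking names that start with it
hits <- lapply(prefixes, function(p) startsWith(col_names, p))

# Step 3: combine the per-prefix vectors with element-wise OR
keep <- Reduce(`|`, hits)

selected <- col_names[keep]
print(selected)
```

Since startsWith() compares literal strings rather than regular expressions, no escaping of special characters is needed with this variant.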

Data (taken from @benson23):

    df <- data.frame(NewYork_10 = 1,
                     NewYork_20 = 1,
                     NewYork3_10 = 1,
                     NewYork3_20 = 1,
                     NewYork4_10 = 1,
                     NewYork4_20 = 1,
                     HongKong_10 = 1,
                     HongKong_20 = 1,
                     SanFrancisco_10 = 1,
                     SanFrancisco_20 = 1)
    name_list <- c("NewYork", "SanFrancisco")
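With sample data of this shape, one more variant is possible. It is not part of the original answer and is offered only as a sketch: extract the text before the first underscore and compare it to the vector with %in%, which makes the exact-match requirement explicit. It assumes every column name has the form prefix_suffix; the names prefix and result are hypothetical.

```r
# Sample data mirroring the column names in the question
df <- data.frame(NewYork_10 = 1, NewYork_20 = 1,
                 NewYork3_10 = 1, NewYork3_20 = 1,
                 NewYork4_10 = 1, NewYork4_20 = 1,
                 HongKong_10 = 1, HongKong_20 = 1,
                 SanFrancisco_10 = 1, SanFrancisco_20 = 1)
name_list <- c("NewYork", "SanFrancisco")

# Strip the first underscore and everything after it to get each prefix
prefix <- sub("_.*$", "", names(df))

# Keep the columns whose prefix is exactly one of the requested names
result <- df[prefix %in% name_list]
print(names(result))
```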

Answer 2

Score: 1

We can also use matches, provided the alternation is grouped and anchored so that the underscore must follow either city name:

    df %>%
      select(matches("^(NewYork|SanFrancisco)_.*"))
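To see why the grouping and the ^ anchor matter, the following sketch compares an ungrouped alternation with the anchored one using base R's grepl(). The vector col_names is a hypothetical stand-in for names(df), and loose and strict are names introduced only for this example.

```r
# Hypothetical stand-in for names(df)
col_names <- c("NewYork_10", "NewYork_20", "NewYork3_10", "NewYork3_20",
               "NewYork4_10", "NewYork4_20", "HongKong_10", "HongKong_20",
               "SanFrancisco_10", "SanFrancisco_20")

# Ungrouped: reads as "NewYork anywhere" OR "SanFrancisco_ followed by anything",
# so the underscore constrains only the SanFrancisco branch
loose <- grepl("(NewYork)|(SanFrancisco)_.*", col_names)

# Grouped and anchored: a listed city at the start, immediately followed by "_"
strict <- grepl("^(NewYork|SanFrancisco)_", col_names)

print(col_names[loose])   # also picks up the NewYork3_* and NewYork4_* names
print(col_names[strict])  # only the NewYork_* and SanFrancisco_* names
```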

huangapple
  • Posted on March 7, 2023 at 22:27:47
  • When reposting, please keep the link to this article: https://go.coder-hub.com/75663262.html