Best way to exclude any of multiple characters in a string column when filtering a dataframe in R
Question
library(dplyr)

# Your input dataframe
df <- data.frame(Name = c("101a,102a", "101b,102b,103b", "103c", "102d,103d", "101,103"),
                 Value = c("2", "3", "4", "5", "6"))

# The character you want to filter on
char <- "a"

# Create a vector of characters to filter out
chars_to_filter <- setdiff(c("a", "b", "c", "d"), char)

# Filter the dataframe
df_filtered <- df %>%
  filter(!grepl(paste(chars_to_filter, collapse = "|"), Name))

df_filtered
This code filters the dataframe df based on the character provided in the char variable. It creates a vector chars_to_filter containing the characters you want to filter out, and then uses grepl to keep only the rows where the Name column does not contain any of those characters. The resulting filtered dataframe is stored in df_filtered.
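The same steps can be wrapped in a small function so the kept character becomes an argument. This is only a sketch in base R: the name filter_by_char and its alphabet parameter are made up for illustration and do not appear in the original post.

```r
# Hypothetical helper (not from the original post): keep the rows whose Name
# contains no letter from `alphabet` other than `char`.
filter_by_char <- function(df, char, alphabet = c("a", "b", "c", "d")) {
  chars_to_filter <- setdiff(alphabet, char)
  # For char = "a" the pattern is "b|c|d"
  pattern <- paste(chars_to_filter, collapse = "|")
  df[!grepl(pattern, df$Name), , drop = FALSE]
}

df <- data.frame(Name = c("101a,102a", "101b,102b,103b", "103c", "102d,103d", "101,103"),
                 Value = c("2", "3", "4", "5", "6"))

filter_by_char(df, "a")  # keeps the "101a,102a" and "101,103" rows
filter_by_char(df, "c")  # keeps the "103c" and "101,103" rows
```

Note that the sketch assumes alphabet contains at least one letter besides char; with an empty pattern grepl matches every row, so nothing would be kept.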
Original question:
I have a dataframe with string and number data that I need to filter.
library(dplyr)
df <- data.frame(Name = c("101a,102a", "101b,102b,103b", "103c", "102d,103d", "101,103"),
                 Value = c("2", "3", "4", "5", "6"))
The supplementary characters in Name can only be a, b, c or d. I want to find the best way to remove all rows in which any character other than the one supplied to the filter occurs, while keeping rows where Name contains no letters at all. When filtering with "a", I want to remove every row that contains "b", "c" or "d", keeping the first and last rows:
Name         Value
"101a,102a"  "2"
"101,103"    "6"
I can probably do this with an if/else chain:
if (char == "a") {
  df <- filter(df, !grepl("b", Name) & !grepl("c", Name) & !grepl("d", Name))
} else if (char == "b") {
  df <- filter(df, !grepl("a", Name) & !grepl("c", Name) & !grepl("d", Name))
} else if (char == "c") {
  df <- filter(df, !grepl("a", Name) & !grepl("b", Name) & !grepl("d", Name))
} else if (char == "d") {
  df <- filter(df, !grepl("a", Name) & !grepl("b", Name) & !grepl("c", Name))
}
But I was hoping someone could help me find shorter, more efficient code. I'm looking for code that essentially does this:
"remove char from 'a,b,c,d' and filter out all rows where Name contains any of the remaining chars".
I tried:
abcd <- c("a", "b", "c", "d")
df <- filter(df, !Name %in% abcd[!abcd==char])
but %in% seems to use match, which requires an exact match of the whole string, so I tried
df <- filter(!grepl(paste(abcd[!abcd==char], collapse="|"),Name))
but I can't get the syntax right. I think I need some help creating the
(!grepl("a", Name) & !grepl("b", Name) & !grepl("c", Name))
part on the fly.
Answer 1
Score: 1
If you only ever filter on one letter at a time, you could use a function like this.
library(dplyr)
df <- data.frame(Name = c("101a,102a", "101b,102b,103b", "103c", "102d,103d", "101,103"),
                 Value = c("2", "3", "4", "5", "6"))
# Pass a single letter as `keep`; grepl() only uses the first element of a longer pattern vector
keep <- function(df, keep = c("a", "b", "c", "d")) {
  df[grepl(keep, df$Name, fixed = TRUE), ]
}
> keep(df, "a")
Name Value
1 101a,102a 2
> keep(df, "b")
Name Value
2 101b,102b,103b 3
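As the output shows, keep() drops the "101,103" row, which the question wanted to retain. A possible variant, sketched here under the assumption that each row contains at most one distinct letter from a to d (true for the sample data), also keeps rows without any letter. The name keep_or_plain is hypothetical.

```r
# Hypothetical variant of keep(): also retain rows that contain no letter at all.
# Assumes a row never mixes different letters (e.g. no "101a,102b").
keep_or_plain <- function(df, letter) {
  has_letter <- grepl(letter, df$Name, fixed = TRUE)  # rows containing the wanted letter
  has_any    <- grepl("[a-d]", df$Name)               # rows containing any of a, b, c, d
  df[has_letter | !has_any, , drop = FALSE]
}

df <- data.frame(Name = c("101a,102a", "101b,102b,103b", "103c", "102d,103d", "101,103"),
                 Value = c("2", "3", "4", "5", "6"))

keep_or_plain(df, "a")
keep_or_plain(df, "b")
```

If rows can mix letters, the exclusion-based approach (negating a match on the unwanted letters) is the safer choice.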
Answer 2
Score: 1
Use paste to create a regex from every character except the one you want to keep. Then filter, negating the result of grepl.
suppressPackageStartupMessages(
  library(dplyr)
)
df <- data.frame(Name = c("101a,102a", "101b,102b,103b", "103c", "102d,103d", "101,103"),
                 Value = c("2", "3", "4", "5", "6"))
abcd <- c("a", "b", "c", "d")
char <- "a"
discard <- paste(abcd[abcd != char], collapse = "|")
filter(df, !grepl(discard, Name))
#> Name Value
#> 1 101a,102a 2
#> 2 101,103 6
Created on 2023-05-28 with reprex v2.0.2
A base R way is the following.
char <- "a"
discard <- paste(abcd[abcd != char], collapse = "|")
df[grep(discard, df$Name, invert = TRUE), ]
#> Name Value
#> 1 101a,102a 2
#> 5 101,103 6
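For readers who want the chain of !grepl(...) conditions from the question built on the fly, here is one possible base R sketch. It creates one logical vector per unwanted letter and combines them with Reduce. The variable names others, absent and keep_row are illustrative only.

```r
df <- data.frame(Name = c("101a,102a", "101b,102b,103b", "103c", "102d,103d", "101,103"),
                 Value = c("2", "3", "4", "5", "6"))
abcd <- c("a", "b", "c", "d")
char <- "a"

# Letters that must not appear in a kept row
others <- abcd[abcd != char]

# One logical vector per unwanted letter: TRUE where that letter is absent
absent <- lapply(others, function(ch) !grepl(ch, df$Name, fixed = TRUE))

# Elementwise AND across the list, i.e. !grepl("b") & !grepl("c") & !grepl("d")
keep_row <- Reduce(`&`, absent)

df[keep_row, ]
```

For single-character alternatives, a character class built with paste0("[", paste(others, collapse = ""), "]") should select the same rows as the "b|c|d" alternation.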
Answer 3
Score: 1
Here is a solution that brings the data into long format with separate_rows():
library(tidyverse)
df %>%
  separate_rows(Name) %>%
  mutate(x = str_extract(Name, "[A-Za-z]"),
         Name = parse_number(Name)) %>%
  filter(x == "a" | is.na(x)) %>%
  mutate(Name = ifelse(!is.na(x), paste0(Name, x), Name)) %>%
  summarise(Name = toString(Name), .by = Value)
Value Name
<chr> <chr>
1 2 101a, 102a
2 6 101, 103
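A similar token-by-token check can be sketched in base R without reshaping the data. It is not equivalent to the pipeline above, which filters individual tokens and rebuilds Name. This version keeps or drops whole rows and leaves Name untouched. The function name rows_with_only is hypothetical, and the sketch assumes tokens are digits optionally followed by letters.

```r
# Hypothetical helper: keep rows in which every comma-separated token has
# either no letter suffix or exactly the suffix `char`.
rows_with_only <- function(df, char) {
  tokens <- strsplit(as.character(df$Name), ",", fixed = TRUE)
  ok <- vapply(tokens, function(tok) {
    suffix <- gsub("[0-9]", "", tok)  # strip digits, leaving "" or the letter(s)
    all(suffix %in% c("", char))
  }, logical(1))
  df[ok, , drop = FALSE]
}

df <- data.frame(Name = c("101a,102a", "101b,102b,103b", "103c", "102d,103d", "101,103"),
                 Value = c("2", "3", "4", "5", "6"))

rows_with_only(df, "a")
```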