

Best way to exclude any of multiple characters in a string column when filtering a dataframe in R



library(dplyr)  # provides filter() and %>%

# Your input dataframe
df <- data.frame(Name = c("101a,102a", "101b,102b,103b", "103c", "102d,103d", "101,103"),
                 Value = c("2", "3", "4", "5", "6"))

# The character you want to filter on
char <- "a"

# Create a vector of characters to filter out
chars_to_filter <- setdiff(c("a", "b", "c", "d"), char)

# Keep only the rows whose Name matches none of the other characters
df_filtered <- df %>%
  filter(!grepl(paste(chars_to_filter, collapse = "|"), Name))


This code filters the dataframe df based on the character stored in char. It builds a vector chars_to_filter holding the other characters, joins them into the regular expression "b|c|d", and uses grepl to keep only the rows whose Name contains none of them. Rows with no supplementary character, such as "101,103", are therefore kept as well. The result is stored in df_filtered.
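The same idea can be wrapped in a small helper so that the character to keep becomes an argument. This is only a minimal base R sketch: the function name `filter_by_char` and its `alphabet` argument are made up here for illustration, and it assumes the supplementary characters are plain letters, so they can be placed directly inside a regex character class.

```r
# Hypothetical helper (not from the original answer): keep the rows whose Name
# contains no supplementary character other than `char`.
# Assumes `alphabet` holds plain letters with no special regex meaning.
filter_by_char <- function(df, char, alphabet = c("a", "b", "c", "d")) {
  others <- setdiff(alphabet, char)
  if (length(others) == 0) return(df)                         # nothing to exclude
  pattern <- paste0("[", paste(others, collapse = ""), "]")   # e.g. "[bcd]"
  df[!grepl(pattern, df$Name), , drop = FALSE]
}

df <- data.frame(Name = c("101a,102a", "101b,102b,103b", "103c", "102d,103d", "101,103"),
                 Value = c("2", "3", "4", "5", "6"),
                 stringsAsFactors = FALSE)

res_a <- filter_by_char(df, "a")   # rows "101a,102a" and "101,103"
res_c <- filter_by_char(df, "c")   # rows "103c" and "101,103"
```

A character class such as `[bcd]` matches the same rows as the alternation `b|c|d`; it is used here only because it is slightly more compact.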


I have a dataframe with string and number data that I need to filter.

df <- data.frame(Name  = c("101a,102a", "101b,102b,103b", "103c", "102d,103d", "101,103"),
                 Value = c("2", "3", "4", "5", "6"))

The supplementary characters in Name can only be a, b, c or d. I want to find the best way to filter out all rows where characters other than the one provided in the filter occur, while keeping rows where Name contains no supplementary character at all. When filtering with "a", I wish to remove all rows that contain "b", "c" or "d", keeping the first and last rows:

Name                Value
"101a,102a"         "2"
"101,103"           "6"

I can probably do this with if elses

# A row is kept only if all the other characters are absent, hence & rather than |
if (char == "a") {
	df <- filter(df, !grepl("b", Name) & !grepl("c", Name) & !grepl("d", Name))
} else if (char == "b") {
	df <- filter(df, !grepl("a", Name) & !grepl("c", Name) & !grepl("d", Name))
} else if (char == "c") {
	df <- filter(df, !grepl("a", Name) & !grepl("b", Name) & !grepl("d", Name))
} else if (char == "d") {
	df <- filter(df, !grepl("a", Name) & !grepl("b", Name) & !grepl("c", Name))
}

But I was hoping someone could help me find something shorter and more efficient. I'm looking for code that essentially does this:

"remove char from 'a,b,c,d' and keep only the rows where Name does not contain any of the remaining chars".

I tried:

abcd <- c("a", "b", "c", "d")
df <- filter(df, !Name %in% abcd[!abcd==char])

but %in% seems to use match which requires perfect match, so I tried

df <- filter(!grepl(paste(abcd[!abcd==char], collapse="|"),Name))

but I can't get the right syntax. I think I need some help creating the

(!grepl("a", Name) & !grepl("b", Name) & !grepl("c", Name))

part on the fly.
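For reference, a condition of that shape can be assembled programmatically. The sketch below is one possible way, using base R only and illustrative variable names: `lapply` builds one logical vector per unwanted character, and `Reduce` combines them with `&`, since a row should be kept only if every unwanted character is absent. It also shows why `%in%` does not help here.

```r
df <- data.frame(Name = c("101a,102a", "101b,102b,103b", "103c", "102d,103d", "101,103"),
                 Value = c("2", "3", "4", "5", "6"),
                 stringsAsFactors = FALSE)

abcd <- c("a", "b", "c", "d")
char <- "a"
others <- abcd[abcd != char]

# %in% compares whole strings, so no Name ever equals "b", "c" or "d"
exact <- df$Name %in% others

# One logical vector per unwanted character: TRUE where it is absent
absent <- lapply(others, function(ch) !grepl(ch, df$Name, fixed = TRUE))

# Combine with &: keep a row only if every unwanted character is absent
keep_row <- Reduce(`&`, absent)

df[keep_row, ]
```

With `char <- "a"` this keeps the rows "101a,102a" and "101,103".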


Score: 1


If you only ever filter on one letter at a time, you could use a function like this.

df <- data.frame(Name  = c("101a,102a", "101b,102b,103b", "103c", "102d,103d", "101,103"),
                 Value = c("2", "3", "4", "5", "6"))

# Keep the rows whose Name contains the given letter (literal match)
keep <- function(df, keep = c("a", "b", "c", "d")) {
     keep <- match.arg(keep)  # exactly one letter; defaults to "a"
     df[grepl(keep, df$Name, fixed = TRUE), ]
}

> keep(df, "a")
       Name Value
1 101a,102a     2

> keep(df, "b")
            Name Value
2 101b,102b,103b     3


Score: 1

Use paste to build a regex from every character except the one you want to keep. Then filter on the negated result of grepl.


library(dplyr)  # provides filter()

df <- data.frame(Name  = c("101a,102a", "101b,102b,103b", "103c", "102d,103d", "101,103"),
                 Value = c("2", "3", "4", "5", "6"))

abcd <- c("a", "b", "c", "d")

char <- "a"

# Regex matching any character other than char, here "b|c|d"
discard <- paste(abcd[abcd != char], collapse = "|")
filter(df, !grepl(discard, Name))
#>        Name Value
#> 1 101a,102a     2
#> 2   101,103     6

Created on 2023-05-28 with reprex v2.0.2

A base R way is the following.

char <- "a"

discard <- paste(abcd[abcd != char], collapse = "|")
df[grep(discard, df$Name, invert = TRUE), ]
#>        Name Value
#> 1 101a,102a     2
#> 5   101,103     6

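A variation on the same idea, sketched here under the assumption that a valid Name consists only of digits, commas and the supplementary letters: instead of listing the characters to discard, test that the whole string is made of digits, commas and the one permitted character. This avoids hard-coding `abcd`. The name `allowed_only` is hypothetical, and the regex assumes `char` is a single letter with no special meaning inside a character class.

```r
df <- data.frame(Name  = c("101a,102a", "101b,102b,103b", "103c", "102d,103d", "101,103"),
                 Value = c("2", "3", "4", "5", "6"),
                 stringsAsFactors = FALSE)

# Hypothetical helper: TRUE where x contains only digits, commas and `char`
allowed_only <- function(x, char) {
  pattern <- paste0("^[0-9,", char, "]*$")   # e.g. "^[0-9,a]*$"
  grepl(pattern, x)
}

char <- "a"
ok <- allowed_only(df$Name, char)
df[ok, ]
```

Unlike the `discard` approach, this version would also reject a Name containing any unexpected character (for example a space or an "e"), which may or may not be what you want.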


Score: 1



Here is a solution that brings the data into long format with separate_rows():


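library(dplyr)    # mutate(), filter(), summarise(); .by needs dplyr >= 1.1.0
library(tidyr)    # separate_rows()
library(stringr)  # str_extract()
library(readr)    # parse_number()

df %>% 
  separate_rows(Name) %>%                          # one row per comma-separated item
  mutate(x = str_extract(Name, "[A-Za-z]"),        # the letter, NA if there is none
         Name = parse_number(Name)) %>% 
  filter(x == "a" | is.na(x)) %>%                  # keep "a" items and letter-free items
  mutate(Name = ifelse(!is.na(x), paste0(Name, x), Name)) %>% 
  summarise(Name = toString(Name), .by=Value)      # collapse back to one row per Value

  Value Name      
  <chr> <chr>     
1 2     101a, 102a
2 6     101, 103  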
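For comparison, the same split, filter and recombine steps can be sketched in base R. This is an illustration rather than an exact translation of the pipeline: `keep_items` is a made-up name, and, like the pipeline above, it filters individual comma-separated items, so a mixed Name such as "101a,102b" would be trimmed to "101a" instead of being dropped.

```r
df <- data.frame(Name  = c("101a,102a", "101b,102b,103b", "103c", "102d,103d", "101,103"),
                 Value = c("2", "3", "4", "5", "6"),
                 stringsAsFactors = FALSE)

# Hypothetical helper: item-wise filtering of comma-separated Names
keep_items <- function(df, char) {
  items <- strsplit(df$Name, ",", fixed = TRUE)     # one character vector per row
  kept <- lapply(items, function(it) {
    letter <- gsub("[^A-Za-z]", "", it)             # letters of each item, "" if none
    it[letter == "" | letter == char]               # keep letter-free and matching items
  })
  has_items <- lengths(kept) > 0                    # drop rows with nothing left
  data.frame(Value = df$Value[has_items],
             Name  = vapply(kept[has_items], toString, character(1)),
             stringsAsFactors = FALSE)
}

res <- keep_items(df, "a")
res
```

If whole rows should be dropped as soon as any other letter appears, the row-wise `grepl` filters shown in the other answers are the closer fit.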

  • Published on 2023-05-29 03:34:28
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76353293.html



