R dataframe/ lapply(): get rid of rows with particular values in columns containing particular strings, while keeping everything else?


Question

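The following is the translated code portion:

# Use lapply() to drop rows whose flag columns contain a 4
filtered_dataframes <- lapply(df_list, function(df) {
  # Get the names of the flag columns
  flags_columns <- grep("flags", names(df), value = TRUE)

  # Loop over each flag column, dropping the rows that contain a 4
  for (col in flags_columns) {
    df <- df[!grepl("4", df[[col]]), ]
  }

  return(df)
})

Note that this code looks for rows containing the digit 4 in the flag columns of each data frame and removes those rows from the data frame. The result, filtered_dataframes, is the list of filtered data frames.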


I have 16 dataframes I am trying to quality check and delete poor quality rows in R. I already know of lapply() and have used it for simpler wrangling problems to apply the same thing to all my dataframes at once, but for whatever reason I'm having a mental block currently.

The format of each individual dataframe is like so, where every other column is a "flags" column. Each flags column contains a string of values. If any of the values in the string is a 4, I want to filter that row out of the dataframe.

head(df)

timestamp    wind_speed_max wind_speed_max_flags   wind_speed_mean
1            UTC meters per second                  NAN meters per second
2    data logger   Airmar WS-200WX                  NAN   Airmar WS-200WX
3 6/2/2015 15:46               7.6              1 1 4 1              5.12
4 6/2/2015 16:01               7.2              1 1 1 1              5.16
5 6/2/2015 16:16               8.1              1 1 1 1              5.97
6 6/2/2015 16:31               8.5              1 1 1 1             5.909
  wind_speed_mean_flags wind_direction_mean wind_direction_mean_flags
1                   NAN             degrees                       NAN
2                   NAN     Airmar WS-200WX                       NAN
3               1 1 1 1               57.14                   1 2 1 2
4               1 1 1 1               61.64                   1 2 1 4
5               1 1 1 1                  68                   1 2 1 2
6               4 1 1 1               73.14                   1 2 1 2

I know I can grep("flags") over the column names, and I think I could use a similar grep method to find the strings containing a 4, perhaps with some Boolean operators. But I am struggling to piece all of this together while retaining the rest of the data, ideally doing it for all 16 dataframes at once, for example lapply(df_list, function(x) <insert code that can filter out flags with 4s for each x dataframe>)
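The two building blocks mentioned above can be tried in isolation first. This is only a sketch: the column names are taken from the head(df) output, while the flag strings (including the "1 14 1" value) are made up for illustration. It also shows a stricter token-based check, which assumes the flags are separated by single spaces.

```r
# Column names as shown in the question's head(df) output
col_names <- c("timestamp", "wind_speed_max", "wind_speed_max_flags",
               "wind_speed_mean", "wind_speed_mean_flags")

# Building block 1: pick out the names of the flag columns
flag_names <- grep("flags", col_names, value = TRUE)

# Building block 2: test flag strings for a 4 (illustrative values)
flags <- c("1 1 4 1", "1 1 1 1", "NAN", "1 14 1")
has_4 <- grepl("4", flags, fixed = TRUE)

# Stricter variant: split on spaces and look for the exact token "4",
# so that a hypothetical multi-digit flag such as "14" is not matched
has_4_token <- vapply(
  strsplit(flags, " ", fixed = TRUE),
  function(tokens) "4" %in% tokens,
  logical(1)
)
```

If the flags are always single digits, as in the sample data, the plain grepl() check and the token-based check give the same answer.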

Answer 1

Score: 1


Let's start by writing code to filter one data frame: we'll take the columns that include "flags" in the name and grepl each one for "4". Then we'll use rowSums to count, for each row, how many flag columns contain a 4, keeping only rows where that count is 0.

# count the number of 4s in each row of "flag" cols of `df`
count_4 = df[grepl("flags", names(df))] |>
  sapply(grepl, pattern = "4") |>
  rowSums(na.rm = TRUE)
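To see what count_4 looks like, here is a small runnable sketch. The values are copied from the four data rows of the question's head(df) output (the units and instrument header rows are left out), and the steps are written without the pipe so they also run on R versions older than 4.1.

```r
# Toy data frame built from the data rows shown in the question
df <- data.frame(
  timestamp                 = c("6/2/2015 15:46", "6/2/2015 16:01",
                                "6/2/2015 16:16", "6/2/2015 16:31"),
  wind_speed_max            = c("7.6", "7.2", "8.1", "8.5"),
  wind_speed_max_flags      = c("1 1 4 1", "1 1 1 1", "1 1 1 1", "1 1 1 1"),
  wind_speed_mean           = c("5.12", "5.16", "5.97", "5.909"),
  wind_speed_mean_flags     = c("1 1 1 1", "1 1 1 1", "1 1 1 1", "4 1 1 1"),
  wind_direction_mean       = c("57.14", "61.64", "68", "73.14"),
  wind_direction_mean_flags = c("1 2 1 2", "1 2 1 4", "1 2 1 2", "1 2 1 2"),
  stringsAsFactors = FALSE
)

# Logical matrix: one row per data row, one column per flag column,
# TRUE where the flag string contains a 4
has_4 <- sapply(df[grepl("flags", names(df))], grepl, pattern = "4")

# Number of flag columns containing a 4, per row
count_4 <- rowSums(has_4, na.rm = TRUE)
count_4
# → 1 1 0 1

# Keep only the rows with no 4 in any flag column
kept <- df[count_4 == 0, ]
```

With this sample, only the 6/2/2015 16:16 row survives, since each of the other three rows has a 4 in one of its flag columns.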

Putting it in lapply:

modified_data_list = lapply(data_list, function(df) {
  count_4 = df[grepl("flags", names(df))] |>
    sapply(grepl, pattern = "4") |>
    rowSums(na.rm = TRUE)
  df[count_4 == 0, ]
})
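One possible variation, as a sketch rather than a replacement for the code above: wrap the filter in a named helper and combine the per-column results with Reduce() instead of sapply() plus rowSums(). The helper name drop_flagged_rows, its arguments, and the toy data_list are all made up for this example. The motivation is that sapply() simplifies to a plain vector when a data frame has a single row, which rowSums() does not accept, while this version handles that case.

```r
# Hypothetical helper: drop rows where any flag column contains `bad`
drop_flagged_rows <- function(df, bad = "4", pattern = "flags") {
  flag_cols <- grep(pattern, names(df), value = TRUE)

  # One logical vector per flag column: TRUE where the string contains `bad`
  hits <- lapply(df[flag_cols], grepl, pattern = bad, fixed = TRUE)

  # A row is flagged if any of its flag columns has a hit;
  # starting from all-FALSE also covers data frames with no flag columns
  any_hit <- Reduce(`|`, hits, rep(FALSE, nrow(df)))

  df[!any_hit, , drop = FALSE]
}

# Toy list of data frames (illustrative values only)
data_list <- list(
  a = data.frame(x = c(1, 2, 3),
                 x_flags = c("1 1 1", "1 4 1", "1 1 1"),
                 stringsAsFactors = FALSE),
  b = data.frame(y = 10, y_flags = "1 1 1",
                 stringsAsFactors = FALSE)  # single-row data frame
)

modified_data_list <- lapply(data_list, drop_flagged_rows)
sapply(modified_data_list, nrow)
```

Here data frame a loses its second row and the single-row data frame b is returned unchanged.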

huangapple
  • Published on February 24, 2023 at 02:35:53
  • When reposting, please keep the link to this article: https://go.coder-hub.com/75548970.html