Looping over data frame to "clean" data
Question
This is the kind of data I have:
Date | Station | Param1 | Param2 |
---|---|---|---|
2020-01-01 | A | <5 | 45 |
2020-02-01 | B | <5 | 47 |
To be able to plot this data, mark the LOQ-values (<5) and compute some basic statistics, I need to create new columns with the LOQ-flag (<) and numeric values separated.
I don't have exact knowledge of the Param-names (they are actually "Fe", "Cu", "N-tot" and so on), so I would like to loop over the Param-columns (not Date and Station) and create two new columns for each Param, one with the numerical data and one with the LOQ-flag. Like this:
Date | Station | Param1_org | Param1_new | Param1_loq | Param2_org | Param2_new | Param2_loq |
---|---|---|---|---|---|---|---|
2020-01-01 | A | <5 | 5 | < | 45 | 45 | = |
2020-02-01 | B | <5 | 5 | < | 47 | 47 | = |
I have tried mutate (dplyr), but I am struggling with how to combine the conditions with gsub inside mutate and across. I also considered using apply with a list of Params, but got lost in the code.
I need some advice on which approach to choose, and a simple example of how to achieve this. I appreciate all help given!
Answer 1
Score: 0
Here is one way to do it:
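library(tidyverse)

# Example data matching the question
data <- tibble(Date = as.Date(c("2020-01-01", "2020-02-01")),
               Station = c("A", "B"),
               Param1 = c("<5", "<5"),
               Param2 = c("45", "47"))

# Pick the parameter columns by name pattern. With real names such as
# "Fe" or "Cu", setdiff(cols, c("Date", "Station")) selects them instead.
cols <- colnames(data)
param_cols <- cols[str_detect(cols, "^Param")]

for (col in param_cols) {
  col_org <- paste(col, "org", sep = "_")
  col_new <- paste(col, "new", sep = "_")
  col_loq <- paste(col, "loq", sep = "_")

  data <- data %>%
    mutate(!!col_org := .data[[col]],
           # numeric part only, converted from character to numeric
           !!col_new := as.numeric(str_extract(.data[[col]], "\\d+\\.?\\d*")),
           # "=" for plain numbers, "<" or ">" for flagged values
           !!col_loq := ifelse(str_detect(.data[[col]], "^\\d"),
                               "=",
                               ifelse(str_detect(.data[[col]], "^<"), "<", ">")),
           # drop the original column
           !!col := NULL)
}

print(data)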
This simply loops over all columns whose names start with Param and calls mutate on each one, using a second regex check to set the LOQ flag. The !! operator unquotes a variable so that its value can be used as a column name on the left-hand side of := in a dplyr call (note: dplyr version 1.0 or higher).
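Since the question also asks about mutate with across, the same result can probably be reached without an explicit loop. The following is only a sketch under a few assumptions: the column names Fe and Cu are made up for illustration, every column other than Date and Station is treated as a parameter column, and values use a decimal point rather than a decimal comma.

```r
library(dplyr)
library(stringr)
library(tibble)

# Illustrative data; Fe and Cu stand in for the real parameter names
data <- tibble(
  Date    = as.Date(c("2020-01-01", "2020-02-01")),
  Station = c("A", "B"),
  Fe      = c("<5", "<5"),
  Cu      = c("45", "47")
)

# Assumption: everything except Date and Station is a parameter column
param_cols <- setdiff(names(data), c("Date", "Station"))

result <- data %>%
  mutate(
    across(
      all_of(param_cols),
      list(
        # original text value, kept for reference
        org = ~ .x,
        # numeric part, converted from character to numeric
        new = ~ as.numeric(str_extract(.x, "\\d+\\.?\\d*")),
        # LOQ flag: "<" or ">" if present, otherwise "="
        loq = ~ case_when(
          str_detect(.x, "^<") ~ "<",
          str_detect(.x, "^>") ~ ">",
          TRUE                 ~ "="
        )
      ),
      .names = "{.col}_{.fn}"
    )
  ) %>%
  # drop the original parameter columns
  select(-all_of(param_cols))

print(result)
```

Each function in the list yields one new column per parameter, and .names controls the suffix, so the output columns come out grouped per parameter (Fe_org, Fe_new, Fe_loq, Cu_org, ...). Note that a missing value would be flagged "=" by the fallback branch, which may or may not be what you want.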