Manipulate dataframe and summarise
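Question
One way to approach this: first group the data frame by the ID column and evaluate the flags within each ID group. Then, depending on which flags are set, drop the whole ID group or drop only the unimportant records, and finally compute the median for each ID group.
Below is one possible solution that uses the dplyr and tidyr packages to process the data: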
library(dplyr)
library(tidyr)
# Create the data frame
ID <- c("KMT1","KMT1","KMT1","KMT2","KMT2","KMT2","KMT3","KMT3","KMT3")
Date <- c("01-03-2015","01-03-2015","01-03-2015",
"04-06-2014","04-06-2014","04-06-2014",
"07-01-2019","07-01-2019","07-01-2019")
TimeUTC <- c("10:22:05","10:22:05","10:22:05",
"10:25:05","10:25:05","10:25:05",
"10:23:05","10:23:05","10:23:05")
V1 <- c(0.01,0.003,0.04,0.03,0.02,0.05,0.03,0.1,0.02)
V2 <- c(0.02,0.002,0.02,0.003,0.002,0.09,0.01,0.05,0.023)
V3 <- c(0.04,0.008,0.06,0.09,0.004,0.05,0.01,0.003,0.04)
V4 <- c(0.08,0.009,0.08,0.09,0.004,0.05,0.05,0.03,0.1)
Flag1 <- c(0,0,0,1,0,1,0,0,1)
Flag2 <- c(0,0,0,0,0,1,1,0,0)
Flag3 <- c(0,0,0,0,0,0,1,1,0)
df1 <- data.frame(ID,Date,TimeUTC,V1,V2,V3,V4,Flag1,Flag2,Flag3)
# Group by the ID column
df_filtered <- df1 %>%
  group_by(ID) %>%
  # Sum each flag within every ID group
  summarise(Flag1_sum = sum(Flag1),
            Flag2_sum = sum(Flag2),
            Flag3_sum = sum(Flag3)) %>%
  # Keep only the groups in which no flag is active
  filter(Flag1_sum == 0 & Flag2_sum == 0 & Flag3_sum == 0) %>%
  # Keep only the ID column
  select(ID)
# Use the surviving IDs to filter the original data
filtered_data <- df1 %>%
  semi_join(df_filtered, by = "ID")
# Compute the median of every V column per ID group
median_data <- filtered_data %>%
  group_by(ID) %>%
  summarise(across(starts_with("V"), median))
# Print the result
print(median_data)
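This code first sums the flags within each ID group and then drops every ID group that has any active flag. It then computes the median of each V column for the remaining ID groups and prints the result.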
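The code above loses the Date and TimeUTC columns at the summarise() step, and the question asks to keep them. Here is a minimal sketch of one way around that, assuming Date and TimeUTC are constant within each ID, as they are in the sample data. It groups on all three columns and filters inside the groups. The name median_b and the ".med" suffix are illustrative choices, not part of the original code.

```r
library(dplyr)

# Same sample data as above, rebuilt so this block runs on its own
df1 <- data.frame(
  ID      = rep(c("KMT1", "KMT2", "KMT3"), each = 3),
  Date    = rep(c("01-03-2015", "04-06-2014", "07-01-2019"), each = 3),
  TimeUTC = rep(c("10:22:05", "10:25:05", "10:23:05"), each = 3),
  V1 = c(0.01, 0.003, 0.04, 0.03, 0.02, 0.05, 0.03, 0.1, 0.02),
  V2 = c(0.02, 0.002, 0.02, 0.003, 0.002, 0.09, 0.01, 0.05, 0.023),
  V3 = c(0.04, 0.008, 0.06, 0.09, 0.004, 0.05, 0.01, 0.003, 0.04),
  V4 = c(0.08, 0.009, 0.08, 0.09, 0.004, 0.05, 0.05, 0.03, 0.1),
  Flag1 = c(0, 0, 0, 1, 0, 1, 0, 0, 1),
  Flag2 = c(0, 0, 0, 0, 0, 1, 1, 0, 0),
  Flag3 = c(0, 0, 0, 0, 0, 0, 1, 1, 0)
)

# Scenario (B): drop every ID that has any active flag.
# Grouping on Date and TimeUTC as well keeps them in the summary
# (assumption: they do not vary within an ID).
median_b <- df1 %>%
  group_by(ID, Date, TimeUTC) %>%
  filter(all(Flag1 == 0, Flag2 == 0, Flag3 == 0)) %>%
  summarise(across(V1:V4, median, .names = "{.col}.med"), .groups = "drop")

print(median_b)
```

With the sample data only KMT1 survives, which matches case (B) in the question.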
I have a large dataframe that I need to filter and then summarise. The df looks similar to this:
ID<-c("KMT1","KMT1","KMT1","KMT2","KMT2","KMT2","KMT3","KMT3","KMT3")
Date<-c("01-03-2015","01-03-2015","01-03-2015",
"04-06-2014","04-06-2014","04-06-2014",
"07-01-2019","07-01-2019","07-01-2019")
TimeUTC<-c("10:22:05","10:22:05","10:22:05",
"10:25:05","10:25:05","10:25:05",
"10:23:05","10:23:05","10:23:05")
V1<-c(0.01,0.003,0.04,0.03,0.02,0.05,0.03,0.1,0.02)
V2<-c(0.02,0.002,0.02,0.003,0.002,0.09,0.01,0.05,0.023)
V3<-c(0.04,0.008,0.06,0.09,0.004,0.05,0.01,0.003,0.04)
V4<-c(0.08,0.009,0.08,0.09,0.004,0.05,0.05,0.03,0.1)
Flag1<-c(0,0,0,1,0,1,0,0,1)
Flag2<-c(0,0,0,0,0,1,1,0,0)
Flag3<-c(0,0,0,0,0,0,1,1,0)
df1<-data.frame(ID,Date,TimeUTC,V1,V2,V3,V4,Flag1,Flag2,Flag3)
df1
ID Date TimeUTC V1 V2 V3 V4 Flag1 Flag2 Flag3
1 KMT1 01-03-2015 10:22:05 0.010 0.020 0.040 0.080 0 0 0
2 KMT1 01-03-2015 10:22:05 0.003 0.002 0.008 0.009 0 0 0
3 KMT1 01-03-2015 10:22:05 0.040 0.020 0.060 0.080 0 0 0
4 KMT2 04-06-2014 10:25:05 0.030 0.003 0.090 0.090 1 0 0
5 KMT2 04-06-2014 10:25:05 0.020 0.002 0.004 0.004 0 0 0
6 KMT2 04-06-2014 10:25:05 0.050 0.090 0.050 0.050 1 1 0
7 KMT3 07-01-2019 10:23:05 0.030 0.010 0.010 0.050 0 1 1
8 KMT3 07-01-2019 10:23:05 0.100 0.050 0.003 0.030 0 0 1
9 KMT3 07-01-2019 10:23:05 0.020 0.023 0.040 0.100 1 0 0
I would like to filter the df based on the ID column so that I can assess whether any flags (0/1) are active for an ID, how many, and which ones. I then need to remove the whole ID group if a flag is active, or remove just one of the entries if the flag is not important, e.g. when only Flag3 is active.
After filtering the data I would like to calculate the median of each ID, so that the df would have 2 rows. The desired output would look something like this:
(A) Removed ID(KMT3) as flag 3 was active
ID Date TimeUTC V1.med V2.med V3.med V4.med
1 KMT1 01-03-2015 10:22:05 0.010 0.020 0.040 0.080
2 KMT2 04-06-2014 10:25:05 0.025 0.025 0.047 0.047
(B) Removed ID(KMT2 and KMT3) as flags are active
ID Date TimeUTC V1.med V2.med V3.med V4.med
1 KMT1 01-03-2015 10:22:05 0.01 0.02 0.04 0.08
I am new to R and not sure of the best way to approach this problem. I have tried using filter() from the dplyr package, which can remove all rows that contain an active flag (e.g. Flag1 = 1). However, I need to assess which flags are active and how many before I can remove the ID.
I have managed to calculate the median of each column using:
DT1<-df2 %>%
select("ID",V1:V4)%>%
group_by(ID)%>%
data.table()%>%
na.omit()
setnames(DT1[, sapply(.SD, function(x) list(median(x))), by=ID], c("ID", sapply(names(DT1)[-1], paste0, c(".median"))))
However, doing this I lose the Date and Time columns which are important for further processes.
Any help would be greatly appreciated.
Answer 1
Score: 0
Here's some code that gets you 1 row per ID, including the total number of flags, the number of flags of each type, and the median of the V* columns:
library(dplyr)
df1 |>
  summarize(
    across(starts_with("Flag"), sum, .names = "n_{.col}"),
    across(V1:V4, median, .names = "med_{.col}"),
    .by = c(ID, Date, TimeUTC)
  ) |>
  mutate(
    n_total_flags = rowSums(across(starts_with("n_")))
  )
# ID Date TimeUTC n_Flag1 n_Flag2 n_Flag3 med_V1 med_V2 med_V3 med_V4 n_total_flags
# 1 KMT1 01-03-2015 10:22:05 0 0 0 0.01 0.020 0.04 0.08 0
# 2 KMT2 04-06-2014 10:25:05 2 1 0 0.03 0.003 0.05 0.05 3
# 3 KMT3 07-01-2019 10:23:05 1 1 2 0.03 0.023 0.01 0.05 4
You could use this result to do whatever logic you need to and join back to the original data to do filtering.
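As a rough sketch of that last step, the block below applies one possible rule, case (A) from the question: drop every ID where Flag3 is active, then compute medians on what is left. The names flag_summary, keep_ids and medians_a are made up for illustration, and the rule itself is an assumption about what counts as an important flag. The `.by` argument needs dplyr 1.1.0 or later.

```r
library(dplyr)

# Sample data from the question, rebuilt so this block runs on its own
df1 <- data.frame(
  ID      = rep(c("KMT1", "KMT2", "KMT3"), each = 3),
  Date    = rep(c("01-03-2015", "04-06-2014", "07-01-2019"), each = 3),
  TimeUTC = rep(c("10:22:05", "10:25:05", "10:23:05"), each = 3),
  V1 = c(0.01, 0.003, 0.04, 0.03, 0.02, 0.05, 0.03, 0.1, 0.02),
  V2 = c(0.02, 0.002, 0.02, 0.003, 0.002, 0.09, 0.01, 0.05, 0.023),
  V3 = c(0.04, 0.008, 0.06, 0.09, 0.004, 0.05, 0.01, 0.003, 0.04),
  V4 = c(0.08, 0.009, 0.08, 0.09, 0.004, 0.05, 0.05, 0.03, 0.1),
  Flag1 = c(0, 0, 0, 1, 0, 1, 0, 0, 1),
  Flag2 = c(0, 0, 0, 0, 0, 1, 1, 0, 0),
  Flag3 = c(0, 0, 0, 0, 0, 0, 1, 1, 0)
)

# Count the active flags of each type per ID
flag_summary <- df1 |>
  summarize(
    across(starts_with("Flag"), sum, .names = "n_{.col}"),
    .by = c(ID, Date, TimeUTC)
  )

# Hypothetical rule for case (A): keep IDs where Flag3 was never active
keep_ids <- flag_summary |>
  filter(n_Flag3 == 0) |>
  select(ID)

# Join back to the original rows, then take medians per ID
medians_a <- df1 |>
  semi_join(keep_ids, by = "ID") |>
  summarize(
    across(V1:V4, median, .names = "{.col}.med"),
    .by = c(ID, Date, TimeUTC)
  )

print(medians_a)
```

On the sample data this keeps KMT1 and KMT2 and retains Date and TimeUTC. It uses all three rows of each kept ID; removing individual flagged rows before taking the median would need an extra filter() on df1 ahead of the final summarize().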