Manipulate dataframe and summarise

Question

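To approach this problem, first group the data frame by the ID column and evaluate the flags within each ID group. Then, depending on which flags are active, either drop the whole ID group or drop only the unimportant records. Finally, compute the median for each ID group.

Here is one possible solution that uses the dplyr package:

library(dplyr)

# Create the data frame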
ID <- c("KMT1","KMT1","KMT1","KMT2","KMT2","KMT2","KMT3","KMT3","KMT3")
Date <- c("01-03-2015","01-03-2015","01-03-2015",
          "04-06-2014","04-06-2014","04-06-2014",
          "07-01-2019","07-01-2019","07-01-2019")
TimeUTC <- c("10:22:05","10:22:05","10:22:05",
              "10:25:05","10:25:05","10:25:05",
              "10:23:05","10:23:05","10:23:05")
V1 <- c(0.01,0.003,0.04,0.03,0.02,0.05,0.03,0.1,0.02)
V2 <- c(0.02,0.002,0.02,0.003,0.002,0.09,0.01,0.05,0.023)
V3 <- c(0.04,0.008,0.06,0.09,0.004,0.05,0.01,0.003,0.04)
V4 <- c(0.08,0.009,0.08,0.09,0.004,0.05,0.05,0.03,0.1)
Flag1 <- c(0,0,0,1,0,1,0,0,1)
Flag2 <- c(0,0,0,0,0,1,1,0,0)
Flag3 <- c(0,0,0,0,0,0,1,1,0)

df1 <- data.frame(ID,Date,TimeUTC,V1,V2,V3,V4,Flag1,Flag2,Flag3)

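# Group by ID
df_filtered <- df1 %>%
  group_by(ID) %>%
  # Count the active flags of each type within each ID group
  summarise(Flag1_sum = sum(Flag1),
            Flag2_sum = sum(Flag2),
            Flag3_sum = sum(Flag3)) %>%
  # Keep only the ID groups with no active flag
  filter(Flag1_sum == 0 & Flag2_sum == 0 & Flag3_sum == 0) %>%
  # Keep only the ID column
  select(ID)

# Use the surviving IDs to filter the original data
filtered_data <- df1 %>%
  semi_join(df_filtered, by = "ID")

# Compute per-group medians; grouping by Date and TimeUTC as well keeps those columns
median_data <- filtered_data %>%
  group_by(ID, Date, TimeUTC) %>%
  summarise(across(starts_with("V"), median), .groups = "drop")

# Print the result
print(median_data)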

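This code first sums the flags within each ID group and then drops every ID group that has any active flag. It then computes the median of each V column for the remaining ID groups and prints the result.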
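The code above drops an ID group when any flag is active. A minimal sketch of a variant follows, assuming that only Flag3 is treated as disqualifying, so that an ID group is removed only when Flag3 is active on at least one of its rows. The name `median_a` and the `.med` suffix are illustrative choices, not part of the original code.

```r
library(dplyr)

# Same sample data as in the question, built compactly with rep()
df1 <- data.frame(
  ID      = rep(c("KMT1", "KMT2", "KMT3"), each = 3),
  Date    = rep(c("01-03-2015", "04-06-2014", "07-01-2019"), each = 3),
  TimeUTC = rep(c("10:22:05", "10:25:05", "10:23:05"), each = 3),
  V1 = c(0.01, 0.003, 0.04, 0.03, 0.02, 0.05, 0.03, 0.1, 0.02),
  V2 = c(0.02, 0.002, 0.02, 0.003, 0.002, 0.09, 0.01, 0.05, 0.023),
  V3 = c(0.04, 0.008, 0.06, 0.09, 0.004, 0.05, 0.01, 0.003, 0.04),
  V4 = c(0.08, 0.009, 0.08, 0.09, 0.004, 0.05, 0.05, 0.03, 0.1),
  Flag1 = c(0, 0, 0, 1, 0, 1, 0, 0, 1),
  Flag2 = c(0, 0, 0, 0, 0, 1, 1, 0, 0),
  Flag3 = c(0, 0, 0, 0, 0, 0, 1, 1, 0)
)

median_a <- df1 %>%
  group_by(ID, Date, TimeUTC) %>%
  # Assumption: only Flag3 disqualifies a group; drop the group if any row has it set
  filter(!any(Flag3 == 1)) %>%
  # Median of each V column per remaining group
  summarise(across(V1:V4, median, .names = "{.col}.med"), .groups = "drop")

print(median_a)
```

With the sample data this keeps KMT1 and KMT2 and removes KMT3. Swapping the condition inside `filter()` is enough to express a different rule about which flags matter.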

Original question:

I have a large data frame that I need to filter and then summarise. The df looks similar to this:

ID <- c("KMT1","KMT1","KMT1","KMT2","KMT2","KMT2","KMT3","KMT3","KMT3")
Date <- c("01-03-2015","01-03-2015","01-03-2015",
          "04-06-2014","04-06-2014","04-06-2014",
          "07-01-2019","07-01-2019","07-01-2019")
TimeUTC <- c("10:22:05","10:22:05","10:22:05",
             "10:25:05","10:25:05","10:25:05",
             "10:23:05","10:23:05","10:23:05")
V1 <- c(0.01,0.003,0.04,0.03,0.02,0.05,0.03,0.1,0.02)
V2 <- c(0.02,0.002,0.02,0.003,0.002,0.09,0.01,0.05,0.023)
V3 <- c(0.04,0.008,0.06,0.09,0.004,0.05,0.01,0.003,0.04)
V4 <- c(0.08,0.009,0.08,0.09,0.004,0.05,0.05,0.03,0.1)
Flag1 <- c(0,0,0,1,0,1,0,0,1)
Flag2 <- c(0,0,0,0,0,1,1,0,0)
Flag3 <- c(0,0,0,0,0,0,1,1,0)
df1 <- data.frame(ID,Date,TimeUTC,V1,V2,V3,V4,Flag1,Flag2,Flag3)
df1
    ID       Date  TimeUTC    V1    V2    V3    V4 Flag1 Flag2 Flag3
1 KMT1 01-03-2015 10:22:05 0.010 0.020 0.040 0.080     0     0     0
2 KMT1 01-03-2015 10:22:05 0.003 0.002 0.008 0.009     0     0     0
3 KMT1 01-03-2015 10:22:05 0.040 0.020 0.060 0.080     0     0     0
4 KMT2 04-06-2014 10:25:05 0.030 0.003 0.090 0.090     1     0     0
5 KMT2 04-06-2014 10:25:05 0.020 0.002 0.004 0.004     0     0     0
6 KMT2 04-06-2014 10:25:05 0.050 0.090 0.050 0.050     1     1     0
7 KMT3 07-01-2019 10:23:05 0.030 0.010 0.010 0.050     0     1     1
8 KMT3 07-01-2019 10:23:05 0.100 0.050 0.003 0.030     0     0     1
9 KMT3 07-01-2019 10:23:05 0.020 0.023 0.040 0.100     1     0     0

I would like to filter the df by the ID column so that I can assess whether any flags (0/1) are active for an ID, how many, and which ones. I then need to either remove the whole ID group if a flag is active, or remove only one of the entries if the flag is not important, e.g. when only Flag3 is active.

After filtering the data I would like to calculate the median of each ID, so that the df would have 2 rows. The desired output would look something like this:

(A) Removed ID(KMT3) as flag 3 was active

    ID       Date  TimeUTC V1.med V2.med V3.med V4.med
1 KMT1 01-03-2015 10:22:05  0.010  0.020  0.040  0.080
2 KMT2 04-06-2014 10:25:05  0.025  0.025  0.047  0.047

(B) Removed ID(KMT2 and KMT3) as flags are active

    ID       Date  TimeUTC V1.med V2.med V3.med V4.med
1 KMT1 01-03-2015 10:22:05   0.01   0.02   0.04   0.08

I am new to R and not sure of the best way to approach this problem. I have tried using filter() from the dplyr package, which can remove all rows that contain an active flag (e.g. Flag1 = 1). However, I need to assess which flags are active and how many before I can remove the ID.

I have managed to calculate the median of each column using:

DT1 <- df2 %>%
  select("ID", V1:V4) %>%
  group_by(ID) %>%
  data.table() %>%
  na.omit()
setnames(DT1[, sapply(.SD, function(x) list(median(x))), by = ID],
         c("ID", sapply(names(DT1)[-1], paste0, c(".median"))))

However, doing this I lose the Date and Time columns which are important for further processes.

Any help would be greatly appreciated.

Answer 1

Score: 0

Here's some code that gets you 1 row per ID, including the total number of flags, the number of flags of each type, and the median of the V* columns:

library(dplyr)
df1 |>
  summarize(
    across(starts_with("Flag"), sum, .names = "n_{.col}"),
    across(V1:V4, median, .names = "med_{.col}"),
    .by = c(ID, Date, TimeUTC)
  ) |>
  mutate(
    n_total_flags = rowSums(across(starts_with("n_"))),
  )
#     ID       Date  TimeUTC n_Flag1 n_Flag2 n_Flag3 med_V1 med_V2 med_V3 med_V4 n_total_flags
# 1 KMT1 01-03-2015 10:22:05       0       0       0   0.01  0.020   0.04   0.08             0
# 2 KMT2 04-06-2014 10:25:05       2       1       0   0.03  0.003   0.05   0.05             3
# 3 KMT3 07-01-2019 10:23:05       1       1       2   0.03  0.023   0.01   0.05             4

You could use this result to do whatever logic you need to and join back to the original data to do filtering.
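As a rough sketch of what that last step might look like, the snippet below computes the flag counts per ID, keeps only the IDs with no active flag at all (case (B) in the question), and joins back to the original rows before taking medians. The names `flag_summary`, `keep_ids` and `result_b` are hypothetical, and the snippet assumes R 4.1 or later (for `|>`) and dplyr 1.1.0 or later (for `.by`).

```r
library(dplyr)

# Same sample data as in the question, built compactly with rep()
df1 <- data.frame(
  ID      = rep(c("KMT1", "KMT2", "KMT3"), each = 3),
  Date    = rep(c("01-03-2015", "04-06-2014", "07-01-2019"), each = 3),
  TimeUTC = rep(c("10:22:05", "10:25:05", "10:23:05"), each = 3),
  V1 = c(0.01, 0.003, 0.04, 0.03, 0.02, 0.05, 0.03, 0.1, 0.02),
  V2 = c(0.02, 0.002, 0.02, 0.003, 0.002, 0.09, 0.01, 0.05, 0.023),
  V3 = c(0.04, 0.008, 0.06, 0.09, 0.004, 0.05, 0.01, 0.003, 0.04),
  V4 = c(0.08, 0.009, 0.08, 0.09, 0.004, 0.05, 0.05, 0.03, 0.1),
  Flag1 = c(0, 0, 0, 1, 0, 1, 0, 0, 1),
  Flag2 = c(0, 0, 0, 0, 0, 1, 1, 0, 0),
  Flag3 = c(0, 0, 0, 0, 0, 0, 1, 1, 0)
)

# One row per ID with the number of active flags of each type and in total
flag_summary <- df1 |>
  summarize(across(starts_with("Flag"), sum, .names = "n_{.col}"), .by = ID) |>
  mutate(n_total_flags = rowSums(across(starts_with("n_"))))

# Assumed rule for case (B): keep only IDs with no active flag at all
keep_ids <- flag_summary |>
  filter(n_total_flags == 0) |>
  select(ID)

# Join back to the original rows, then take medians while keeping Date and TimeUTC
result_b <- df1 |>
  semi_join(keep_ids, by = "ID") |>
  summarize(across(V1:V4, median, .names = "{.col}.med"),
            .by = c(ID, Date, TimeUTC))

print(result_b)
```

Keeping the rule in a separate `filter()` on `flag_summary` means a different policy, such as ignoring one flag type, only changes that one line.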

huangapple
  • Posted on 2023-06-16 02:14:41
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76484455.html