Manipulate dataframe and summarise

Question

A possible approach: first group the data frame by the ID column and assess the flags within each ID group. Then, depending on which flags are set, either remove the whole ID group or remove only the unimportant records. Finally, compute the median for each ID group.

Here is one way to do this with dplyr:

    library(dplyr)

    # Create the data frame
    ID <- c("KMT1", "KMT1", "KMT1", "KMT2", "KMT2", "KMT2", "KMT3", "KMT3", "KMT3")
    Date <- c("01-03-2015", "01-03-2015", "01-03-2015",
              "04-06-2014", "04-06-2014", "04-06-2014",
              "07-01-2019", "07-01-2019", "07-01-2019")
    TimeUTC <- c("10:22:05", "10:22:05", "10:22:05",
                 "10:25:05", "10:25:05", "10:25:05",
                 "10:23:05", "10:23:05", "10:23:05")
    V1 <- c(0.01, 0.003, 0.04, 0.03, 0.02, 0.05, 0.03, 0.1, 0.02)
    V2 <- c(0.02, 0.002, 0.02, 0.003, 0.002, 0.09, 0.01, 0.05, 0.023)
    V3 <- c(0.04, 0.008, 0.06, 0.09, 0.004, 0.05, 0.01, 0.003, 0.04)
    V4 <- c(0.08, 0.009, 0.08, 0.09, 0.004, 0.05, 0.05, 0.03, 0.1)
    Flag1 <- c(0, 0, 0, 1, 0, 1, 0, 0, 1)
    Flag2 <- c(0, 0, 0, 0, 0, 1, 1, 0, 0)
    Flag3 <- c(0, 0, 0, 0, 0, 0, 1, 1, 0)
    df1 <- data.frame(ID, Date, TimeUTC, V1, V2, V3, V4, Flag1, Flag2, Flag3)

    # Find the IDs that have no active flags
    df_filtered <- df1 %>%
      group_by(ID) %>%
      # Sum each flag within the ID group
      summarise(Flag1_sum = sum(Flag1),
                Flag2_sum = sum(Flag2),
                Flag3_sum = sum(Flag3)) %>%
      # Keep only the groups in which no flag is set
      filter(Flag1_sum == 0 & Flag2_sum == 0 & Flag3_sum == 0) %>%
      # Keep only the ID column
      select(ID)

    # Use the retained IDs to filter the original data
    filtered_data <- df1 %>%
      semi_join(df_filtered, by = "ID")

    # Compute the median of each V column per ID
    median_data <- filtered_data %>%
      group_by(ID) %>%
      summarise(across(starts_with("V"), median))

    # Print the result
    print(median_data)

This code first sums the flags for each ID group and then drops every ID group in which any flag is set. It then computes the per-ID median of each V column and prints the result. With the example data only KMT1 remains, and the Date and TimeUTC columns are not carried through to the output.
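The steps described above can also be written as a single grouped filter, without the intermediate join. The following is only a sketch under a few assumptions: the name `clean_medians` is made up for illustration, and grouping by `Date` and `TimeUTC` as well as `ID` relies on those two columns being constant within each ID, as they are in the example data. Grouping on them keeps both columns in the summarised result.

```r
library(dplyr)

# Same example data as in the question.
df1 <- data.frame(
  ID      = rep(c("KMT1", "KMT2", "KMT3"), each = 3),
  Date    = rep(c("01-03-2015", "04-06-2014", "07-01-2019"), each = 3),
  TimeUTC = rep(c("10:22:05", "10:25:05", "10:23:05"), each = 3),
  V1 = c(0.01, 0.003, 0.04, 0.03, 0.02, 0.05, 0.03, 0.1, 0.02),
  V2 = c(0.02, 0.002, 0.02, 0.003, 0.002, 0.09, 0.01, 0.05, 0.023),
  V3 = c(0.04, 0.008, 0.06, 0.09, 0.004, 0.05, 0.01, 0.003, 0.04),
  V4 = c(0.08, 0.009, 0.08, 0.09, 0.004, 0.05, 0.05, 0.03, 0.1),
  Flag1 = c(0, 0, 0, 1, 0, 1, 0, 0, 1),
  Flag2 = c(0, 0, 0, 0, 0, 1, 1, 0, 0),
  Flag3 = c(0, 0, 0, 0, 0, 0, 1, 1, 0)
)

# clean_medians is a hypothetical name used only in this sketch.
clean_medians <- df1 %>%
  # Assumption: Date and TimeUTC do not vary within an ID.
  group_by(ID, Date, TimeUTC) %>%
  # Keep a group only if none of its rows has any flag set.
  filter(sum(Flag1, Flag2, Flag3) == 0) %>%
  # One row per remaining group, with the median of each V column.
  summarise(across(V1:V4, median), .groups = "drop")

print(clean_medians)
```

If Date or TimeUTC can differ between rows of the same ID, this grouping would split the ID into several groups, so the join-based version is the safer choice in that case.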

Original question:

I have a large data frame that I need to filter and then summarise. The df looks similar to this:

    ID <- c("KMT1", "KMT1", "KMT1", "KMT2", "KMT2", "KMT2", "KMT3", "KMT3", "KMT3")
    Date <- c("01-03-2015", "01-03-2015", "01-03-2015",
              "04-06-2014", "04-06-2014", "04-06-2014",
              "07-01-2019", "07-01-2019", "07-01-2019")
    TimeUTC <- c("10:22:05", "10:22:05", "10:22:05",
                 "10:25:05", "10:25:05", "10:25:05",
                 "10:23:05", "10:23:05", "10:23:05")
    V1 <- c(0.01, 0.003, 0.04, 0.03, 0.02, 0.05, 0.03, 0.1, 0.02)
    V2 <- c(0.02, 0.002, 0.02, 0.003, 0.002, 0.09, 0.01, 0.05, 0.023)
    V3 <- c(0.04, 0.008, 0.06, 0.09, 0.004, 0.05, 0.01, 0.003, 0.04)
    V4 <- c(0.08, 0.009, 0.08, 0.09, 0.004, 0.05, 0.05, 0.03, 0.1)
    Flag1 <- c(0, 0, 0, 1, 0, 1, 0, 0, 1)
    Flag2 <- c(0, 0, 0, 0, 0, 1, 1, 0, 0)
    Flag3 <- c(0, 0, 0, 0, 0, 0, 1, 1, 0)
    df1 <- data.frame(ID, Date, TimeUTC, V1, V2, V3, V4, Flag1, Flag2, Flag3)
    df1

        ID       Date  TimeUTC    V1    V2    V3    V4 Flag1 Flag2 Flag3
    1 KMT1 01-03-2015 10:22:05 0.010 0.020 0.040 0.080     0     0     0
    2 KMT1 01-03-2015 10:22:05 0.003 0.002 0.008 0.009     0     0     0
    3 KMT1 01-03-2015 10:22:05 0.040 0.020 0.060 0.080     0     0     0
    4 KMT2 04-06-2014 10:25:05 0.030 0.003 0.090 0.090     1     0     0
    5 KMT2 04-06-2014 10:25:05 0.020 0.002 0.004 0.004     0     0     0
    6 KMT2 04-06-2014 10:25:05 0.050 0.090 0.050 0.050     1     1     0
    7 KMT3 07-01-2019 10:23:05 0.030 0.010 0.010 0.050     0     1     1
    8 KMT3 07-01-2019 10:23:05 0.100 0.050 0.003 0.030     0     0     1
    9 KMT3 07-01-2019 10:23:05 0.020 0.023 0.040 0.100     1     0     0

I would like to filter the df by the ID column so that I can assess whether any flags (0/1) are active for an ID, how many, and which ones. I then need to remove the whole ID group if a flag is active, or remove just one of the entries if the flag is not important, e.g. when only Flag3 is active.

After filtering the data I would like to calculate the median of each V column per ID, so that the df has one row per remaining ID. The desired output would look something like this:

(A) Removed ID(KMT3) as flag 3 was active

        ID       Date  TimeUTC V1.med V2.med V3.med V4.med
    1 KMT1 01-03-2015 10:22:05  0.010  0.020  0.040  0.080
    2 KMT2 04-06-2014 10:25:05  0.025  0.025  0.047  0.047

(B) Removed ID(KMT2 and KMT3) as flags are active

        ID       Date  TimeUTC V1.med V2.med V3.med V4.med
    1 KMT1 01-03-2015 10:22:05   0.01   0.02   0.04   0.08

I am new to R and not sure of the best way to approach this problem. I have tried filter() from the dplyr package, which can remove all rows that contain an active flag (e.g. Flag1 == 1). However, I need to assess which flags are active, and how many, before I can remove the ID.

I have managed to calculate the median of each column using:

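    library(dplyr)
    library(data.table)

    # df2 is not defined in the question; presumably it is the filtered data.
    DT1 <- df2 %>%
      select("ID", V1:V4) %>%
      group_by(ID) %>%
      data.table() %>%
      na.omit()

    # Median of every remaining column per ID, with ".median" appended to the names
    setnames(DT1[, sapply(.SD, function(x) list(median(x))), by = ID],
             c("ID", sapply(names(DT1)[-1], paste0, c(".median"))))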

However, doing this I lose the Date and TimeUTC columns, which are important for further processing.

Any help would be greatly appreciated.

Answer 1

Score: 0


Here's some code that gets you 1 row per ID, including the total number of flags, the number of flags of each type, and the median of the V* columns:

    library(dplyr)

    # Note: the .by argument requires dplyr >= 1.1.0 and |> requires R >= 4.1
    df1 |>
      summarize(
        across(starts_with("Flag"), sum, .names = "n_{.col}"),
        across(V1:V4, median, .names = "med_{.col}"),
        .by = c(ID, Date, TimeUTC)
      ) |>
      mutate(
        n_total_flags = rowSums(across(starts_with("n_")))
      )
    #     ID       Date  TimeUTC n_Flag1 n_Flag2 n_Flag3 med_V1 med_V2 med_V3 med_V4 n_total_flags
    # 1 KMT1 01-03-2015 10:22:05       0       0       0   0.01  0.020   0.04   0.08             0
    # 2 KMT2 04-06-2014 10:25:05       2       1       0   0.03  0.003   0.05   0.05             3
    # 3 KMT3 07-01-2019 10:23:05       1       1       2   0.03  0.023   0.01   0.05             4

You could use this result to do whatever logic you need to and join back to the original data to do filtering.
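As a rough illustration of that last step, the sketch below joins a per-ID flag summary back to the original rows and then recomputes the medians. The filtering rules are assumptions, not something stated in the thread: here an ID is dropped entirely when Flag3 is active, and within the remaining IDs any row with Flag2 set is dropped. The names `flag_summary`, `ids_to_keep` and `result` are made up for this example. It relies on `.by`, so it assumes dplyr 1.1.0 or later.

```r
library(dplyr)

# Same example data as in the question.
df1 <- data.frame(
  ID      = rep(c("KMT1", "KMT2", "KMT3"), each = 3),
  Date    = rep(c("01-03-2015", "04-06-2014", "07-01-2019"), each = 3),
  TimeUTC = rep(c("10:22:05", "10:25:05", "10:23:05"), each = 3),
  V1 = c(0.01, 0.003, 0.04, 0.03, 0.02, 0.05, 0.03, 0.1, 0.02),
  V2 = c(0.02, 0.002, 0.02, 0.003, 0.002, 0.09, 0.01, 0.05, 0.023),
  V3 = c(0.04, 0.008, 0.06, 0.09, 0.004, 0.05, 0.01, 0.003, 0.04),
  V4 = c(0.08, 0.009, 0.08, 0.09, 0.004, 0.05, 0.05, 0.03, 0.1),
  Flag1 = c(0, 0, 0, 1, 0, 1, 0, 0, 1),
  Flag2 = c(0, 0, 0, 0, 0, 1, 1, 0, 0),
  Flag3 = c(0, 0, 0, 0, 0, 0, 1, 1, 0)
)

# One row per ID with the number of active flags of each type.
flag_summary <- df1 |>
  summarize(
    across(starts_with("Flag"), sum, .names = "n_{.col}"),
    .by = ID
  )

# Assumed rule 1: drop an ID entirely if Flag3 is active on any of its rows.
ids_to_keep <- flag_summary |>
  filter(n_Flag3 == 0) |>
  select(ID)

result <- df1 |>
  # Join back to the original data, keeping only rows of the retained IDs.
  semi_join(ids_to_keep, by = "ID") |>
  # Assumed rule 2: within retained IDs, drop individual rows with Flag2 set.
  filter(Flag2 == 0) |>
  # Grouping by Date and TimeUTC as well keeps those columns in the output.
  summarize(
    across(V1:V4, median, .names = "{.col}.med"),
    .by = c(ID, Date, TimeUTC)
  )

print(result)
```

With the example data this leaves one row each for KMT1 and KMT2. Changing the two `filter()` conditions is enough to express other rules, for instance `n_Flag1 + n_Flag2 + n_Flag3 == 0` on `flag_summary` to drop every ID that has any active flag.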

huangapple
  • Published on 16 June 2023, 02:14:41
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76484455.html