使用R进行自定义顺序的数据排序

huangapple go评论96阅读模式
英文:

Sorting Data with custom sequence in R

问题

我有一个问题,需要按文件名对数据库中的数据进行排序。
首先,我有一个包含20个netcdf数据的文件列表。这是数据列表:

  1. file_all <- list.files(pattern=glob2rx("*.nc"))
  2. > file_all
  3. [1] "bs-cnrmaladin-djf-tas-cru.nc" "bs-cnrmaladin-jja-tas-cru.nc"
  4. [3] "bs-cnrmaladin-mam-tas-cru.nc" "bs-cnrmaladin-son-tas-cru.nc"
  5. [5] "bs-ecearthcclm-djf-tas-cru.nc" "bs-ecearthcclm-jja-tas-cru.nc"
  6. [7] "bs-ecearthcclm-mam-tas-cru.nc" "bs-ecearthcclm-son-tas-cru.nc"
  7. [9] "bs-hadgemhirham-djf-tas-cru.nc" "bs-hadgemhirham-jja-tas-cru.nc"
  8. [11] "bs-hadgemhirham-mam-tas-cru.nc" "bs-hadgemhirham-son-tas-cru.nc"
  9. [13] "bs-mpiracmo-djf-tas-cru.nc" "bs-mpiracmo-jja-tas-cru.nc"
  10. [15] "bs-mpiracmo-mam-tas-cru.nc" "bs-mpiracmo-son-tas-cru.nc"
  11. [17] "bs-noresmremo-djf-tas-cru.nc" "bs-noresmremo-jja-tas-cru.nc"
  12. [19] "bs-noresmremo-mam-tas-cru.nc" "bs-noresmremo-son-tas-cru.nc"

我需要首先对它进行排序,因为我需要顺序的数据。我想要的顺序是:

  1. djf -> mam -> jja -> son

这是我期望的输出顺序:

  1. [1] "bs-cnrmaladin-djf-tas-cru.nc" "bs-cnrmaladin-mam-tas-cru.nc"
  2. [3] "bs-cnrmaladin-jja-tas-cru.nc" "bs-cnrmaladin-son-tas-cru.nc"
  3. [5] "bs-ecearthcclm-djf-tas-cru.nc" "bs-ecearthcclm-mam-tas-cru.nc"
  4. [7] "bs-ecearthcclm-jja-tas-cru.nc" "bs-ecearthcclm-son-tas-cru.nc"
  5. [9] "bs-hadgemhirham-djf-tas-cru.nc" "bs-hadgemhirham-mam-tas-cru.nc"
  6. [11] "bs-hadgemhirham-jja-tas-cru.nc" "bs-hadgemhirham-son-tas-cru.nc"
  7. [13] "bs-mpiracmo-djf-tas-cru.nc" "bs-mpiracmo-mam-tas-cru.nc"
  8. [15] "bs-mpiracmo-jja-tas-cru.nc" "bs-mpiracmo-son-tas-cru.nc"
  9. [17] "bs-noresmremo-djf-tas-cru.nc" "bs-noresmremo-mam-tas-cru.nc"
  10. [19] "bs-noresmremo-jja-tas-cru.nc" "bs-noresmremo-son-tas-cru.nc"

如果有人知道如何按照这个顺序排序,请帮助我。
提前感谢。

英文:

I have a problem with sorting data base on the name of the files.
First I have list file containing 20 netcdf data. This is the data list :

  1. file_all <- list.files(pattern=glob2rx("*.nc"))
  2. > file_all
  3. [1] "bs-cnrmaladin-djf-tas-cru.nc" "bs-cnrmaladin-jja-tas-cru.nc"
  4. [3] "bs-cnrmaladin-mam-tas-cru.nc" "bs-cnrmaladin-son-tas-cru.nc"
  5. [5] "bs-ecearthcclm-djf-tas-cru.nc" "bs-ecearthcclm-jja-tas-cru.nc"
  6. [7] "bs-ecearthcclm-mam-tas-cru.nc" "bs-ecearthcclm-son-tas-cru.nc"
  7. [9] "bs-hadgemhirham-djf-tas-cru.nc" "bs-hadgemhirham-jja-tas-cru.nc"
  8. [11] "bs-hadgemhirham-mam-tas-cru.nc" "bs-hadgemhirham-son-tas-cru.nc"
  9. [13] "bs-mpiracmo-djf-tas-cru.nc" "bs-mpiracmo-jja-tas-cru.nc"
  10. [15] "bs-mpiracmo-mam-tas-cru.nc" "bs-mpiracmo-son-tas-cru.nc"
  11. [17] "bs-noresmremo-djf-tas-cru.nc" "bs-noresmremo-jja-tas-cru.nc"
  12. [19] "bs-noresmremo-mam-tas-cru.nc" "bs-noresmremo-son-tas-cru.nc"

I need to sort it first because I need sequential data. The order that I want is:

  1. djf -> mam -> jja -> son

This is my expected output order:

  1. [1] "bs-cnrmaladin-djf-tas-cru.nc" "bs-cnrmaladin-mam-tas-cru.nc"
  2. [3] "bs-cnrmaladin-jja-tas-cru.nc" "bs-cnrmaladin-son-tas-cru.nc"
  3. [5] "bs-ecearthcclm-djf-tas-cru.nc" "bs-ecearthcclm-mam-tas-cru.nc"
  4. [7] "bs-ecearthcclm-jja-tas-cru.nc" "bs-ecearthcclm-son-tas-cru.nc"
  5. [9] "bs-hadgemhirham-djf-tas-cru.nc" "bs-hadgemhirham-mam-tas-cru.nc"
  6. [11] "bs-hadgemhirham-jja-tas-cru.nc" "bs-hadgemhirham-son-tas-cru.nc"
  7. [13] "bs-mpiracmo-djf-tas-cru.nc" "bs-mpiracmo-mam-tas-cru.nc"
  8. [15] "bs-mpiracmo-jja-tas-cru.nc" "bs-mpiracmo-son-tas-cru.nc"
  9. [17] "bs-noresmremo-djf-tas-cru.nc" "bs-noresmremo-mam-tas-cru.nc"
  10. [19] "bs-noresmremo-jja-tas-cru.nc" "bs-noresmremo-son-tas-cru.nc"

If anyone know how to sort with this sequence please help me.
Thanks in advance

答案1

得分: 2

以下是您要翻译的代码部分:

你可以提取字符串的这一部分并与你的顺序进行匹配,即

  1. my_order <- c("djf", "mam", "jja", "son")
  2. i1 <- gsub(".*-(.*)-tas.*", "\", file_all)
  3. i2 <- gsub("bs-([^-]*)-.*", "\", file_all)
  4. file_all[order(i2, match(i1, my_order))]

一些替代方法包括:

使用 strsplit。这可以提高效率,因为它不使用正则表达式。

  1. split_files <- strsplit(strings, "-")
  2. i3 <- sapply(split_files, "[[", 3)
  3. i4 <- sapply(split_files, "[[", 2)
  4. file_all[order(i4, match(i3, my_order))]
英文:

You can extract that part of the string and match it with your order, i.e.

  1. my_order &lt;- c(&quot;djf&quot;, &quot;mam&quot;, &quot;jja&quot;, &quot;son&quot;)
  2. i1 &lt;- gsub(&quot;.*-(.*)-tas.*&quot;, &quot;\&quot;, file_all)
  3. i2 &lt;- gsub(&quot;bs-([^-]*)-.*&quot;, &quot;\&quot;, file_all)
  4. file_all[order(i2, match(i1, my_order))]
  5. [1] &quot;bs-cnrmaladin-djf-tas-cru.nc&quot; &quot;bs-cnrmaladin-mam-tas-cru.nc&quot; &quot;bs-cnrmaladin-jja-tas-cru.nc&quot; &quot;bs-cnrmaladin-son-tas-cru.nc&quot;
  6. [5] &quot;bs-ecearthcclm-djf-tas-cru.nc&quot; &quot;bs-ecearthcclm-mam-tas-cru.nc&quot; &quot;bs-ecearthcclm-jja-tas-cru.nc&quot; &quot;bs-ecearthcclm-son-tas-cru.nc&quot;
  7. [9] &quot;bs-hadgemhirham-djf-tas-cru.nc&quot; &quot;bs-hadgemhirham-mam-tas-cru.nc&quot; &quot;bs-hadgemhirham-jja-tas-cru.nc&quot; &quot;bs-hadgemhirham-son-tas-cru.nc&quot;
  8. [13] &quot;bs-mpiracmo-djf-tas-cru.nc&quot; &quot;bs-mpiracmo-mam-tas-cru.nc&quot; &quot;bs-mpiracmo-jja-tas-cru.nc&quot; &quot;bs-mpiracmo-son-tas-cru.nc&quot;
  9. [17] &quot;bs-noresmremo-djf-tas-cru.nc&quot; &quot;bs-noresmremo-mam-tas-cru.nc&quot; &quot;bs-noresmremo-jja-tas-cru.nc&quot; &quot;bs-noresmremo-son-tas-cru.nc&quot;

A few alternatives would be:

To use strsplit. This could improve efficiency as it doesn't use regular expressions.

  1. split_files &lt;- strsplit(strings, &quot;-&quot;)
  2. i3 &lt;- sapply(split_files, &quot;[[&quot;, 3)
  3. i4 &lt;- sapply(split_files, &quot;[[&quot;, 2)
  4. file_all[order(i4, match(i3, my_order))]

答案2

得分: 1

A little bit verbose, but you could use a tidyverse based solution:

  1. library(dplyr)
  2. library(tidyr)
  3. file_all %>%
  4. data.frame(model = .) %>%
  5. separate_wider_delim(model, delim = "-", names_sep = "", cols_remove = FALSE) %>%
  6. mutate(model3 = match(model3, my_order)) %>%
  7. arrange(model2, model3) %>%
  8. pull(modelmodel)

This returns:

  1. [1] "bs-cnrmaladin-djf-tas-cru.nc" "bs-cnrmaladin-mam-tas-cru.nc" "bs-cnrmaladin-jja-tas-cru.nc"
  2. [4] "bs-cnrmaladin-son-tas-cru.nc" "bs-ecearthcclm-djf-tas-cru.nc" "bs-ecearthcclm-mam-tas-cru.nc"
  3. [7] "bs-ecearthcclm-jja-tas-cru.nc" "bs-ecearthcclm-son-tas-cru.nc" "bs-hadgemhirham-djf-tas-cru.nc"
  4. [10] "bs-hadgemhirham-mam-tas-cru.nc" "bs-hadgemhirham-jja-tas-cru.nc" "bs-hadgemhirham-son-tas-cru.nc"
  5. [13] "bs-mpiracmo-djf-tas-cru.nc" "bs-mpiracmo-mam-tas-cru.nc" "bs-mpiracmo-jja-tas-cru.nc"
  6. [16] "bs-mpiracmo-son-tas-cru.nc" "bs-noresmremo-djf-tas-cru.nc" "bs-noresmremo-mam-tas-cru.nc"
  7. [19] "bs-noresmremo-jja-tas-cru.nc" "bs-noresmremo-son-tas-cru.nc"

Data:

  1. file_all <- c("bs-cnrmaladin-djf-tas-cru.nc", "bs-cnrmaladin-jja-tas-cru.nc",
  2. "bs-cnrmaladin-mam-tas-cru.nc", "bs-cnrmaladin-son-tas-cru.nc",
  3. "bs-ecearthcclm-djf-tas-cru.nc", "bs-ecearthcclm-jja-tas-cru.nc",
  4. "bs-ecearthcclm-mam-tas-cru.nc", "bs-ecearthcclm-son-tas-cru.nc",
  5. "bs-hadgemhirham-djf-tas-cru.nc", "bs-hadgemhirham-jja-tas-cru.nc",
  6. "bs-hadgemhirham-mam-tas-cru.nc", "bs-hadgemhirham-son-tas-cru.nc",
  7. "bs-mpiracmo-djf-tas-cru.nc", "bs-mpiracmo-jja-tas-cru.nc",
  8. "bs-mpiracmo-mam-tas-cru.nc", "bs-mpiracmo-son-tas-cru.nc",
  9. "bs-noresmremo-djf-tas-cru.nc", "bs-noresmremo-jja-tas-cru.nc",
  10. "bs-noresmremo-mam-tas-cru.nc", "bs-noresmremo-son-tas-cru.nc")
  11. my_order <- c("djf", "mam", "jja", "son")

Benchmark:

  1. library(microbenchmark)
  2. my_order <- c("djf", "mam", "jja", "son")
  3. my_files_2 <- sample(file_all, 100000, replace = TRUE)
  4. microbenchmark(
  5. sotos = my_files_2[order(match(gsub(".*-(.*)-tas.*", "\", my_files_2), my_order))],
  6. sotos_2 = { split_files <- strsplit(my_files_2, "-")
  7. i3 <- unlist(lapply(split_files, "[[", 3))
  8. i4 <- unlist(lapply(split_files, "[[", 2))
  9. my_files_2[order(i4, match(i3, my_order))] },
  10. sotos_3 = { split_files <- strsplit(my_files_2, "-", fixed = TRUE)
  11. i3 <- unlist(lapply(split_files, "[[", 3))
  12. i4 <- unlist(lapply(split_files, "[[", 2))
  13. my_files_2[order(i4, match(i3, my_order))] },
  14. alt = my_files_2[unlist(lapply(my_order, \(n) which(grepl(n, my_files_2))))],
  15. tidyverse = my_files_2 %>%
  16. data.frame(model = .) %>%
  17. separate_wider_delim(model, delim = "-", names_sep = "", cols_remove = FALSE) %>%
  18. mutate(model3 = match(model3, my_order)) %>%
  19. arrange(model2, model3) %>%
  20. pull(modelmodel)
  21. )

Benchmark results based on the size of my_list_2 are also provided.

英文:

A little bit verbose, but you could use a tidyverse based solution:

  1. library(dplyr)
  2. library(tidyr)
  3. file_all %&gt;%
  4. data.frame(model = .) %&gt;%
  5. separate_wider_delim(model, delim = &quot;-&quot;, names_sep = &quot;&quot;, cols_remove = FALSE) %&gt;%
  6. mutate(model3 = match(model3, my_order)) %&gt;%
  7. arrange(model2, model3) %&gt;%
  8. pull(modelmodel)

This returns

  1. [1] &quot;bs-cnrmaladin-djf-tas-cru.nc&quot; &quot;bs-cnrmaladin-mam-tas-cru.nc&quot; &quot;bs-cnrmaladin-jja-tas-cru.nc&quot;
  2. [4] &quot;bs-cnrmaladin-son-tas-cru.nc&quot; &quot;bs-ecearthcclm-djf-tas-cru.nc&quot; &quot;bs-ecearthcclm-mam-tas-cru.nc&quot;
  3. [7] &quot;bs-ecearthcclm-jja-tas-cru.nc&quot; &quot;bs-ecearthcclm-son-tas-cru.nc&quot; &quot;bs-hadgemhirham-djf-tas-cru.nc&quot;
  4. [10] &quot;bs-hadgemhirham-mam-tas-cru.nc&quot; &quot;bs-hadgemhirham-jja-tas-cru.nc&quot; &quot;bs-hadgemhirham-son-tas-cru.nc&quot;
  5. [13] &quot;bs-mpiracmo-djf-tas-cru.nc&quot; &quot;bs-mpiracmo-mam-tas-cru.nc&quot; &quot;bs-mpiracmo-jja-tas-cru.nc&quot;
  6. [16] &quot;bs-mpiracmo-son-tas-cru.nc&quot; &quot;bs-noresmremo-djf-tas-cru.nc&quot; &quot;bs-noresmremo-mam-tas-cru.nc&quot;
  7. [19] &quot;bs-noresmremo-jja-tas-cru.nc&quot; &quot;bs-noresmremo-son-tas-cru.nc&quot;

Data

  1. file_all &lt;- c(&quot;bs-cnrmaladin-djf-tas-cru.nc&quot; ,&quot;bs-cnrmaladin-jja-tas-cru.nc&quot; ,
  2. &quot;bs-cnrmaladin-mam-tas-cru.nc&quot; ,&quot;bs-cnrmaladin-son-tas-cru.nc&quot; ,
  3. &quot;bs-ecearthcclm-djf-tas-cru.nc&quot; ,&quot;bs-ecearthcclm-jja-tas-cru.nc&quot; ,
  4. &quot;bs-ecearthcclm-mam-tas-cru.nc&quot; ,&quot;bs-ecearthcclm-son-tas-cru.nc&quot; ,
  5. &quot;bs-hadgemhirham-djf-tas-cru.nc&quot; ,&quot;bs-hadgemhirham-jja-tas-cru.nc&quot;,
  6. &quot;bs-hadgemhirham-mam-tas-cru.nc&quot; ,&quot;bs-hadgemhirham-son-tas-cru.nc&quot;,
  7. &quot;bs-mpiracmo-djf-tas-cru.nc&quot; ,&quot;bs-mpiracmo-jja-tas-cru.nc&quot; ,
  8. &quot;bs-mpiracmo-mam-tas-cru.nc&quot; ,&quot;bs-mpiracmo-son-tas-cru.nc&quot; ,
  9. &quot;bs-noresmremo-djf-tas-cru.nc&quot; ,&quot;bs-noresmremo-jja-tas-cru.nc&quot; ,
  10. &quot;bs-noresmremo-mam-tas-cru.nc&quot; ,&quot;bs-noresmremo-son-tas-cru.nc&quot; )
  11. my_order &lt;- c(&quot;djf&quot;, &quot;mam&quot;, &quot;jja&quot;, &quot;son&quot;)

Benchmark

Since I was curious I did a small benchmark of @Sotos answer vs. a variation of this answer and the tidyverse based solution:

  1. library(microbenchmark)
  2. my_order &lt;- c(&quot;djf&quot;, &quot;mam&quot;, &quot;jja&quot;, &quot;son&quot;)
  3. my_files_2 &lt;- sample(file_all, 100000, replace = TRUE)
  4. microbenchmark(
  5. sotos = my_files_2[order(match(gsub(&quot;.*-(.*)-tas.*&quot;, &quot;\&quot;, my_files_2), my_order))],
  6. sotos_2 = { split_files &lt;- strsplit(my_files_2, &quot;-&quot;)
  7. i3 &lt;- unlist(lapply(split_files, &quot;[[&quot;, 3))
  8. i4 &lt;- unlist(lapply(split_files, &quot;[[&quot;, 2))
  9. my_files_2[order(i4, match(i3, my_order))] },
  10. sotos_3 = { split_files &lt;- strsplit(my_files_2, &quot;-&quot;, fixed = TRUE)
  11. i3 &lt;- unlist(lapply(split_files, &quot;[[&quot;, 3))
  12. i4 &lt;- unlist(lapply(split_files, &quot;[[&quot;, 2))
  13. my_files_2[order(i4, match(i3, my_order))] },
  14. alt = my_files_2[unlist(lapply(my_order, \(n) which(grepl(n, my_files_2))))],
  15. tidyverse = my_files_2 %&gt;%
  16. data.frame(model = .) %&gt;%
  17. separate_wider_delim(model, delim = &quot;-&quot;, names_sep = &quot;&quot;, cols_remove = FALSE) %&gt;%
  18. mutate(model3 = match(model3, my_order)) %&gt;%
  19. arrange(model2, model3) %&gt;%
  20. pull(modelmodel)
  21. )

The result surprised me quite a bit

  1. #&gt; Unit: milliseconds
  2. #&gt; expr min lq mean median uq max neval
  3. #&gt; sotos 126.5779 127.5660 129.03582 128.17515 128.83190 199.8997 100
  4. #&gt; sotos_2 403.7318 412.5335 430.80871 419.50455 433.35235 511.1252 100
  5. #&gt; sotos_3 220.1570 228.3046 244.84893 235.50230 246.01510 314.8171 100
  6. #&gt; alt 113.9459 114.7081 115.86338 115.50140 116.37555 121.4351 100
  7. #&gt; tidyverse 75.5209 83.2439 93.82626 85.46735 92.48615 183.7602 100

Here are a few benchmark results based on the size of my_list_2:

  1. my_files_2 &lt;- sample(file_all, 100, replace = TRUE)
  2. #&gt; Unit: microseconds
  3. #&gt; expr min lq mean median uq max neval
  4. #&gt; sotos 155.6 162.95 175.719 173.70 184.05 263.3 100
  5. #&gt; sotos_2 399.6 416.35 446.641 427.35 448.80 705.4 100
  6. #&gt; sotos_3 217.6 229.60 278.047 237.20 252.25 3846.8 100
  7. #&gt; alt 139.6 150.00 169.377 154.75 160.30 1452.0 100
  8. #&gt; tidyverse 7912.6 8265.80 8927.205 8529.85 8987.45 14056.3 100
  9. my_files_2 &lt;- sample(file_all, 1000, replace = TRUE)
  10. #&gt; Unit: milliseconds
  11. #&gt; expr min lq mean median uq max neval
  12. #&gt; sotos 1.2786 1.31485 1.368087 1.33890 1.37390 1.9579 100
  13. #&gt; sotos_2 3.6715 3.77725 3.849989 3.83145 3.90755 4.3005 100
  14. #&gt; sotos_3 1.8969 1.94865 2.004843 1.97525 2.01915 2.4871 100
  15. #&gt; alt 1.1500 1.18695 1.229627 1.20130 1.22640 2.4962 100
  16. #&gt; tidyverse 8.6329 8.99985 9.710873 9.21335 9.63085 14.6684 100
  17. my_files_2 &lt;- sample(file_all, 10000, replace = TRUE)
  18. #&gt; Unit: milliseconds
  19. #&gt; expr min lq mean median uq max neval
  20. #&gt; sotos 12.5224 12.81450 12.92100 12.89505 13.00510 13.5006 100
  21. #&gt; sotos_2 37.5119 38.09420 38.98338 38.34025 38.76710 51.7555 100
  22. #&gt; sotos_3 19.3073 19.70905 20.63162 19.88925 20.29015 25.1160 100
  23. #&gt; alt 11.2854 11.54560 11.74392 11.63175 11.72850 15.7317 100
  24. #&gt; tidyverse 14.4663 15.03035 16.42199 15.27875 15.49390 83.9643 100

At a sample size of ~20000 the tidyverse solution seems to take the lead.

huangapple
  • 本文由 发表于 2023年4月6日 20:02:59
  • 转载请务必保留本文链接:https://go.coder-hub.com/75949307.html
匿名

发表评论

匿名网友

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen:

确定