使用R进行自定义顺序的数据排序

huangapple go评论60阅读模式
英文:

Sorting Data with custom sequence in R

问题

我有一个问题,需要按文件名对数据库中的数据进行排序。
首先,我有一个包含20个netcdf数据的文件列表。这是数据列表:

file_all <- list.files(pattern=glob2rx("*.nc"))
> file_all
 [1] "bs-cnrmaladin-djf-tas-cru.nc"   "bs-cnrmaladin-jja-tas-cru.nc"  
 [3] "bs-cnrmaladin-mam-tas-cru.nc"   "bs-cnrmaladin-son-tas-cru.nc"  
 [5] "bs-ecearthcclm-djf-tas-cru.nc"  "bs-ecearthcclm-jja-tas-cru.nc" 
 [7] "bs-ecearthcclm-mam-tas-cru.nc"  "bs-ecearthcclm-son-tas-cru.nc" 
 [9] "bs-hadgemhirham-djf-tas-cru.nc" "bs-hadgemhirham-jja-tas-cru.nc"
[11] "bs-hadgemhirham-mam-tas-cru.nc" "bs-hadgemhirham-son-tas-cru.nc"
[13] "bs-mpiracmo-djf-tas-cru.nc"     "bs-mpiracmo-jja-tas-cru.nc"    
[15] "bs-mpiracmo-mam-tas-cru.nc"     "bs-mpiracmo-son-tas-cru.nc"    
[17] "bs-noresmremo-djf-tas-cru.nc"   "bs-noresmremo-jja-tas-cru.nc"  
[19] "bs-noresmremo-mam-tas-cru.nc"   "bs-noresmremo-son-tas-cru.nc" 

我需要首先对它进行排序,因为我需要顺序的数据。我想要的顺序是:

djf -> mam -> jja -> son

这是我期望的输出顺序:

[1] "bs-cnrmaladin-djf-tas-cru.nc"   "bs-cnrmaladin-mam-tas-cru.nc"  
[3] "bs-cnrmaladin-jja-tas-cru.nc"   "bs-cnrmaladin-son-tas-cru.nc"  
[5] "bs-ecearthcclm-djf-tas-cru.nc"  "bs-ecearthcclm-mam-tas-cru.nc" 
[7] "bs-ecearthcclm-jja-tas-cru.nc"  "bs-ecearthcclm-son-tas-cru.nc" 
[9] "bs-hadgemhirham-djf-tas-cru.nc" "bs-hadgemhirham-mam-tas-cru.nc"
[11] "bs-hadgemhirham-jja-tas-cru.nc" "bs-hadgemhirham-son-tas-cru.nc"
[13] "bs-mpiracmo-djf-tas-cru.nc"     "bs-mpiracmo-mam-tas-cru.nc"    
[15] "bs-mpiracmo-jja-tas-cru.nc"     "bs-mpiracmo-son-tas-cru.nc"    
[17] "bs-noresmremo-djf-tas-cru.nc"   "bs-noresmremo-mam-tas-cru.nc"  
[19] "bs-noresmremo-jja-tas-cru.nc"   "bs-noresmremo-son-tas-cru.nc" 

如果有人知道如何按照这个顺序排序,请帮助我。
提前感谢。

英文:

I have a problem with sorting data base on the name of the files.
First I have list file containing 20 netcdf data. This is the data list :

file_all <- list.files(pattern=glob2rx("*.nc"))
> file_all
 [1] "bs-cnrmaladin-djf-tas-cru.nc"   "bs-cnrmaladin-jja-tas-cru.nc"  
 [3] "bs-cnrmaladin-mam-tas-cru.nc"   "bs-cnrmaladin-son-tas-cru.nc"  
 [5] "bs-ecearthcclm-djf-tas-cru.nc"  "bs-ecearthcclm-jja-tas-cru.nc" 
 [7] "bs-ecearthcclm-mam-tas-cru.nc"  "bs-ecearthcclm-son-tas-cru.nc" 
 [9] "bs-hadgemhirham-djf-tas-cru.nc" "bs-hadgemhirham-jja-tas-cru.nc"
[11] "bs-hadgemhirham-mam-tas-cru.nc" "bs-hadgemhirham-son-tas-cru.nc"
[13] "bs-mpiracmo-djf-tas-cru.nc"     "bs-mpiracmo-jja-tas-cru.nc"    
[15] "bs-mpiracmo-mam-tas-cru.nc"     "bs-mpiracmo-son-tas-cru.nc"    
[17] "bs-noresmremo-djf-tas-cru.nc"   "bs-noresmremo-jja-tas-cru.nc"  
[19] "bs-noresmremo-mam-tas-cru.nc"   "bs-noresmremo-son-tas-cru.nc" 

I need to sort it first because I need sequential data. The order that I want is:

djf -> mam -> jja -> son

This is my expected output order:

[1] "bs-cnrmaladin-djf-tas-cru.nc"   "bs-cnrmaladin-mam-tas-cru.nc"  
[3] "bs-cnrmaladin-jja-tas-cru.nc"   "bs-cnrmaladin-son-tas-cru.nc"  
[5] "bs-ecearthcclm-djf-tas-cru.nc"  "bs-ecearthcclm-mam-tas-cru.nc" 
[7] "bs-ecearthcclm-jja-tas-cru.nc"  "bs-ecearthcclm-son-tas-cru.nc" 
[9] "bs-hadgemhirham-djf-tas-cru.nc" "bs-hadgemhirham-mam-tas-cru.nc"
[11] "bs-hadgemhirham-jja-tas-cru.nc" "bs-hadgemhirham-son-tas-cru.nc"
[13] "bs-mpiracmo-djf-tas-cru.nc"     "bs-mpiracmo-mam-tas-cru.nc"    
[15] "bs-mpiracmo-jja-tas-cru.nc"     "bs-mpiracmo-son-tas-cru.nc"    
[17] "bs-noresmremo-djf-tas-cru.nc"   "bs-noresmremo-mam-tas-cru.nc"  
[19] "bs-noresmremo-jja-tas-cru.nc"   "bs-noresmremo-son-tas-cru.nc" 

If anyone know how to sort with this sequence please help me.
Thanks in advance

答案1

得分: 2

以下是您要翻译的代码部分:

你可以提取字符串的这一部分并与你的顺序进行匹配,即

my_order <- c("djf", "mam", "jja", "son")
i1 <- gsub(".*-(.*)-tas.*", "\", file_all)
i2 <- gsub("bs-([^-]*)-.*", "\", file_all)

file_all[order(i2, match(i1, my_order))]

一些替代方法包括:

使用 strsplit。这可以提高效率,因为它不使用正则表达式。

split_files <- strsplit(strings, "-")
i3 <- sapply(split_files, "[[", 3)
i4 <- sapply(split_files, "[[", 2)
file_all[order(i4, match(i3, my_order))]
英文:

You can extract that part of the string and match it with your order, i.e.

my_order &lt;- c(&quot;djf&quot;, &quot;mam&quot;, &quot;jja&quot;, &quot;son&quot;)
i1 &lt;- gsub(&quot;.*-(.*)-tas.*&quot;, &quot;\&quot;, file_all)
i2 &lt;- gsub(&quot;bs-([^-]*)-.*&quot;, &quot;\&quot;, file_all)

file_all[order(i2, match(i1, my_order))]

 [1] &quot;bs-cnrmaladin-djf-tas-cru.nc&quot;   &quot;bs-cnrmaladin-mam-tas-cru.nc&quot;   &quot;bs-cnrmaladin-jja-tas-cru.nc&quot;   &quot;bs-cnrmaladin-son-tas-cru.nc&quot;  
 [5] &quot;bs-ecearthcclm-djf-tas-cru.nc&quot;  &quot;bs-ecearthcclm-mam-tas-cru.nc&quot;  &quot;bs-ecearthcclm-jja-tas-cru.nc&quot;  &quot;bs-ecearthcclm-son-tas-cru.nc&quot; 
 [9] &quot;bs-hadgemhirham-djf-tas-cru.nc&quot; &quot;bs-hadgemhirham-mam-tas-cru.nc&quot; &quot;bs-hadgemhirham-jja-tas-cru.nc&quot; &quot;bs-hadgemhirham-son-tas-cru.nc&quot;
[13] &quot;bs-mpiracmo-djf-tas-cru.nc&quot;     &quot;bs-mpiracmo-mam-tas-cru.nc&quot;     &quot;bs-mpiracmo-jja-tas-cru.nc&quot;     &quot;bs-mpiracmo-son-tas-cru.nc&quot;    
[17] &quot;bs-noresmremo-djf-tas-cru.nc&quot;   &quot;bs-noresmremo-mam-tas-cru.nc&quot;   &quot;bs-noresmremo-jja-tas-cru.nc&quot;   &quot;bs-noresmremo-son-tas-cru.nc&quot;  

A few alternatives would be:

To use strsplit. This could improve efficiency as it doesn't use regular expressions.

split_files &lt;- strsplit(strings, &quot;-&quot;)
i3 &lt;- sapply(split_files, &quot;[[&quot;, 3)
i4 &lt;- sapply(split_files, &quot;[[&quot;, 2)
file_all[order(i4, match(i3, my_order))]

答案2

得分: 1

A little bit verbose, but you could use a tidyverse based solution:

library(dplyr)
library(tidyr)

file_all %>% 
  data.frame(model = .) %>% 
  separate_wider_delim(model, delim = "-", names_sep = "", cols_remove = FALSE) %>% 
  mutate(model3 = match(model3, my_order)) %>% 
  arrange(model2, model3) %>% 
  pull(modelmodel)

This returns:

 [1] "bs-cnrmaladin-djf-tas-cru.nc"   "bs-cnrmaladin-mam-tas-cru.nc"   "bs-cnrmaladin-jja-tas-cru.nc"  
 [4] "bs-cnrmaladin-son-tas-cru.nc"   "bs-ecearthcclm-djf-tas-cru.nc"  "bs-ecearthcclm-mam-tas-cru.nc" 
 [7] "bs-ecearthcclm-jja-tas-cru.nc"  "bs-ecearthcclm-son-tas-cru.nc"  "bs-hadgemhirham-djf-tas-cru.nc"
[10] "bs-hadgemhirham-mam-tas-cru.nc" "bs-hadgemhirham-jja-tas-cru.nc" "bs-hadgemhirham-son-tas-cru.nc"
[13] "bs-mpiracmo-djf-tas-cru.nc"     "bs-mpiracmo-mam-tas-cru.nc"     "bs-mpiracmo-jja-tas-cru.nc"    
[16] "bs-mpiracmo-son-tas-cru.nc"     "bs-noresmremo-djf-tas-cru.nc"   "bs-noresmremo-mam-tas-cru.nc"  
[19] "bs-noresmremo-jja-tas-cru.nc"   "bs-noresmremo-son-tas-cru.nc"

Data:

file_all <- c("bs-cnrmaladin-djf-tas-cru.nc", "bs-cnrmaladin-jja-tas-cru.nc",
              "bs-cnrmaladin-mam-tas-cru.nc", "bs-cnrmaladin-son-tas-cru.nc",
              "bs-ecearthcclm-djf-tas-cru.nc", "bs-ecearthcclm-jja-tas-cru.nc",
              "bs-ecearthcclm-mam-tas-cru.nc", "bs-ecearthcclm-son-tas-cru.nc",
              "bs-hadgemhirham-djf-tas-cru.nc", "bs-hadgemhirham-jja-tas-cru.nc",
              "bs-hadgemhirham-mam-tas-cru.nc", "bs-hadgemhirham-son-tas-cru.nc",
              "bs-mpiracmo-djf-tas-cru.nc", "bs-mpiracmo-jja-tas-cru.nc",
              "bs-mpiracmo-mam-tas-cru.nc", "bs-mpiracmo-son-tas-cru.nc",
              "bs-noresmremo-djf-tas-cru.nc", "bs-noresmremo-jja-tas-cru.nc",
              "bs-noresmremo-mam-tas-cru.nc", "bs-noresmremo-son-tas-cru.nc")

my_order <- c("djf", "mam", "jja", "son")

Benchmark:

library(microbenchmark)

my_order <- c("djf", "mam", "jja", "son")
my_files_2 <- sample(file_all, 100000, replace = TRUE)

microbenchmark(
  sotos = my_files_2[order(match(gsub(".*-(.*)-tas.*", "\", my_files_2), my_order))],
  sotos_2 = { split_files <- strsplit(my_files_2, "-")
  i3 <- unlist(lapply(split_files, "[[", 3))
  i4 <- unlist(lapply(split_files, "[[", 2))
  my_files_2[order(i4, match(i3, my_order))] },
  sotos_3 = { split_files <- strsplit(my_files_2, "-", fixed = TRUE)
  i3 <- unlist(lapply(split_files, "[[", 3))
  i4 <- unlist(lapply(split_files, "[[", 2))
  my_files_2[order(i4, match(i3, my_order))] },
  alt = my_files_2[unlist(lapply(my_order, \(n) which(grepl(n, my_files_2))))],
  tidyverse = my_files_2 %>%
    data.frame(model = .) %>%
    separate_wider_delim(model, delim = "-", names_sep = "", cols_remove = FALSE) %>%
    mutate(model3 = match(model3, my_order)) %>%
    arrange(model2, model3) %>%
    pull(modelmodel)
)

Benchmark results based on the size of my_list_2 are also provided.

英文:

A little bit verbose, but you could use a tidyverse based solution:

library(dplyr)
library(tidyr)

file_all %&gt;% 
  data.frame(model = .) %&gt;% 
  separate_wider_delim(model, delim = &quot;-&quot;, names_sep = &quot;&quot;, cols_remove = FALSE) %&gt;% 
  mutate(model3 = match(model3, my_order)) %&gt;% 
  arrange(model2, model3) %&gt;% 
  pull(modelmodel)

This returns

 [1] &quot;bs-cnrmaladin-djf-tas-cru.nc&quot;   &quot;bs-cnrmaladin-mam-tas-cru.nc&quot;   &quot;bs-cnrmaladin-jja-tas-cru.nc&quot;  
 [4] &quot;bs-cnrmaladin-son-tas-cru.nc&quot;   &quot;bs-ecearthcclm-djf-tas-cru.nc&quot;  &quot;bs-ecearthcclm-mam-tas-cru.nc&quot; 
 [7] &quot;bs-ecearthcclm-jja-tas-cru.nc&quot;  &quot;bs-ecearthcclm-son-tas-cru.nc&quot;  &quot;bs-hadgemhirham-djf-tas-cru.nc&quot;
[10] &quot;bs-hadgemhirham-mam-tas-cru.nc&quot; &quot;bs-hadgemhirham-jja-tas-cru.nc&quot; &quot;bs-hadgemhirham-son-tas-cru.nc&quot;
[13] &quot;bs-mpiracmo-djf-tas-cru.nc&quot;     &quot;bs-mpiracmo-mam-tas-cru.nc&quot;     &quot;bs-mpiracmo-jja-tas-cru.nc&quot;    
[16] &quot;bs-mpiracmo-son-tas-cru.nc&quot;     &quot;bs-noresmremo-djf-tas-cru.nc&quot;   &quot;bs-noresmremo-mam-tas-cru.nc&quot;  
[19] &quot;bs-noresmremo-jja-tas-cru.nc&quot;   &quot;bs-noresmremo-son-tas-cru.nc&quot;  

Data

file_all &lt;- c(&quot;bs-cnrmaladin-djf-tas-cru.nc&quot;   ,&quot;bs-cnrmaladin-jja-tas-cru.nc&quot;  ,
              &quot;bs-cnrmaladin-mam-tas-cru.nc&quot;   ,&quot;bs-cnrmaladin-son-tas-cru.nc&quot;  ,
              &quot;bs-ecearthcclm-djf-tas-cru.nc&quot;  ,&quot;bs-ecearthcclm-jja-tas-cru.nc&quot; ,
              &quot;bs-ecearthcclm-mam-tas-cru.nc&quot;  ,&quot;bs-ecearthcclm-son-tas-cru.nc&quot; ,
              &quot;bs-hadgemhirham-djf-tas-cru.nc&quot; ,&quot;bs-hadgemhirham-jja-tas-cru.nc&quot;,
              &quot;bs-hadgemhirham-mam-tas-cru.nc&quot; ,&quot;bs-hadgemhirham-son-tas-cru.nc&quot;,
              &quot;bs-mpiracmo-djf-tas-cru.nc&quot;     ,&quot;bs-mpiracmo-jja-tas-cru.nc&quot;    ,
              &quot;bs-mpiracmo-mam-tas-cru.nc&quot;     ,&quot;bs-mpiracmo-son-tas-cru.nc&quot;    ,
              &quot;bs-noresmremo-djf-tas-cru.nc&quot;   ,&quot;bs-noresmremo-jja-tas-cru.nc&quot;  ,
              &quot;bs-noresmremo-mam-tas-cru.nc&quot;   ,&quot;bs-noresmremo-son-tas-cru.nc&quot; )

my_order &lt;- c(&quot;djf&quot;, &quot;mam&quot;, &quot;jja&quot;, &quot;son&quot;)

Benchmark

Since I was curious I did a small benchmark of @Sotos answer vs. a variation of this answer and the tidyverse based solution:

library(microbenchmark)

my_order &lt;- c(&quot;djf&quot;, &quot;mam&quot;, &quot;jja&quot;, &quot;son&quot;)
my_files_2 &lt;- sample(file_all, 100000, replace = TRUE)

microbenchmark(
  sotos = my_files_2[order(match(gsub(&quot;.*-(.*)-tas.*&quot;, &quot;\&quot;, my_files_2), my_order))],
  sotos_2 = { split_files &lt;- strsplit(my_files_2, &quot;-&quot;)
  i3 &lt;- unlist(lapply(split_files, &quot;[[&quot;, 3))
  i4 &lt;- unlist(lapply(split_files, &quot;[[&quot;, 2))
  my_files_2[order(i4, match(i3, my_order))] },
  sotos_3 = { split_files &lt;- strsplit(my_files_2, &quot;-&quot;, fixed = TRUE)
  i3 &lt;- unlist(lapply(split_files, &quot;[[&quot;, 3))
  i4 &lt;- unlist(lapply(split_files, &quot;[[&quot;, 2))
  my_files_2[order(i4, match(i3, my_order))] },
  alt = my_files_2[unlist(lapply(my_order, \(n) which(grepl(n, my_files_2))))],
  tidyverse = my_files_2 %&gt;% 
    data.frame(model = .) %&gt;% 
    separate_wider_delim(model, delim = &quot;-&quot;, names_sep = &quot;&quot;, cols_remove = FALSE) %&gt;% 
    mutate(model3 = match(model3, my_order)) %&gt;% 
    arrange(model2, model3) %&gt;% 
    pull(modelmodel)
)

The result surprised me quite a bit

#&gt; Unit: milliseconds
#&gt;       expr      min       lq      mean    median        uq      max neval
#&gt;      sotos 126.5779 127.5660 129.03582 128.17515 128.83190 199.8997   100
#&gt;    sotos_2 403.7318 412.5335 430.80871 419.50455 433.35235 511.1252   100
#&gt;    sotos_3 220.1570 228.3046 244.84893 235.50230 246.01510 314.8171   100
#&gt;        alt 113.9459 114.7081 115.86338 115.50140 116.37555 121.4351   100
#&gt;  tidyverse  75.5209  83.2439  93.82626  85.46735  92.48615 183.7602   100

Here are a few benchmark results based on the size of my_list_2:

my_files_2 &lt;- sample(file_all, 100, replace = TRUE)
#&gt; Unit: microseconds
#&gt;       expr    min      lq     mean  median      uq     max neval
#&gt;      sotos  155.6  162.95  175.719  173.70  184.05   263.3   100
#&gt;    sotos_2  399.6  416.35  446.641  427.35  448.80   705.4   100
#&gt;    sotos_3  217.6  229.60  278.047  237.20  252.25  3846.8   100
#&gt;        alt  139.6  150.00  169.377  154.75  160.30  1452.0   100
#&gt;  tidyverse 7912.6 8265.80 8927.205 8529.85 8987.45 14056.3   100


my_files_2 &lt;- sample(file_all, 1000, replace = TRUE)
#&gt; Unit: milliseconds
#&gt;       expr    min      lq     mean  median      uq     max neval
#&gt;      sotos 1.2786 1.31485 1.368087 1.33890 1.37390  1.9579   100
#&gt;    sotos_2 3.6715 3.77725 3.849989 3.83145 3.90755  4.3005   100
#&gt;    sotos_3 1.8969 1.94865 2.004843 1.97525 2.01915  2.4871   100
#&gt;        alt 1.1500 1.18695 1.229627 1.20130 1.22640  2.4962   100
#&gt;  tidyverse 8.6329 8.99985 9.710873 9.21335 9.63085 14.6684   100


my_files_2 &lt;- sample(file_all, 10000, replace = TRUE)
#&gt; Unit: milliseconds
#&gt;       expr     min       lq     mean   median       uq     max neval
#&gt;      sotos 12.5224 12.81450 12.92100 12.89505 13.00510 13.5006   100
#&gt;    sotos_2 37.5119 38.09420 38.98338 38.34025 38.76710 51.7555   100
#&gt;    sotos_3 19.3073 19.70905 20.63162 19.88925 20.29015 25.1160   100
#&gt;        alt 11.2854 11.54560 11.74392 11.63175 11.72850 15.7317   100
#&gt;  tidyverse 14.4663 15.03035 16.42199 15.27875 15.49390 83.9643   100

At a sample size of ~20000 the tidyverse solution seems to take the lead.

huangapple
  • 本文由 发表于 2023年4月6日 20:02:59
  • 转载请务必保留本文链接:https://go.coder-hub.com/75949307.html
匿名

发表评论

匿名网友

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen:

确定