英文:
Sorting Data with custom sequence in R
问题
我有一个问题,需要按文件名对数据库中的数据进行排序。
首先,我有一个包含20个netcdf数据的文件列表。这是数据列表:
file_all <- list.files(pattern=glob2rx("*.nc"))
> file_all
[1] "bs-cnrmaladin-djf-tas-cru.nc" "bs-cnrmaladin-jja-tas-cru.nc"
[3] "bs-cnrmaladin-mam-tas-cru.nc" "bs-cnrmaladin-son-tas-cru.nc"
[5] "bs-ecearthcclm-djf-tas-cru.nc" "bs-ecearthcclm-jja-tas-cru.nc"
[7] "bs-ecearthcclm-mam-tas-cru.nc" "bs-ecearthcclm-son-tas-cru.nc"
[9] "bs-hadgemhirham-djf-tas-cru.nc" "bs-hadgemhirham-jja-tas-cru.nc"
[11] "bs-hadgemhirham-mam-tas-cru.nc" "bs-hadgemhirham-son-tas-cru.nc"
[13] "bs-mpiracmo-djf-tas-cru.nc" "bs-mpiracmo-jja-tas-cru.nc"
[15] "bs-mpiracmo-mam-tas-cru.nc" "bs-mpiracmo-son-tas-cru.nc"
[17] "bs-noresmremo-djf-tas-cru.nc" "bs-noresmremo-jja-tas-cru.nc"
[19] "bs-noresmremo-mam-tas-cru.nc" "bs-noresmremo-son-tas-cru.nc"
我需要首先对它进行排序,因为我需要顺序的数据。我想要的顺序是:
djf -> mam -> jja -> son
这是我期望的输出顺序:
[1] "bs-cnrmaladin-djf-tas-cru.nc" "bs-cnrmaladin-mam-tas-cru.nc"
[3] "bs-cnrmaladin-jja-tas-cru.nc" "bs-cnrmaladin-son-tas-cru.nc"
[5] "bs-ecearthcclm-djf-tas-cru.nc" "bs-ecearthcclm-mam-tas-cru.nc"
[7] "bs-ecearthcclm-jja-tas-cru.nc" "bs-ecearthcclm-son-tas-cru.nc"
[9] "bs-hadgemhirham-djf-tas-cru.nc" "bs-hadgemhirham-mam-tas-cru.nc"
[11] "bs-hadgemhirham-jja-tas-cru.nc" "bs-hadgemhirham-son-tas-cru.nc"
[13] "bs-mpiracmo-djf-tas-cru.nc" "bs-mpiracmo-mam-tas-cru.nc"
[15] "bs-mpiracmo-jja-tas-cru.nc" "bs-mpiracmo-son-tas-cru.nc"
[17] "bs-noresmremo-djf-tas-cru.nc" "bs-noresmremo-mam-tas-cru.nc"
[19] "bs-noresmremo-jja-tas-cru.nc" "bs-noresmremo-son-tas-cru.nc"
如果有人知道如何按照这个顺序排序,请帮助我。
提前感谢。
英文:
I have a problem with sorting data base on the name of the files.
First I have list file containing 20 netcdf data. This is the data list :
file_all <- list.files(pattern=glob2rx("*.nc"))
> file_all
[1] "bs-cnrmaladin-djf-tas-cru.nc" "bs-cnrmaladin-jja-tas-cru.nc"
[3] "bs-cnrmaladin-mam-tas-cru.nc" "bs-cnrmaladin-son-tas-cru.nc"
[5] "bs-ecearthcclm-djf-tas-cru.nc" "bs-ecearthcclm-jja-tas-cru.nc"
[7] "bs-ecearthcclm-mam-tas-cru.nc" "bs-ecearthcclm-son-tas-cru.nc"
[9] "bs-hadgemhirham-djf-tas-cru.nc" "bs-hadgemhirham-jja-tas-cru.nc"
[11] "bs-hadgemhirham-mam-tas-cru.nc" "bs-hadgemhirham-son-tas-cru.nc"
[13] "bs-mpiracmo-djf-tas-cru.nc" "bs-mpiracmo-jja-tas-cru.nc"
[15] "bs-mpiracmo-mam-tas-cru.nc" "bs-mpiracmo-son-tas-cru.nc"
[17] "bs-noresmremo-djf-tas-cru.nc" "bs-noresmremo-jja-tas-cru.nc"
[19] "bs-noresmremo-mam-tas-cru.nc" "bs-noresmremo-son-tas-cru.nc"
I need to sort it first because I need sequential data. The order that I want is:
djf -> mam -> jja -> son
This is my expected output order:
[1] "bs-cnrmaladin-djf-tas-cru.nc" "bs-cnrmaladin-mam-tas-cru.nc"
[3] "bs-cnrmaladin-jja-tas-cru.nc" "bs-cnrmaladin-son-tas-cru.nc"
[5] "bs-ecearthcclm-djf-tas-cru.nc" "bs-ecearthcclm-mam-tas-cru.nc"
[7] "bs-ecearthcclm-jja-tas-cru.nc" "bs-ecearthcclm-son-tas-cru.nc"
[9] "bs-hadgemhirham-djf-tas-cru.nc" "bs-hadgemhirham-mam-tas-cru.nc"
[11] "bs-hadgemhirham-jja-tas-cru.nc" "bs-hadgemhirham-son-tas-cru.nc"
[13] "bs-mpiracmo-djf-tas-cru.nc" "bs-mpiracmo-mam-tas-cru.nc"
[15] "bs-mpiracmo-jja-tas-cru.nc" "bs-mpiracmo-son-tas-cru.nc"
[17] "bs-noresmremo-djf-tas-cru.nc" "bs-noresmremo-mam-tas-cru.nc"
[19] "bs-noresmremo-jja-tas-cru.nc" "bs-noresmremo-son-tas-cru.nc"
If anyone know how to sort with this sequence please help me.
Thanks in advance
答案1
得分: 2
以下是您要翻译的代码部分:
你可以提取字符串的这一部分并与你的顺序进行匹配,即
my_order <- c("djf", "mam", "jja", "son")
i1 <- gsub(".*-(.*)-tas.*", "\", file_all)
i2 <- gsub("bs-([^-]*)-.*", "\", file_all)
file_all[order(i2, match(i1, my_order))]
一些替代方法包括:
使用 strsplit
。这可以提高效率,因为它不使用正则表达式。
split_files <- strsplit(strings, "-")
i3 <- sapply(split_files, "[[", 3)
i4 <- sapply(split_files, "[[", 2)
file_all[order(i4, match(i3, my_order))]
英文:
You can extract that part of the string and match it with your order, i.e.
my_order <- c("djf", "mam", "jja", "son")
i1 <- gsub(".*-(.*)-tas.*", "\", file_all)
i2 <- gsub("bs-([^-]*)-.*", "\", file_all)
file_all[order(i2, match(i1, my_order))]
[1] "bs-cnrmaladin-djf-tas-cru.nc" "bs-cnrmaladin-mam-tas-cru.nc" "bs-cnrmaladin-jja-tas-cru.nc" "bs-cnrmaladin-son-tas-cru.nc"
[5] "bs-ecearthcclm-djf-tas-cru.nc" "bs-ecearthcclm-mam-tas-cru.nc" "bs-ecearthcclm-jja-tas-cru.nc" "bs-ecearthcclm-son-tas-cru.nc"
[9] "bs-hadgemhirham-djf-tas-cru.nc" "bs-hadgemhirham-mam-tas-cru.nc" "bs-hadgemhirham-jja-tas-cru.nc" "bs-hadgemhirham-son-tas-cru.nc"
[13] "bs-mpiracmo-djf-tas-cru.nc" "bs-mpiracmo-mam-tas-cru.nc" "bs-mpiracmo-jja-tas-cru.nc" "bs-mpiracmo-son-tas-cru.nc"
[17] "bs-noresmremo-djf-tas-cru.nc" "bs-noresmremo-mam-tas-cru.nc" "bs-noresmremo-jja-tas-cru.nc" "bs-noresmremo-son-tas-cru.nc"
A few alternatives would be:
To use strsplit
. This could improve efficiency as it doesn't use regular expressions.
split_files <- strsplit(strings, "-")
i3 <- sapply(split_files, "[[", 3)
i4 <- sapply(split_files, "[[", 2)
file_all[order(i4, match(i3, my_order))]
答案2
得分: 1
A little bit verbose, but you could use a tidyverse
based solution:
library(dplyr)
library(tidyr)
file_all %>%
data.frame(model = .) %>%
separate_wider_delim(model, delim = "-", names_sep = "", cols_remove = FALSE) %>%
mutate(model3 = match(model3, my_order)) %>%
arrange(model2, model3) %>%
pull(modelmodel)
This returns:
[1] "bs-cnrmaladin-djf-tas-cru.nc" "bs-cnrmaladin-mam-tas-cru.nc" "bs-cnrmaladin-jja-tas-cru.nc"
[4] "bs-cnrmaladin-son-tas-cru.nc" "bs-ecearthcclm-djf-tas-cru.nc" "bs-ecearthcclm-mam-tas-cru.nc"
[7] "bs-ecearthcclm-jja-tas-cru.nc" "bs-ecearthcclm-son-tas-cru.nc" "bs-hadgemhirham-djf-tas-cru.nc"
[10] "bs-hadgemhirham-mam-tas-cru.nc" "bs-hadgemhirham-jja-tas-cru.nc" "bs-hadgemhirham-son-tas-cru.nc"
[13] "bs-mpiracmo-djf-tas-cru.nc" "bs-mpiracmo-mam-tas-cru.nc" "bs-mpiracmo-jja-tas-cru.nc"
[16] "bs-mpiracmo-son-tas-cru.nc" "bs-noresmremo-djf-tas-cru.nc" "bs-noresmremo-mam-tas-cru.nc"
[19] "bs-noresmremo-jja-tas-cru.nc" "bs-noresmremo-son-tas-cru.nc"
Data:
file_all <- c("bs-cnrmaladin-djf-tas-cru.nc", "bs-cnrmaladin-jja-tas-cru.nc",
"bs-cnrmaladin-mam-tas-cru.nc", "bs-cnrmaladin-son-tas-cru.nc",
"bs-ecearthcclm-djf-tas-cru.nc", "bs-ecearthcclm-jja-tas-cru.nc",
"bs-ecearthcclm-mam-tas-cru.nc", "bs-ecearthcclm-son-tas-cru.nc",
"bs-hadgemhirham-djf-tas-cru.nc", "bs-hadgemhirham-jja-tas-cru.nc",
"bs-hadgemhirham-mam-tas-cru.nc", "bs-hadgemhirham-son-tas-cru.nc",
"bs-mpiracmo-djf-tas-cru.nc", "bs-mpiracmo-jja-tas-cru.nc",
"bs-mpiracmo-mam-tas-cru.nc", "bs-mpiracmo-son-tas-cru.nc",
"bs-noresmremo-djf-tas-cru.nc", "bs-noresmremo-jja-tas-cru.nc",
"bs-noresmremo-mam-tas-cru.nc", "bs-noresmremo-son-tas-cru.nc")
my_order <- c("djf", "mam", "jja", "son")
Benchmark:
library(microbenchmark)
my_order <- c("djf", "mam", "jja", "son")
my_files_2 <- sample(file_all, 100000, replace = TRUE)
microbenchmark(
sotos = my_files_2[order(match(gsub(".*-(.*)-tas.*", "\", my_files_2), my_order))],
sotos_2 = { split_files <- strsplit(my_files_2, "-")
i3 <- unlist(lapply(split_files, "[[", 3))
i4 <- unlist(lapply(split_files, "[[", 2))
my_files_2[order(i4, match(i3, my_order))] },
sotos_3 = { split_files <- strsplit(my_files_2, "-", fixed = TRUE)
i3 <- unlist(lapply(split_files, "[[", 3))
i4 <- unlist(lapply(split_files, "[[", 2))
my_files_2[order(i4, match(i3, my_order))] },
alt = my_files_2[unlist(lapply(my_order, \(n) which(grepl(n, my_files_2))))],
tidyverse = my_files_2 %>%
data.frame(model = .) %>%
separate_wider_delim(model, delim = "-", names_sep = "", cols_remove = FALSE) %>%
mutate(model3 = match(model3, my_order)) %>%
arrange(model2, model3) %>%
pull(modelmodel)
)
Benchmark results based on the size of my_list_2
are also provided.
英文:
A little bit verbose, but you could use a tidyverse
based solution:
library(dplyr)
library(tidyr)
file_all %>%
data.frame(model = .) %>%
separate_wider_delim(model, delim = "-", names_sep = "", cols_remove = FALSE) %>%
mutate(model3 = match(model3, my_order)) %>%
arrange(model2, model3) %>%
pull(modelmodel)
This returns
[1] "bs-cnrmaladin-djf-tas-cru.nc" "bs-cnrmaladin-mam-tas-cru.nc" "bs-cnrmaladin-jja-tas-cru.nc"
[4] "bs-cnrmaladin-son-tas-cru.nc" "bs-ecearthcclm-djf-tas-cru.nc" "bs-ecearthcclm-mam-tas-cru.nc"
[7] "bs-ecearthcclm-jja-tas-cru.nc" "bs-ecearthcclm-son-tas-cru.nc" "bs-hadgemhirham-djf-tas-cru.nc"
[10] "bs-hadgemhirham-mam-tas-cru.nc" "bs-hadgemhirham-jja-tas-cru.nc" "bs-hadgemhirham-son-tas-cru.nc"
[13] "bs-mpiracmo-djf-tas-cru.nc" "bs-mpiracmo-mam-tas-cru.nc" "bs-mpiracmo-jja-tas-cru.nc"
[16] "bs-mpiracmo-son-tas-cru.nc" "bs-noresmremo-djf-tas-cru.nc" "bs-noresmremo-mam-tas-cru.nc"
[19] "bs-noresmremo-jja-tas-cru.nc" "bs-noresmremo-son-tas-cru.nc"
Data
file_all <- c("bs-cnrmaladin-djf-tas-cru.nc" ,"bs-cnrmaladin-jja-tas-cru.nc" ,
"bs-cnrmaladin-mam-tas-cru.nc" ,"bs-cnrmaladin-son-tas-cru.nc" ,
"bs-ecearthcclm-djf-tas-cru.nc" ,"bs-ecearthcclm-jja-tas-cru.nc" ,
"bs-ecearthcclm-mam-tas-cru.nc" ,"bs-ecearthcclm-son-tas-cru.nc" ,
"bs-hadgemhirham-djf-tas-cru.nc" ,"bs-hadgemhirham-jja-tas-cru.nc",
"bs-hadgemhirham-mam-tas-cru.nc" ,"bs-hadgemhirham-son-tas-cru.nc",
"bs-mpiracmo-djf-tas-cru.nc" ,"bs-mpiracmo-jja-tas-cru.nc" ,
"bs-mpiracmo-mam-tas-cru.nc" ,"bs-mpiracmo-son-tas-cru.nc" ,
"bs-noresmremo-djf-tas-cru.nc" ,"bs-noresmremo-jja-tas-cru.nc" ,
"bs-noresmremo-mam-tas-cru.nc" ,"bs-noresmremo-son-tas-cru.nc" )
my_order <- c("djf", "mam", "jja", "son")
Benchmark
Since I was curious I did a small benchmark of @Sotos answer vs. a variation of this answer and the tidyverse
based solution:
library(microbenchmark)
my_order <- c("djf", "mam", "jja", "son")
my_files_2 <- sample(file_all, 100000, replace = TRUE)
microbenchmark(
sotos = my_files_2[order(match(gsub(".*-(.*)-tas.*", "\", my_files_2), my_order))],
sotos_2 = { split_files <- strsplit(my_files_2, "-")
i3 <- unlist(lapply(split_files, "[[", 3))
i4 <- unlist(lapply(split_files, "[[", 2))
my_files_2[order(i4, match(i3, my_order))] },
sotos_3 = { split_files <- strsplit(my_files_2, "-", fixed = TRUE)
i3 <- unlist(lapply(split_files, "[[", 3))
i4 <- unlist(lapply(split_files, "[[", 2))
my_files_2[order(i4, match(i3, my_order))] },
alt = my_files_2[unlist(lapply(my_order, \(n) which(grepl(n, my_files_2))))],
tidyverse = my_files_2 %>%
data.frame(model = .) %>%
separate_wider_delim(model, delim = "-", names_sep = "", cols_remove = FALSE) %>%
mutate(model3 = match(model3, my_order)) %>%
arrange(model2, model3) %>%
pull(modelmodel)
)
The result surprised me quite a bit
#> Unit: milliseconds
#> expr min lq mean median uq max neval
#> sotos 126.5779 127.5660 129.03582 128.17515 128.83190 199.8997 100
#> sotos_2 403.7318 412.5335 430.80871 419.50455 433.35235 511.1252 100
#> sotos_3 220.1570 228.3046 244.84893 235.50230 246.01510 314.8171 100
#> alt 113.9459 114.7081 115.86338 115.50140 116.37555 121.4351 100
#> tidyverse 75.5209 83.2439 93.82626 85.46735 92.48615 183.7602 100
Here are a few benchmark results based on the size of my_list_2
:
my_files_2 <- sample(file_all, 100, replace = TRUE)
#> Unit: microseconds
#> expr min lq mean median uq max neval
#> sotos 155.6 162.95 175.719 173.70 184.05 263.3 100
#> sotos_2 399.6 416.35 446.641 427.35 448.80 705.4 100
#> sotos_3 217.6 229.60 278.047 237.20 252.25 3846.8 100
#> alt 139.6 150.00 169.377 154.75 160.30 1452.0 100
#> tidyverse 7912.6 8265.80 8927.205 8529.85 8987.45 14056.3 100
my_files_2 <- sample(file_all, 1000, replace = TRUE)
#> Unit: milliseconds
#> expr min lq mean median uq max neval
#> sotos 1.2786 1.31485 1.368087 1.33890 1.37390 1.9579 100
#> sotos_2 3.6715 3.77725 3.849989 3.83145 3.90755 4.3005 100
#> sotos_3 1.8969 1.94865 2.004843 1.97525 2.01915 2.4871 100
#> alt 1.1500 1.18695 1.229627 1.20130 1.22640 2.4962 100
#> tidyverse 8.6329 8.99985 9.710873 9.21335 9.63085 14.6684 100
my_files_2 <- sample(file_all, 10000, replace = TRUE)
#> Unit: milliseconds
#> expr min lq mean median uq max neval
#> sotos 12.5224 12.81450 12.92100 12.89505 13.00510 13.5006 100
#> sotos_2 37.5119 38.09420 38.98338 38.34025 38.76710 51.7555 100
#> sotos_3 19.3073 19.70905 20.63162 19.88925 20.29015 25.1160 100
#> alt 11.2854 11.54560 11.74392 11.63175 11.72850 15.7317 100
#> tidyverse 14.4663 15.03035 16.42199 15.27875 15.49390 83.9643 100
At a sample size of ~20000
the tidyverse solution seems to take the lead.
通过集体智慧和协作来改善编程学习和解决问题的方式。致力于成为全球开发者共同参与的知识库,让每个人都能够通过互相帮助和分享经验来进步。
评论