How to read the latest dataset using fread in R?
Question
I am trying to:
- read the latest data from my data folder. I have the same data but at different time stamps.
UPDATED ACCORDING TO THE SUGGESTION IN THE COMMENT
For example:
As you can see, my first dataset was produced on 2023-05-24 at 10:48:
log_metrics202305241048
# A tibble: 10 × 9
`App name` Total number of view…¹ Total number of view…² Total number of view…³
<chr> <dbl> <dbl> <dbl>
1 animals_to_groups 22 21 19
2 cage_randomiser 34 33 30
3 combo_cor 19 18 18
4 crispr_screen_viz 23 81 21
5 dep_map_bem 5 12 5
6 dtp_browser_prod 7 6 7
7 flat 67 66 67
8 growth_rate_explorer 42 47 41
9 moprospector 3 3 2
10 translatability_single_gene 22 43 21
# ℹ abbreviated names: ¹`Total number of views by unique visitors`,
# ²`Total number of views by number of logs`,
# ³`Total number of views by unique visitor, over 30 seconds`
# ℹ 5 more variables: `Total number of views by logs, over 30 seconds` <dbl>,
# `Average time spent per app (minutes) by logs` <dbl>, `Maxium duration (in minutes)` <time>,
# `Minimum duration (in minutes` <time>, `Total number of views over an hour` <dbl>
This is my second dataset, which was clearly produced on 2023-05-24 at 10:51:
log_metrics202305241051
# A tibble: 10 × 9
`App name` Total number of view…¹ Total number of view…² Total number of view…³
<chr> <dbl> <dbl> <dbl>
1 animals_to_groups 22 21 19
2 cage_randomiser 34 33 30
3 combo_cor 19 18 18
4 crispr_screen_viz 23 81 21
5 dep_map_bem 5 12 5
6 dtp_browser_prod 7 6 7
7 flat 67 66 67
8 growth_rate_explorer 42 47 41
9 moprospector 3 3 2
10 translatability_single_gene 22 43 21
# ℹ abbreviated names: ¹`Total number of views by unique visitors`,
# ²`Total number of views by number of logs`,
# ³`Total number of views by unique visitor, over 30 seconds`
# ℹ 5 more variables: `Total number of views by logs, over 30 seconds` <dbl>,
# `Average time spent per app (minutes) by logs` <dbl>, `Maxium duration (in minutes)` <time>,
# `Minimum duration (in minutes` <time>, `Total number of views over an hour` <dbl>
Now I want to read the latest dataset, which here is the second one, produced at 10:51. As background, I eventually want to produce monthly datasets, but since that is for the future, for now I am using date-time stamps. I therefore think it is better to use both the date and the time so that the latest file can be recognised and picked.
I am using data.table::fread to read the data because it is very fast.
I want something like data.table::fread('data/generated_metrics/log_visits_ with the latest date.csv')
to read the latest dataset found in the data/generated_metrics/ folder, where all these time-stamped log_visits files are sitting.
Is there a way to do this programmatically with fread? I am also open to alternative functions for reading my latest data.
Answer 1
Score: 2
You can use as.POSIXct to parse the filenames. Edit: since you added the use of a subdir, we need to expand list.files a little, and then use basename to work on just the filenames ...
files <- list.files(path = "my/data/directory", pattern ="log_metrics.*",
full.names = TRUE)
# files <- c("my/data/directory/log_metrics24_May_2023_10_03", "my/data/directory/log_metrics24_May_2023_10_08")
basename(files)
# [1] "log_metrics24_May_2023_10_03" "log_metrics24_May_2023_10_08"
as.POSIXct(basename(files), format = "log_metrics%d_%b_%Y_%H_%M")
# [1] "2023-05-24 10:03:00 EDT" "2023-05-24 10:08:00 EDT"
files[which.max(as.POSIXct(basename(files), format = "log_metrics%d_%b_%Y_%H_%M"))]
# [1] "my/data/directory/log_metrics24_May_2023_10_08"
fread(files[which.max(as.POSIXct(basename(files), format = "log_metrics%d_%b_%Y_%H_%M"))])
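The updated question uses compact stamps such as log_metrics202305241048 rather than the 24_May_2023_10_03 style parsed above. The same idea can be adapted as a minimal sketch, assuming every file name is the prefix followed by a 12-digit yyyymmddHHMM stamp. The helper name latest_stamped_file, the demo directory, and the toy CSV contents are hypothetical and only there to make the example self-contained.

```r
# Hypothetical helper: return the path of the newest file whose name is
# <prefix> followed by a 12-digit yyyymmddHHMM stamp (assumed naming scheme).
latest_stamped_file <- function(dir, prefix = "log_metrics") {
  files <- list.files(dir, pattern = paste0("^", prefix, "[0-9]{12}"),
                      full.names = TRUE)
  if (length(files) == 0L) stop("no matching files in ", dir)
  # Keep only the 12 digits that follow the prefix, dropping any extension.
  stamp <- sub(paste0("^", prefix, "([0-9]{12}).*$"), "\\1", basename(files))
  when <- as.POSIXct(stamp, format = "%Y%m%d%H%M", tz = "UTC")
  files[which.max(when)]
}

# Demo on a throwaway directory with two toy files (made-up contents).
demo_dir <- file.path(tempdir(), "generated_metrics_demo")
dir.create(demo_dir, showWarnings = FALSE)
for (s in c("202305241048", "202305241051")) {
  write.csv(data.frame(app = "flat", views = 67),
            file.path(demo_dir, paste0("log_metrics", s, ".csv")),
            row.names = FALSE)
}

newest <- latest_stamped_file(demo_dir)

# Read it with fread if data.table is installed.
if (requireNamespace("data.table", quietly = TRUE)) {
  latest <- data.table::fread(newest)
}
```

In real use the call would presumably be latest_stamped_file("data/generated_metrics"), with the prefix set to whatever the files are actually called. Because this stamp is zero-padded and ordered from year down to minute, sorting the names as plain strings would pick the same file when all files share one prefix; parsing to POSIXct is kept here so the approach matches the answer above.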
If your filename prefix varies, one hack is to strip everything leading up to the first digit (note no more use of basename). This works only if nothing before the timestamp, including the directory path, contains a digit.
sub("^[^0-9]*", "", files)
# [1] "24_May_2023_10_03" "24_May_2023_10_08"
as.POSIXct(sub("^[^0-9]*", "", files), format = "%d_%b_%Y_%H_%M")
# [1] "2023-05-24 10:03:00 EDT" "2023-05-24 10:08:00 EDT"
files[which.max(as.POSIXct(sub("^[^0-9]*", "", files), format = "%d_%b_%Y_%H_%M"))]
# [1] "my/data/directory/log_metrics24_May_2023_10_08"
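If the directory path itself may contain digits, the stripping can be applied to basename(files) instead of the full paths. A small sketch of that variation, with made-up paths and prefixes, and assuming an English locale so that %b matches "May":

```r
# Made-up paths: the directory contains digits and the prefixes differ.
files <- c("runs/2023/log_metrics24_May_2023_10_03",
           "runs/2023/visits24_May_2023_10_08")

# Strip the non-digit prefix from the file name only, not the whole path.
stamps <- sub("^[^0-9]*", "", basename(files))
when <- as.POSIXct(stamps, format = "%d_%b_%Y_%H_%M", tz = "UTC")

# Index back into the full paths so the result can be passed to fread.
newest <- files[which.max(when)]
```

Without basename, the first digit found would be the "2023" in the directory name, and the timestamps would fail to parse.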
If you aren't familiar with as.POSIXct's use of %-codes, see ?strptime.