How to read the latest dataset using fread in R?

Question

I am trying to:

  • read the latest data from my data folder. I have the same data but at different time stamps.

UPDATED ACCORDING TO THE SUGGESTION IN THE COMMENT

For example:

As you can see, my first dataset was produced on 2023-05-24 at 10:48:

log_metrics202305241048
# A tibble: 10 × 9
   `App name`                  Total number of view…¹ Total number of view…² Total number of view…³
   <chr>                                        <dbl>                  <dbl>                  <dbl>
 1 animals_to_groups                               22                     21                     19
 2 cage_randomiser                                 34                     33                     30
 3 combo_cor                                       19                     18                     18
 4 crispr_screen_viz                               23                     81                     21
 5 dep_map_bem                                      5                     12                      5
 6 dtp_browser_prod                                 7                      6                      7
 7 flat                                            67                     66                     67
 8 growth_rate_explorer                            42                     47                     41
 9 moprospector                                     3                      3                      2
10 translatability_single_gene                     22                     43                     21
# ℹ abbreviated names: ¹`Total number of views by unique visitors`,
#   ²`Total number of views by number of logs`,
#   ³`Total number of views by unique visitor, over 30 seconds`
# ℹ 5 more variables: `Total number of views by logs, over 30 seconds` <dbl>,
#   `Average time spent per app (minutes) by logs` <dbl>, `Maxium duration (in minutes)` <time>,
#   `Minimum duration (in minutes` <time>, `Total number of views over an hour` <dbl>

This is my second dataset, which was produced on 2023-05-24 at 10:51:

log_metrics202305241051
# A tibble: 10 × 9
   `App name`                  Total number of view…¹ Total number of view…² Total number of view…³
   <chr>                                        <dbl>                  <dbl>                  <dbl>
 1 animals_to_groups                               22                     21                     19
 2 cage_randomiser                                 34                     33                     30
 3 combo_cor                                       19                     18                     18
 4 crispr_screen_viz                               23                     81                     21
 5 dep_map_bem                                      5                     12                      5
 6 dtp_browser_prod                                 7                      6                      7
 7 flat                                            67                     66                     67
 8 growth_rate_explorer                            42                     47                     41
 9 moprospector                                     3                      3                      2
10 translatability_single_gene                     22                     43                     21
# ℹ abbreviated names: ¹`Total number of views by unique visitors`,
#   ²`Total number of views by number of logs`,
#   ³`Total number of views by unique visitor, over 30 seconds`
# ℹ 5 more variables: `Total number of views by logs, over 30 seconds` <dbl>,
#   `Average time spent per app (minutes) by logs` <dbl>, `Maxium duration (in minutes)` <time>,
#   `Minimum duration (in minutes` <time>, `Total number of views over an hour` <dbl>

Now, I want to read the latest dataset, which here is the second one, produced at 10:51. As background, I eventually want to produce monthly datasets, but until then the files carry datetime stamps. So I believe it is better to use both the date and the time to recognise and pick the latest file.

I am using data.table::fread to read the data because it reads very quickly.

Now I want to write code along the lines of data.table::fread('data/generated_metrics/log_visits_ with the latest date.csv') that reads the latest dataset in the data/generated_metrics/ folder, where all these timestamped log_visits files are sitting.

Is there a way to do this programmatically with the fread function? I am also open to alternative functions for reading my latest data.

Answer 1

Score: 2

You can use as.POSIXct to parse the filenames. Edit: since you added the use of a subdirectory, we need to expand the list.files call a little (full.names = TRUE) and then use basename to work on just the filenames:

files <- list.files(path = "my/data/directory", pattern = "log_metrics.*",
                    full.names = TRUE)
# files <- c("my/data/directory/log_metrics24_May_2023_10_03", "my/data/directory/log_metrics24_May_2023_10_08")
basename(files)
# [1] "log_metrics24_May_2023_10_03" "log_metrics24_May_2023_10_08"
as.POSIXct(basename(files), format = "log_metrics%d_%b_%Y_%H_%M")
# [1] "2023-05-24 10:03:00 EDT" "2023-05-24 10:08:00 EDT"
files[which.max(as.POSIXct(basename(files), format = "log_metrics%d_%b_%Y_%H_%M"))]
# [1] "my/data/directory/log_metrics24_May_2023_10_08"
data.table::fread(files[which.max(as.POSIXct(basename(files), format = "log_metrics%d_%b_%Y_%H_%M"))])
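
The same idea can be adapted to the compact names shown in the updated question, such as log_metrics202305241048. The following is only a sketch: latest_file is a hypothetical helper, and its prefix, the 12-digit stamp, the .csv extension, and the demo directory are all assumptions about how the files are named and stored.

```r
# Hypothetical helper: return the path whose file name carries the latest
# timestamp. `prefix` and `stamp_format` are assumptions about the naming
# scheme; `prefix` is pasted into a regular expression, so it should not
# contain regex metacharacters.
latest_file <- function(dir, prefix = "log_metrics", stamp_format = "%Y%m%d%H%M") {
  files <- list.files(path = dir, pattern = paste0("^", prefix, "[0-9]{12}"),
                      full.names = TRUE)
  if (length(files) == 0) stop("no matching files in ", dir)
  # Characters after the parsed stamp (for example ".csv") are ignored by the parser.
  stamps <- as.POSIXct(basename(files), format = paste0(prefix, stamp_format),
                       tz = "UTC")
  files[which.max(stamps)]
}

# Demo with two throwaway files in a temporary directory.
demo_dir <- file.path(tempdir(), "generated_metrics_demo")
dir.create(demo_dir, showWarnings = FALSE)
for (stamp in c("202305241048", "202305241051")) {
  write.csv(data.frame(app = "flat", views = 67),
            file.path(demo_dir, paste0("log_metrics", stamp, ".csv")),
            row.names = FALSE)
}

newest <- latest_file(demo_dir)
basename(newest)

# Read it with fread if data.table is installed.
if (requireNamespace("data.table", quietly = TRUE)) {
  latest_data <- data.table::fread(newest)
}
```

Fixing tz = "UTC" only makes the comparison independent of the local time zone; it does not matter which zone is used as long as all stamps are parsed the same way.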

If your filename prefix varies, one hack (that may or may not work) is to strip everything leading up to the first number (note no more use of basename).

sub("^[^0-9]*", "", files)
# [1] "24_May_2023_10_03" "24_May_2023_10_08"
as.POSIXct(sub("^[^0-9]*", "", files), format = "%d_%b_%Y_%H_%M")
# [1] "2023-05-24 10:03:00 EDT" "2023-05-24 10:08:00 EDT"
files[which.max(as.POSIXct(sub("^[^0-9]*", "", files), format = "%d_%b_%Y_%H_%M"))]
# [1] "my/data/directory/log_metrics24_May_2023_10_08"
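
One case where this hack may not work is a directory name that contains a digit, because the stripping then stops inside the path. A possible variation, sketched below, pulls the stamp out of the base name with a pattern instead. The paths and the stamp_pattern are made up for illustration, and %b assumes English month abbreviations in the current locale.

```r
# Made-up paths: the directory "v2" contains a digit and the prefixes differ.
files <- c("data/v2/log_metrics24_May_2023_10_03",
           "data/v2/visit_log24_May_2023_10_08")

# Stripping leading non-digits from the full path stops at "v2":
sub("^[^0-9]*", "", files[1])
# [1] "2/log_metrics24_May_2023_10_03"

# Assumed stamp layout: dd_Mon_yyyy_HH_MM somewhere in the base name.
stamp_pattern <- "[0-9]{2}_[A-Za-z]{3}_[0-9]{4}_[0-9]{2}_[0-9]{2}"

# Keep only files that carry a stamp, so files and stamps stay aligned.
has_stamp <- grepl(stamp_pattern, basename(files))
files <- files[has_stamp]
stamps <- regmatches(basename(files), regexpr(stamp_pattern, basename(files)))

parsed <- as.POSIXct(stamps, format = "%d_%b_%Y_%H_%M", tz = "UTC")
newest <- files[which.max(parsed)]
newest
```

Working on basename(files) keeps the directory out of the match entirely, so the prefix can vary without affecting the parse.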

If you aren't familiar with as.POSIXct's use of %-codes, see ?strptime.
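
As a quick illustration of those codes, the sketch below formats one timestamp with both layouts that appear in this thread and parses a compact file name back. The timestamp and the log_metrics prefix are only example values.

```r
# Example timestamp; tz = "UTC" keeps the result independent of the local zone.
stamp <- as.POSIXct("2023-05-24 10:51:00", tz = "UTC")

# %d day of month, %b abbreviated month name (locale-dependent),
# %Y four-digit year, %m month number, %H hour (00-23), %M minute.
verbose <- format(stamp, "%d_%b_%Y_%H_%M")
compact <- format(stamp, "%Y%m%d%H%M")
compact
# [1] "202305241051"

# Literal text in the format (here the prefix) must match the input exactly.
parsed <- as.POSIXct(paste0("log_metrics", compact),
                     format = "log_metrics%Y%m%d%H%M", tz = "UTC")
parsed == stamp
```

The compact numeric layout has the side benefit of sorting chronologically as plain text, and it avoids the locale dependence of %b.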

Posted by huangapple on 2023-05-24 18:32:50
Source: https://go.coder-hub.com/76322568.html