May 24, 2023, 18:32:50

How to read the latest dataset using fread in R?

Question

I am trying to read the latest data from my data folder. I have the same data saved at different timestamps.

UPDATED ACCORDING TO THE SUGGESTION IN THE COMMENT

For example:

As you can see, my first dataset was produced on 2023-05-24 at 10:48:

log_metrics202305241048
# A tibble: 10 × 9
   `App name`                  Total number of view…¹ Total number of view…² Total number of view…³
   <chr>                                        <dbl>                  <dbl>                  <dbl>
 1 animals_to_groups                               22                     21                     19
 2 cage_randomiser                                 34                     33                     30
 3 combo_cor                                       19                     18                     18
 4 crispr_screen_viz                               23                     81                     21
 5 dep_map_bem                                      5                     12                      5
 6 dtp_browser_prod                                 7                      6                      7
 7 flat                                            67                     66                     67
 8 growth_rate_explorer                            42                     47                     41
 9 moprospector                                     3                      3                      2
10 translatability_single_gene                     22                     43                     21
# ℹ abbreviated names: ¹`Total number of views by unique visitors`,
#   ²`Total number of views by number of logs`,
#   ³`Total number of views by unique visitor, over 30 seconds`
# ℹ 5 more variables: `Total number of views by logs, over 30 seconds` <dbl>,
#   `Average time spent per app (minutes) by logs` <dbl>, `Maxium duration (in minutes)` <time>,
#   `Minimum duration (in minutes` <time>, `Total number of views over an hour` <dbl>

This is my second dataset, which was produced on 2023-05-24 at 10:51:

log_metrics202305241051
# A tibble: 10 × 9
   `App name`                  Total number of view…¹ Total number of view…² Total number of view…³
   <chr>                                        <dbl>                  <dbl>                  <dbl>
 1 animals_to_groups                               22                     21                     19
 2 cage_randomiser                                 34                     33                     30
 3 combo_cor                                       19                     18                     18
 4 crispr_screen_viz                               23                     81                     21
 5 dep_map_bem                                      5                     12                      5
 6 dtp_browser_prod                                 7                      6                      7
 7 flat                                            67                     66                     67
 8 growth_rate_explorer                            42                     47                     41
 9 moprospector                                     3                      3                      2
10 translatability_single_gene                     22                     43                     21
# ℹ abbreviated names: ¹`Total number of views by unique visitors`,
#   ²`Total number of views by number of logs`,
#   ³`Total number of views by unique visitor, over 30 seconds`
# ℹ 5 more variables: `Total number of views by logs, over 30 seconds` <dbl>,
#   `Average time spent per app (minutes) by logs` <dbl>, `Maxium duration (in minutes)` <time>,
#   `Minimum duration (in minutes` <time>, `Total number of views over an hour` <dbl>

Now, I want to read the latest dataset, which is the second one, produced at 10:51. As background, I eventually want to produce monthly datasets, but since that is for the future, for now I am using date-and-time stamps. I therefore think it is best to use both the date and the time so that the latest file can be recognised and picked.

I am using data.table::fread to read the data because it reads very quickly.

Now I want to write something like data.table::fread('data/generated_metrics/log_visits_ with the latest date.csv') to read the latest dataset in the data/generated_metrics/ folder, where all these timestamped log_visits files are sitting.

Is there a way to do this programmatically with fread, so that it reads the latest dataset? I am also open to alternative functions for reading my latest data.

Answer 1

Score: 2

You can use as.POSIXct to parse the filenames. Edit: since you added the use of a subdir, we need to expand list.files a little, and then use basename to work on just the filenames ...

library(data.table)  # provides fread()

# List candidate files with their full paths so fread() can open them later.
files <- list.files(path = "my/data/directory", pattern = "log_metrics.*",
                    full.names = TRUE)
# files <- c("my/data/directory/log_metrics24_May_2023_10_03", "my/data/directory/log_metrics24_May_2023_10_08")
basename(files)
# [1] "log_metrics24_May_2023_10_03" "log_metrics24_May_2023_10_08"
# Parse the timestamp embedded in each file name.
as.POSIXct(basename(files), format = "log_metrics%d_%b_%Y_%H_%M")
# [1] "2023-05-24 10:03:00 EDT" "2023-05-24 10:08:00 EDT"
# which.max() returns the index of the most recent timestamp.
files[which.max(as.POSIXct(basename(files), format = "log_metrics%d_%b_%Y_%H_%M"))]
# [1] "my/data/directory/log_metrics24_May_2023_10_08"
fread(files[which.max(as.POSIXct(basename(files), format = "log_metrics%d_%b_%Y_%H_%M"))])
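The file names in the updated question carry a compact stamp (for example log_metrics202305241048) rather than the 24_May_2023_10_03 style parsed here, so the format string would need to change. A minimal sketch of the same idea for that naming scheme follows; the helper name latest_file, the .csv extension, and the assumption that every stamp is exactly 12 digits (yyyymmddHHMM) are illustrative and not taken from the original post.

```r
# Hypothetical helper: return the path whose file name carries the latest
# yyyymmddHHMM stamp, e.g. "log_metrics202305241048.csv".
latest_file <- function(files) {
  # Extract the first run of 12 digits from each file name.
  stamp <- sub("^.*?([0-9]{12}).*$", "\\1", basename(files), perl = TRUE)
  # sub() returns the input unchanged when nothing matches; mark those as NA.
  stamp[!grepl("^[0-9]{12}$", stamp)] <- NA_character_
  when <- as.POSIXct(stamp, format = "%Y%m%d%H%M", tz = "UTC")
  if (all(is.na(when))) stop("no file name contains a yyyymmddHHMM stamp")
  # which.max() skips NA entries and returns the index of the newest stamp.
  files[which.max(when)]
}

# Example paths modelled on the names in the question (extension assumed).
files <- c("data/generated_metrics/log_metrics202305241048.csv",
           "data/generated_metrics/log_metrics202305241051.csv")
latest_file(files)

# Possible usage against a real folder:
# data.table::fread(latest_file(list.files("data/generated_metrics",
#                                          pattern = "^log_metrics",
#                                          full.names = TRUE)))
```

Working on basename(files) keeps digits in directory names from being mistaken for part of the stamp.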

If your filename prefix varies, one hack (that may or may not work) is to strip everything leading up to the first number. Note that basename is no longer used, so this only works on full paths if the directory names contain no digits.

sub("^[^0-9]*", "", files)
# [1] "24_May_2023_10_03" "24_May_2023_10_08"
as.POSIXct(sub("^[^0-9]*", "", files), format = "%d_%b_%Y_%H_%M")
# [1] "2023-05-24 10:03:00 EDT" "2023-05-24 10:08:00 EDT"
files[which.max(as.POSIXct(sub("^[^0-9]*", "", files), format = "%d_%b_%Y_%H_%M"))]
# [1] "my/data/directory/log_metrics24_May_2023_10_08"

If you aren't familiar with as.POSIXct's use of %-codes, see ?strptime.
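As a quick illustration of those %-codes, the sketch below parses one of the example names and prints it back in a different layout. The tz = "UTC" argument is an addition here, to make the result independent of the local time zone; also note that %b matches abbreviated month names in the current locale, so "May" is assumed to be recognised (it is in English and C locales).

```r
# %d = day of month, %b = abbreviated month name (locale-dependent),
# %Y = four-digit year, %H = hour (00-23), %M = minute.
# Literal text in the format string ("log_metrics", "_") must match the input.
x <- as.POSIXct("log_metrics24_May_2023_10_08",
                format = "log_metrics%d_%b_%Y_%H_%M", tz = "UTC")

# The same codes work in the other direction with format().
format(x, "%Y-%m-%d %H:%M")
```

If the input does not match the format string, as.POSIXct returns NA rather than raising an error, so it is worth checking for NA before calling which.max.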
