2023年2月16日 06:44:28go评论70阅读模式

英文:

R Programming - Data Cleaning - DateTime

问题

你好，Stackoverflow社区，

我目前正在处理一个具有日期/时间变量和用于量化某种强度体育活动时间的数值变量的大型数据集。数据集的形式如下：

data_raw <- structure(list(`Bout Start` = c("2/8/2017 9:01:00 AM", "2/8/2017 9:23:00 AM", "2/8/2017 9:42:00 AM", "2/8/2017 11:49:00 AM", "2/8/2017 1:39:00 PM"), `Bout End` = c("2/8/2017 9:12:00 AM", "2/8/2017 9:38:00 AM", "2/8/2017 9:52:00 AM", "2/8/2017 12:05:00 PM", "2/8/2017 1:58:00 PM"), `Time in Bout` = c(11, 15, 10, 16, 19)), row.names = c(NA, -5L), class = c("tbl_df", "tbl", "data.frame"))

我需要将我的数据集处理成以下形式：

data_processed <- structure(list(Date = structure(c(Date5306 = 17205, Date5307 = 17205, Date5308 = 17205, Date5309 = 17205, Date5310 = 17205), class = "Date"), Hour = structure(c(28800, 32400, 36000, 39600, 43200), class = c("hms", "difftime"), units = "secs"), `Time in Bout (Hourly)` = c(0, 36, 0, 11, 5)), row.names = c(NA, -5L), class = c("tbl_df", "tbl", "data.frame"))

请问有人能帮助我完成这个转换吗？非常感谢大家的帮助！

英文:

Hello Stackoverflow community,

I am currently working with a large dataset that has Date/Time variable and a numeric variable that quantifies the time spent in physical activity of a certain intensity. The dataset is in the form:

data_raw &lt;- structure(list(`Bout Start` = c(&quot;2/8/2017 9:01:00 AM&quot;, &quot;2/8/2017 9:23:00 AM&quot;, &quot;2/8/2017 9:42:00 AM&quot;, &quot;2/8/2017 11:49:00 AM&quot;, &quot;2/8/2017 1:39:00 PM&quot;), `Bout End` = c(&quot;2/8/2017 9:12:00 AM&quot;, &quot;2/8/2017 9:38:00 AM&quot;, &quot;2/8/2017 9:52:00 AM&quot;, &quot;2/8/2017 12:05:00 PM&quot;, &quot;2/8/2017 1:58:00 PM&quot;),`Time in Bout` = c(11, 15, 10, 16, 19)), row.names = c(NA, -5L), class = c(&quot;tbl_df&quot;, &quot;tbl&quot;, &quot;data.frame&quot;))

I require my dataset in the form:

data_processed &lt;- structure(list(Date = structure(c(Date5306 = 17205, Date5307 = 17205, Date5308 = 17205, Date5309 = 17205, Date5310 = 17205), class = &quot;Date&quot;), Hour = structure(c(28800, 32400, 36000, 39600, 43200), class = c(&quot;hms&quot;, &quot;difftime&quot;), units = &quot;secs&quot;), `Time in Bout (Hourly)` = c(0, 36, 0, 11, 5)), row.names = c(NA, -5L), class = c(&quot;tbl_df&quot;, &quot;tbl&quot;, &quot;data.frame&quot;))

Could someone please help me do this? Thank you all in advance!

答案1

得分: 1

首先，我们需要将data_raw中的日期时间字符串转换为实际的日期时间变量：

data <- within(data_raw, {
  `Bout Start` <- as.POSIXct(`Bout Start`, format = "%m/%d/%Y %I:%M:%S %p")
  `Bout End`   <- as.POSIXct(`Bout End`,   format = "%m/%d/%Y %I:%M:%S %p")
})

现在您的数据看起来像这样：

data
#> # A tibble: 5 x 3
#>   `Bout Start`        `Bout End`          `Time in Bout`
#>   <dttm>              <dttm>                       <dbl>
#> 1 2017-02-08 09:01:00 2017-02-08 09:12:00             11
#> 2 2017-02-08 09:23:00 2017-02-08 09:38:00             15
#> 3 2017-02-08 09:42:00 2017-02-08 09:52:00             10
#> 4 2017-02-08 11:49:00 2017-02-08 12:05:00             16
#> 5 2017-02-08 13:39:00 2017-02-08 13:58:00             19

现在我们需要创建一个小时向量，用于检查是否有发作事件发生的小时：

times <- seq(as.POSIXct("2017-02-08 08:00"), by = "hour", len = 7)

现在棘手的部分是计算每个小时内发作事件的分钟数：

mins <- rowSums(sapply(seq(nrow(data)), function(i) {
   a <- data$`Bout End`[i] - times
   a <- ifelse(a > 0 & a < 60, a, 0)
   b <- data$`Bout Start`[i] - times
   b <- ifelse(b > 0 & b < 60, b, 0)
   (a - b) %% 60
}))

最后，我们创建一个结果的数据框：

data.frame(Date = as.Date(head(times, -1)),
           Hour = strftime(head(times, -1), "%H:%M:%S"),
           `Time in bout` = head(mins, -1), check.names = FALSE)
#>         Date     Hour Time in bout
#> 1 2017-02-08 08:00:00            0
#> 2 2017-02-08 09:00:00           36
#> 3 2017-02-08 10:00:00            0
#> 4 2017-02-08 11:00:00           11
#> 5 2017-02-08 12:00:00            5
#> 6 2017-02-08 13:00:00           19

^{创建于2023-02-15，使用 reprex v2.0.2}

英文:

First of all, we need to convert your date-time strings in data_raw to actual date-time variables:

data &lt;- within(data_raw, {
  `Bout Start` &lt;- as.POSIXct(`Bout Start`, format = &quot;%m/%d/%Y %I:%M:%S %p&quot;)
  `Bout End`   &lt;- as.POSIXct(`Bout End`,   format = &quot;%m/%d/%Y %I:%M:%S %p&quot;)
})

Now your data looks like this:

data
#&gt; # A tibble: 5 x 3
#&gt;   `Bout Start`        `Bout End`          `Time in Bout`
#&gt;   &lt;dttm&gt;              &lt;dttm&gt;                       &lt;dbl&gt;
#&gt; 1 2017-02-08 09:01:00 2017-02-08 09:12:00             11
#&gt; 2 2017-02-08 09:23:00 2017-02-08 09:38:00             15
#&gt; 3 2017-02-08 09:42:00 2017-02-08 09:52:00             10
#&gt; 4 2017-02-08 11:49:00 2017-02-08 12:05:00             16
#&gt; 5 2017-02-08 13:39:00 2017-02-08 13:58:00             19

We now need to create a vector of hours at which you want to check for bouts:

times &lt;- seq(as.POSIXct(&quot;2017-02-08 08:00&quot;), by = &quot;hour&quot;, len = 7)

The tricky part is now just counting the minutes within each of these hours when there was a bout taking place:

mins &lt;- rowSums(sapply(seq(nrow(data)), function(i) {
   a &lt;- data$`Bout End`[i] - times
   a &lt;- ifelse(a &gt; 0 &amp; a &lt; 60, a, 0)
   b &lt;- data$`Bout Start`[i] - times
   b &lt;- ifelse(b &gt; 0 &amp; b &lt; 60, b, 0)
   (a - b) %% 60
}))

Finally, we create a data frame of the results:

data.frame(Date = as.Date(head(times, -1)),
           Hour = strftime(head(times, -1), &quot;%H:%M:%S&quot;),
           `Time in bout` = head(mins, -1), check.names = FALSE)
#&gt;         Date     Hour Time in bout
#&gt; 1 2017-02-08 08:00:00            0
#&gt; 2 2017-02-08 09:00:00           36
#&gt; 3 2017-02-08 10:00:00            0
#&gt; 4 2017-02-08 11:00:00           11
#&gt; 5 2017-02-08 12:00:00            5
#&gt; 6 2017-02-08 13:00:00           19

<sup>Created on 2023-02-15 with reprex v2.0.2</sup>

答案2

得分: 0

以下是翻译好的内容：

获取日期/小时序列以填充缺失数据的dd1
将跨越小时的时间拆分为相应的小时段，dd2
连接dd1和dd2
同时，将字符串动态转换为日期和完整的小时

library(dplyr) # &gt;= v1.1.0 for &quot;.by&quot; in full_join's summarize and consecutive_id
library(tidyr) # separate and replace_na
library(lubridate) # date functions

dd1 &lt;- tibble(ID = seq(
         ymd_hms(format(first(mdy_hms(data_raw$`Bout Start`)),
         &quot;%Y-%m-%d %H:00:00&quot;)), 
         ymd_hms(format(last(mdy_hms(data_raw$`Bout Start`)), 
         &quot;%Y-%m-%d %H:00:00&quot;)), 3600))

dd1
# A tibble: 5 &#215; 1
  ID                 
  &lt;dttm&gt;             
1 2017-02-08 09:00:00
2 2017-02-08 10:00:00
3 2017-02-08 11:00:00
4 2017-02-08 12:00:00
5 2017-02-08 13:00:00

dd2 &lt;- data_raw %&gt;% 
  mutate(`Bout Start` = mdy_hms(`Bout Start`), 
         `Bout End` = mdy_hms(`Bout End`), 
         is = format(`Bout Start`, &quot;%H&quot;) != format(`Bout End`, &quot;%H&quot;)) %&gt;%
  uncount(is + 1) %&gt;% 
  group_by(grp = consecutive_id(is)) %&gt;% 
  mutate(`Bout Start` = if_else(is &amp; row_number() == 2, 
     ymd_hms(format(first(`Bout End`), &quot;%Y-%m-d %H:00:00&quot;)), `Bout Start`), 
         `Bout End` = if_else(is &amp; row_number() == 1, 
     ymd_hms(format(first(`Bout End`), &quot;%Y-%m-%d %H:00:00&quot;)), `Bout End`), 
         `Time in Bout` = `Bout End` - `Bout Start`, 
         ID = ymd_hms(format(`Bout Start`, &quot;%Y-%m-d %H:00:00&quot;)), is = NULL) %&gt;% 
  ungroup() %&gt;% 
  select(-grp)

dd2
# A tibble: 6 &#215; 4
  `Bout Start`        `Bout End`          `Time in Bout` ID                 
  &lt;dttm&gt;              &lt;dttm&gt;              &lt;drtn&gt;         &lt;dttm&gt;             
1 2017-02-08 09:01:00 2017-02-08 09:12:00 11 mins        2017-02-08 09:00:00
2 2017-02-08 09:23:00 2017-02-08 09:38:00 15 mins        2017-02-08 09:00:00
3 2017-02-08 09:42:00 2017-02-08 09:52:00 10 mins        2017-02-08 09:00:00
4 2017-02-08 11:49:00 2017-02-08 12:00:00 11 mins        2017-02-08 11:00:00
5 2017-02-08 12:00:00 2017-02-08 12:05:00  5 mins        2017-02-08 12:00:00
6 2017-02-08 13:39:00 2017-02-08 13:58:00 19 mins        2017-02-08 13:00:00

连接dd1和dd2，还分离Date和Hour，并用0替换缺失的日期/小时。

full_join(dd1, dd2, multiple=&quot;all&quot;) %&gt;% 
  mutate(`Time in Bout` = replace_na(`Time in Bout`, duration(0))) %&gt;% 
  summarize(`Time in Bout (Hourly)` = sum(`Time in Bout`), .by = ID) %&gt;% 
  separate(ID, c(&quot;Date&quot;, &quot;Hour&quot;), sep=&quot; &quot;)
Joining with `by = join_by(ID)`
# A tibble: 5 &#215; 3
  Date       Hour     `Time in Bout (Hourly)`
  &lt;chr&gt;      &lt;chr&gt;    &lt;drtn&gt;                 
1 2017-02-08 09:00:00 36 mins                
2 2017-02-08 10:00:00  0 mins                
3 2017-02-08 11:00:00 11 mins                
4 2017-02-08 12:00:00  5 mins                
5 2017-02-08 13:00:00 19 mins

英文:

A rather complex task, here is a tidyverse approach

get the sequence of dates/hours to enable filling of missing data, dd1
split hour-spanning times into their corresponding hour bin, dd2
join dd1 and dd2
also, on the fly, convert strings to dates and hours/full hours

Note, it's a dynamic approach; starting and ending hours/dates show up once they appear in the raw data.

library(dplyr) # &gt;= v1.1.0 for &quot;.by&quot; in full_join&#39;s summarize and consecutive_id
library(tidyr) # separate and replace_na
library(lubridate) # date functions
dd1 &lt;- tibble(ID = seq(
ymd_hms(format(first(mdy_hms(data_raw$`Bout Start`)),
&quot;%Y-%m-%d %H:00:00&quot;)), 
ymd_hms(format(last(mdy_hms(data_raw$`Bout Start`)), 
&quot;%Y-%m-%d %H:00:00&quot;)), 3600))
dd1
# A tibble: 5 &#215; 1
ID                 
&lt;dttm&gt;             
1 2017-02-08 09:00:00
2 2017-02-08 10:00:00
3 2017-02-08 11:00:00
4 2017-02-08 12:00:00
5 2017-02-08 13:00:00

dd2 &lt;- data_raw %&gt;% 
mutate(`Bout Start` = mdy_hms(`Bout Start`), 
`Bout End` = mdy_hms(`Bout End`), 
is = format(`Bout Start`, &quot;%H&quot;) != format(`Bout End`, &quot;%H&quot;)) %&gt;%
uncount(is + 1) %&gt;% 
group_by(grp = consecutive_id(is)) %&gt;% 
mutate(`Bout Start` = if_else(is &amp; row_number() == 2, 
ymd_hms(format(first(`Bout End`), &quot;%Y-%m-%d %H:00:00&quot;)), `Bout Start`), 
`Bout End` = if_else(is &amp; row_number() == 1, 
ymd_hms(format(first(`Bout End`), &quot;%Y-%m-%d %H:00:00&quot;)), `Bout End`), 
`Time in Bout` = `Bout End` - `Bout Start`, 
ID = ymd_hms(format(`Bout Start`, &quot;%Y-%m-%d %H:00:00&quot;)), is = NULL) %&gt;% 
ungroup() %&gt;% 
select(-grp)
dd2
# A tibble: 6 &#215; 4
`Bout Start`        `Bout End`          `Time in Bout` ID                 
&lt;dttm&gt;              &lt;dttm&gt;              &lt;drtn&gt;         &lt;dttm&gt;             
1 2017-02-08 09:01:00 2017-02-08 09:12:00 11 mins        2017-02-08 09:00:00
2 2017-02-08 09:23:00 2017-02-08 09:38:00 15 mins        2017-02-08 09:00:00
3 2017-02-08 09:42:00 2017-02-08 09:52:00 10 mins        2017-02-08 09:00:00
4 2017-02-08 11:49:00 2017-02-08 12:00:00 11 mins        2017-02-08 11:00:00
5 2017-02-08 12:00:00 2017-02-08 12:05:00  5 mins        2017-02-08 12:00:00
6 2017-02-08 13:39:00 2017-02-08 13:58:00 19 mins        2017-02-08 13:00:00

Joining dd1 and dd2, also separating Date and Hour and replace NA from missing dates/hours with 0.

full_join(dd1, dd2, multiple=&quot;all&quot;) %&gt;% 
mutate(`Time in Bout` = replace_na(`Time in Bout`, duration(0))) %&gt;% 
summarize(`Time in Bout (Hourly)` = sum(`Time in Bout`), .by = ID) %&gt;% 
separate(ID, c(&quot;Date&quot;, &quot;Hour&quot;), sep=&quot; &quot;)
Joining with `by = join_by(ID)`
# A tibble: 5 &#215; 3
Date       Hour     `Time in Bout (Hourly)`
&lt;chr&gt;      &lt;chr&gt;    &lt;drtn&gt;                 
1 2017-02-08 09:00:00 36 mins                
2 2017-02-08 10:00:00  0 mins                
3 2017-02-08 11:00:00 11 mins                
4 2017-02-08 12:00:00  5 mins                
5 2017-02-08 13:00:00 19 mins

通过集体智慧和协作来改善编程学习和解决问题的方式。致力于成为全球开发者共同参与的知识库，让每个人都能够通过互相帮助和分享经验来进步。

R编程 – 数据清洗 – 日期时间

问题

答案1

答案2

如何在echarts4r图表周围添加边距以适应轴名称？

合并查找表与数据框。

R ggplot2中的geom_quasirandom在按因子着色时表现奇怪。

Plotly – 更改 x 轴（日期）的范围

What's the correct way to type hint an empty list as a literal in python?

如何在Highcharts Gantt中更改本地化的星期名称

如何在同一个流中使用多个过滤器和映射函数？

如何使用Map/Set来将代码优化到O(n)？

.NET MAUI Android在GitHub Actions上构建失败，错误代码为1。

如何在Playwright视觉比较中屏蔽多个定位器？

在C++中，可以使用可变模板参数来检索类型的内部类型。

selenium.common.exceptions.StaleElementReferenceException: Message: stale element reference: stale element not found

Creating and opening a URL to log in to Website via Basic Auth with Robot Framework/Selenium (Python)

AG Grid 在上下文菜单中以大文本形式打开

发表评论