Grouping by ID, Grouping by time (within 5 minutes of each activity), Find Time Difference of Activity in R

huangapple go评论90阅读模式
英文:

Grouping by ID, Grouping by time (within 5 minutes of each activity), Find Time Difference of Activity in R

问题

Sure, here's the translated content:

有没有一种方法可以让R按ID分组,然后识别时间上的“断裂”,然后计算时间差?
例如:

ID               TIME              
A                12/18/2019 4:45:10 AM
A                12/18/2019 4:45:11 AM
A                12/18/2019 9:06:59 PM               
B                12/18/2019 4:14:13 AM
B                12/18/2019 4:14:14 AM

有人知道如何找出A的时间持续时间吗?请注意,这不是一个difftime问题。我在上午4:45:10进行了某项活动,然后在上午4:45:11又进行了一次。然后我停止了这项活动,并在晚上9:06又重新开始了。是否有代码可以准确地分组ID,然后分组时间,同时检测时间上的巨大间隙,以避免不准确的值?

这不是正确的解决方案。

diff<- data %>%
mutate(diff = difftime(as.POSIXct(Endtime, format = "%m/%d/%Y %I:%M:%S %p"), 
as.POSIXct(Starttime, format = "%m/%d/%Y %I:%M:%S %p"), units = "secs"))

非常感谢任何帮助。
我将继续研究这个问题。谢谢。

英文:

Is there a way for R to group by ID, and then to identify a 'break' in time and then calculate time difference?
For instance:

                ID               TIME              
                A                12/18/2019 4:45:10 AM
                A                12/18/2019 4:45:11 AM
                A                12/18/2019 9:06:59 PM               
                B                12/18/2019 4:14:13 AM
                B                12/18/2019 4:14:14 AM

Does anyone know of a way to find the time duration for A? Notice this is not a difftime problem. I performed a certain activity at 4:45:10 am, then again at 4:45:11 am. I then stopped this activity, and picked back up at 9:06pm. Is there code that can accurately group IDs, and then group time whilst detecting a huge gap in the time to avoid inaccurate values?

This is not the correct solution.

                       diff&lt;- data %&gt;%
                       mutate(diff = difftime(as.POSIXct(Endtime, format = &quot;%m/%d/%Y %I:%M:%S %p&quot;), 
                       as.POSIXct(Starttime, format = &quot;%m/%d/%Y %I:%M:%S %p&quot;), units = &quot;secs&quot;))

Any help is greatly appreciated.
I will continue to research this. Thank you

答案1

得分: 1

这是一种方法:

library(lubridate)
sample_df$TIME = mdy_hms(sample_df$TIME)

sample_df = sample_df %>%
            group_by(ID) %>%
            # lag基本上将下一个值提前一步
            # 这样我们可以减去索引0和索引1、索引1和索引2等……
            mutate(time_diff = TIME - lag(TIME, n = 1, default = NA)) %>%
            mutate(time_diff = replace_na(time_diff, 0))

希望这能给你一些思路。为了理解,可以分为两步进行:

sample_df = sample_df %>%
            group_by(ID) %>%
            mutate(time_lag = dplyr::lag(TIME, n = 1, default = NA)) %>%
            mutate(time_diff = TIME - time_lag) %>%
            mutate(time_diff = replace_na(time_diff, 0))

检查一下 time_lag 列的样子。

英文:

Here's a way to do:

library(lubridate)
sample_df$TIME = mdy_hms(sample_df$TIME)

sample_df = sample_df %&gt;%
            group_by(ID) %&gt;%
            # lag basically bring the next value one step up
            # so we can subtract value at index 0 and index 1, index 1 and index 2 and so on....
            mutate(time_diff = TIME - lag(TIME, n = 1, default = NA)) %&gt;% 
            mutate(time_diff = replace_na(time_diff, 0))

Hope this gives you some idea.
For understanding, do it in two steps:

sample_df = sample_df %&gt;%
            group_by(ID) %&gt;%
            mutate(time_lag = dplyr::lag(TIME, n = 1, default = NA)) %&gt;% 
            mutate(time_diff = TIME - time_lag) %&gt;% 
            mutate(time_diff = replace_na(time_diff, 0))

Check how time_lag column looks.

答案2

得分: 1

就像我之前提到的那样,首先要将你的日期时间转换为日期时间对象;我使用lubridate来实现这一点。由于你想要在某个阈值内保持差异,我保存了一个阈值持续时间为5分钟,你可以根据需要进行更改。如果差异超过了这个阈值,就将它们设为NA

我将差异分为2步进行,这样你可以看到原始差异与去除长时间差异的差异。你可能只想在一步中完成这个操作。

library(dplyr)
library(lubridate)

thresh <- duration(5, units = "minutes")

sample_df %>%
  mutate(TIME = mdy_hms(TIME)) %>%
  group_by(ID) %>%
  mutate(diff1 = TIME - lag(TIME)) %>%
  mutate(delta = if_else(diff1 < thresh, diff1, NA_real_))
#> # A tibble: 10 x 4
#> # Groups:   ID [3]
#>    ID    TIME                diff1      delta  
#>    <chr> <dttm>              <drtn>     <drtn> 
#>  1 A     2019-12-18 04:45:10    NA secs NA secs
#>  2 A     2019-12-18 04:45:11     1 secs  1 secs
#>  3 A     2019-12-18 16:06:59 40908 secs NA secs
#>  4 A     2019-12-18 16:07:01     2 secs  2 secs
#>  5 B     2019-12-18 04:14:13    NA secs NA secs
#>  6 B     2019-12-18 04:14:14     1 secs  1 secs
#>  7 B     2019-12-18 04:14:15     1 secs  1 secs
#>  8 C     2019-12-18 04:59:49    NA secs NA secs
#>  9 C     2019-12-18 04:59:50     1 secs  1 secs
#> 10 C     2019-12-18 04:59:51     1 secs  1 secs

使用dplyr::if_else而不是基本的ifelse很方便,因为它使用严格的类型,这有助于确保我将delta列保持为持续时间对象,而不是失去其时间组件并只获得一个数值,这将是使用NA而不是NA_real_的情况。

英文:

Like I mentioned above, the first thing to do is convert your date-times to a date-time object; I'm using lubridate for this. Since you want to keep delta within some threshold, I saved a threshold duration of 5 minutes which you can change as needed. If differences are more than that, make them NA.

I'm doing the diffing in 2 steps, just so you can see the original difference vs the one with long differences removed. You'll probably want to just do that in one step.

<!-- language-all: lang-r -->

library(dplyr)
library(lubridate)

thresh &lt;- duration(5, units = &quot;minutes&quot;)

sample_df %&gt;%
  mutate(TIME = mdy_hms(TIME)) %&gt;%
  group_by(ID) %&gt;%
  mutate(diff1 = TIME - lag(TIME)) %&gt;%
  mutate(delta = if_else(diff1 &lt; thresh, diff1, NA_real_))
#&gt; # A tibble: 10 x 4
#&gt; # Groups:   ID [3]
#&gt;    ID    TIME                diff1      delta  
#&gt;    &lt;chr&gt; &lt;dttm&gt;              &lt;drtn&gt;     &lt;drtn&gt; 
#&gt;  1 A     2019-12-18 04:45:10    NA secs NA secs
#&gt;  2 A     2019-12-18 04:45:11     1 secs  1 secs
#&gt;  3 A     2019-12-18 16:06:59 40908 secs NA secs
#&gt;  4 A     2019-12-18 16:07:01     2 secs  2 secs
#&gt;  5 B     2019-12-18 04:14:13    NA secs NA secs
#&gt;  6 B     2019-12-18 04:14:14     1 secs  1 secs
#&gt;  7 B     2019-12-18 04:14:15     1 secs  1 secs
#&gt;  8 C     2019-12-18 04:59:49    NA secs NA secs
#&gt;  9 C     2019-12-18 04:59:50     1 secs  1 secs
#&gt; 10 C     2019-12-18 04:59:51     1 secs  1 secs

Using dplyr::if_else rather than the base ifelse was handy because it uses strict typing, which helped make sure I kept the delta column as a duration object, rather than losing its time component and just getting a numeric, which would be the case with NA instead of NA_real_.

huangapple
  • 本文由 发表于 2020年1月3日 23:22:53
  • 转载请务必保留本文链接:https://go.coder-hub.com/59581106.html
匿名

发表评论

匿名网友

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen:

确定