2020年1月3日 23:22:53go评论125阅读模式

英文:

Grouping by ID, Grouping by time (within 5 minutes of each activity), Find Time Difference of Activity in R

问题

Sure, here's the translated content:

有没有一种方法可以让R按ID分组，然后识别时间上的“断裂”，然后计算时间差？
例如：

ID               TIME              
A                12/18/2019 4:45:10 AM
A                12/18/2019 4:45:11 AM
A                12/18/2019 9:06:59 PM               
B                12/18/2019 4:14:13 AM
B                12/18/2019 4:14:14 AM

有人知道如何找出A的时间持续时间吗？请注意，这不是一个difftime问题。我在上午4:45:10进行了某项活动，然后在上午4:45:11又进行了一次。然后我停止了这项活动，并在晚上9:06又重新开始了。是否有代码可以准确地分组ID，然后分组时间，同时检测时间上的巨大间隙，以避免不准确的值？

这不是正确的解决方案。

diff<- data %>%
mutate(diff = difftime(as.POSIXct(Endtime, format = "%m/%d/%Y %I:%M:%S %p"), 
as.POSIXct(Starttime, format = "%m/%d/%Y %I:%M:%S %p"), units = "secs"))

非常感谢任何帮助。
我将继续研究这个问题。谢谢。

英文:

Is there a way for R to group by ID, and then to identify a 'break' in time and then calculate time difference?
For instance:

                ID               TIME              
                A                12/18/2019 4:45:10 AM
                A                12/18/2019 4:45:11 AM
                A                12/18/2019 9:06:59 PM               
                B                12/18/2019 4:14:13 AM
                B                12/18/2019 4:14:14 AM

Does anyone know of a way to find the time duration for A? Notice this is not a difftime problem. I performed a certain activity at 4:45:10 am, then again at 4:45:11 am. I then stopped this activity, and picked back up at 9:06pm. Is there code that can accurately group IDs, and then group time whilst detecting a huge gap in the time to avoid inaccurate values?

This is not the correct solution.

                       diff&lt;- data %&gt;%
                       mutate(diff = difftime(as.POSIXct(Endtime, format = &quot;%m/%d/%Y %I:%M:%S %p&quot;), 
                       as.POSIXct(Starttime, format = &quot;%m/%d/%Y %I:%M:%S %p&quot;), units = &quot;secs&quot;))

Any help is greatly appreciated.
I will continue to research this. Thank you

答案1

得分: 1

这是一种方法：

library(lubridate)
sample_df$TIME = mdy_hms(sample_df$TIME)
sample_df = sample_df %>%
            group_by(ID) %>%
            # lag基本上将下一个值提前一步
            # 这样我们可以减去索引0和索引1、索引1和索引2等……
            mutate(time_diff = TIME - lag(TIME, n = 1, default = NA)) %>%
            mutate(time_diff = replace_na(time_diff, 0))

希望这能给你一些思路。为了理解，可以分为两步进行：

sample_df = sample_df %>%
            group_by(ID) %>%
            mutate(time_lag = dplyr::lag(TIME, n = 1, default = NA)) %>%
            mutate(time_diff = TIME - time_lag) %>%
            mutate(time_diff = replace_na(time_diff, 0))

检查一下 time_lag 列的样子。

英文:

Here's a way to do:

library(lubridate)
sample_df$TIME = mdy_hms(sample_df$TIME)
sample_df = sample_df %&gt;%
            group_by(ID) %&gt;%
            # lag basically bring the next value one step up
            # so we can subtract value at index 0 and index 1, index 1 and index 2 and so on....
            mutate(time_diff = TIME - lag(TIME, n = 1, default = NA)) %&gt;% 
            mutate(time_diff = replace_na(time_diff, 0))

Hope this gives you some idea.
For understanding, do it in two steps:

sample_df = sample_df %&gt;%
            group_by(ID) %&gt;%
            mutate(time_lag = dplyr::lag(TIME, n = 1, default = NA)) %&gt;% 
            mutate(time_diff = TIME - time_lag) %&gt;% 
            mutate(time_diff = replace_na(time_diff, 0))

Check how time_lag column looks.

答案2

得分: 1

就像我之前提到的那样，首先要将你的日期时间转换为日期时间对象；我使用lubridate来实现这一点。由于你想要在某个阈值内保持差异，我保存了一个阈值持续时间为5分钟，你可以根据需要进行更改。如果差异超过了这个阈值，就将它们设为NA。

我将差异分为2步进行，这样你可以看到原始差异与去除长时间差异的差异。你可能只想在一步中完成这个操作。

library(dplyr)
library(lubridate)
thresh <- duration(5, units = "minutes")
sample_df %>%
  mutate(TIME = mdy_hms(TIME)) %>%
  group_by(ID) %>%
  mutate(diff1 = TIME - lag(TIME)) %>%
  mutate(delta = if_else(diff1 < thresh, diff1, NA_real_))
#> # A tibble: 10 x 4
#> # Groups:   ID [3]
#>    ID    TIME                diff1      delta  
#>    <chr> <dttm>              <drtn>     <drtn> 
#>  1 A     2019-12-18 04:45:10    NA secs NA secs
#>  2 A     2019-12-18 04:45:11     1 secs  1 secs
#>  3 A     2019-12-18 16:06:59 40908 secs NA secs
#>  4 A     2019-12-18 16:07:01     2 secs  2 secs
#>  5 B     2019-12-18 04:14:13    NA secs NA secs
#>  6 B     2019-12-18 04:14:14     1 secs  1 secs
#>  7 B     2019-12-18 04:14:15     1 secs  1 secs
#>  8 C     2019-12-18 04:59:49    NA secs NA secs
#>  9 C     2019-12-18 04:59:50     1 secs  1 secs
#> 10 C     2019-12-18 04:59:51     1 secs  1 secs

使用dplyr::if_else而不是基本的ifelse很方便，因为它使用严格的类型，这有助于确保我将delta列保持为持续时间对象，而不是失去其时间组件并只获得一个数值，这将是使用NA而不是NA_real_的情况。

英文:

Like I mentioned above, the first thing to do is convert your date-times to a date-time object; I'm using lubridate for this. Since you want to keep delta within some threshold, I saved a threshold duration of 5 minutes which you can change as needed. If differences are more than that, make them NA.

I'm doing the diffing in 2 steps, just so you can see the original difference vs the one with long differences removed. You'll probably want to just do that in one step.

library(dplyr)
library(lubridate)
thresh &lt;- duration(5, units = &quot;minutes&quot;)
sample_df %&gt;%
  mutate(TIME = mdy_hms(TIME)) %&gt;%
  group_by(ID) %&gt;%
  mutate(diff1 = TIME - lag(TIME)) %&gt;%
  mutate(delta = if_else(diff1 &lt; thresh, diff1, NA_real_))
#&gt; # A tibble: 10 x 4
#&gt; # Groups:   ID [3]
#&gt;    ID    TIME                diff1      delta  
#&gt;    &lt;chr&gt; &lt;dttm&gt;              &lt;drtn&gt;     &lt;drtn&gt; 
#&gt;  1 A     2019-12-18 04:45:10    NA secs NA secs
#&gt;  2 A     2019-12-18 04:45:11     1 secs  1 secs
#&gt;  3 A     2019-12-18 16:06:59 40908 secs NA secs
#&gt;  4 A     2019-12-18 16:07:01     2 secs  2 secs
#&gt;  5 B     2019-12-18 04:14:13    NA secs NA secs
#&gt;  6 B     2019-12-18 04:14:14     1 secs  1 secs
#&gt;  7 B     2019-12-18 04:14:15     1 secs  1 secs
#&gt;  8 C     2019-12-18 04:59:49    NA secs NA secs
#&gt;  9 C     2019-12-18 04:59:50     1 secs  1 secs
#&gt; 10 C     2019-12-18 04:59:51     1 secs  1 secs

Using dplyr::if_else rather than the base ifelse was handy because it uses strict typing, which helped make sure I kept the delta column as a duration object, rather than losing its time component and just getting a numeric, which would be the case with NA instead of NA_real_.

通过集体智慧和协作来改善编程学习和解决问题的方式。致力于成为全球开发者共同参与的知识库，让每个人都能够通过互相帮助和分享经验来进步。

Grouping by ID, Grouping by time (within 5 minutes of each activity), Find Time Difference of Activity in R

问题

答案1

答案2

Python的`for`循环不会更新`numpy`数组的值。

识别在28天内的指标演示和再次出席。

计算二进制图像的面积

从Shiny UI的sidebarPanel()和mainPanel()中移除边距。

如何在Playwright视觉比较中屏蔽多个定位器？

在C++中，可以使用可变模板参数来检索类型的内部类型。

selenium.common.exceptions.StaleElementReferenceException: Message: stale element reference: stale element not found

Creating and opening a URL to log in to Website via Basic Auth with Robot Framework/Selenium (Python)

AG Grid 在上下文菜单中以大文本形式打开

What's the correct way to type hint an empty list as a literal in python?

如何在Highcharts Gantt中更改本地化的星期名称

如何在同一个流中使用多个过滤器和映射函数？

如何使用Map/Set来将代码优化到O(n)？

.NET MAUI Android在GitHub Actions上构建失败，错误代码为1。