创建一个变量/列,使用从开始日期算起的13周期内事件的计数。

huangapple go评论75阅读模式
英文:

Create a variable/column using count of events in a 13-week period from start date

问题

数据框包括案例号(chr)、入职日期[Y-m-d](POSIXct)和每个登录事件的提出日期[Y-m-d h:m:s](POSIXct)。我正在寻找一种方法来计算入职日期和入职后13周(或91天)之间发生的登录事件(提出日期)的数量。

log_df <- data.frame(
  case_number = c("000920", "000920", "000920", "000920", "000921", "000921"),
  onboarding_date = as.Date(c("2020-10-27", "2020-10-27", "2020-10-27", "2020-10-27", "2020-11-08", "2020-11-08")),
  raised_at = as.POSIXct(c("2020-10-27 02:52:31", "2020-10-27 05:40:51", "2020-10-27 05:46:25", "2020-10-28 09:05:21", "2020-11-08 23:39:15", "2020-11-11 00:19:16"))
)

我认为我需要创建一个间隔对象,但我对此还不太熟悉:

int <- interval(ymd("2001-01-01"), ymd("2002-01-01"))
interval = int_shift(int, duration(days = 91))

接下来我不太确定该怎么做。

现在我是否可以创建一个表,只包括使用onboarding_date + interval排除在我感兴趣的91天窗口之外的案例的raised_at事件?然后再按case_number分组计数raised_at?

在接下来的部分,我陷入了困境,自己感到困惑:

log_df %>%
  group_by(case_number) %>%
  mutate(interval = onboarding_date - min(onboarding_date) < 91,
         count_events = sum(raised_at) >= 1,
         event_within_interval = if_else(interval & count_events, 
                                         1, 0)) %>%
  ungroup()

它不起作用,我似乎无法使它工作。

mutate()中的错误:

! 在计算 count_events = sum(raised_at) >= 1 时出现问题。i 错误发生在组1中:case_number = "ACU0000920"。由于 Summary.POSIXct() 中的错误而引起:! 'sum' 对于 "POSIXt" 对象未定义

我理解它无法计算POSIXct日期,但我不知道如何修复这个错误,使我的代码工作。我确定我漏掉了一些明显的东西。如果您能帮助我看出我漏掉了什么,我非常感谢您给一个刚刚入门的人的明智建议。

英文:

The dataframe consists of case_number (chr), onboarding_date Y-m-d, and raised_at [Y-m-d h:m:s] (POSIXct) for each login event. I'm searching for a way to count how many login events (raised_at) occured between the onboarding date and 13-weeks (or 91 days) after onboarding.

log_df &lt;- data.frame(
  case_number = c(&quot;000920&quot;, &quot;000920&quot;, &quot;000920&quot;, &quot;000920&quot;, &quot;000921&quot;, &quot;000921&quot;),
  onboarding_date = as.Date(c(&quot;2020-10-27&quot;, &quot;2020-10-27&quot;, &quot;2020-10-27&quot;, &quot;2020-10-27&quot;, &quot;2020-11-08&quot;, &quot;2020-11-08&quot;)),
  raised_at = as.POSIXct(c(&quot;2020-10-27 02:52:31&quot;, &quot;2020-10-27 05:40:51&quot;, &quot;2020-10-27 05:46:25&quot;, &quot;2020-10-28 09:05:21&quot;, &quot;2020-11-08 23:39:15&quot;, &quot;2020-11-11 00:19:16&quot;))
)

I believe I need to create an interval object but I'm pretty new at this:

int &lt;- interval(ymd(&quot;2001-01-01&quot;), ymd(&quot;2002-01-01&quot;))
interval = int_shift(int, duration(days = 91))

I'm not sure what to do next.

Can I now create a table that will only include raised_at events using onboarding_date+interval to exclude cases outside of my 91 day window of interest?
And then follow that up with a count of raised_at with group_by(case_number)?

I am getting stuck and confusing myself at the next part:

log_df %&gt;%
  group_by(case_number) %&gt;%
  mutate(interval = onboarding_date - min(onboarding_date) &lt; 91,
         count_events = sum(raised_at) &gt;= 1,
         event_within_interval = if_else(interval &amp; count_events, 
                                         1, 0)) %&gt;%
  ungroup()

It doesn't work and I can't seem to make it work.

Error in mutate():

> ! Problem while computing count_events = sum(raised_at) &gt;= 1. i The
> error occurred in group 1: case_number = "ACU0000920". Caused by error
> in Summary.POSIXct(): ! 'sum' not defined for "POSIXt" objects

I understand that it cannot count the POSIXct dates but I don't know how to fix this error in my code to make it work. I'm sure I'm missing something obvious. If you are able to help me see what I am missing, I really appreciate your sage advice for someone just starting out.

答案1

得分: 1

您可以这样测量日期间隔。根据您的数据,目前无需分组,因为onboarding_datecase_number上似乎是恒定的。不分组速度更快。

log_df  <- log_df |
    mutate(
        interval_days = as.Date(raised_at) - onboarding_date,
        in_window = interval_days > 91
    )

log_df
#   case_number onboarding_date           raised_at interval_days in_window
# 1   1  000920      2020-10-27 2020-10-27 02:52:31        0 days     FALSE
# 2   2  000920      2020-10-27 2020-10-27 05:40:51        0 days     FALSE
# 3   3  000920      2020-10-27 2020-10-27 05:46:25        0 days     FALSE
# 4   4  000920      2020-10-27 2020-10-28 09:05:21        1 days     FALSE
# 5   5  000921      2020-11-08 2020-11-08 23:39:15        0 days     FALSE
# 6   6  000921      2020-11-08 2020-11-11 00:19:16        3 days     FALSE

然后,要计算每个组中有多少个条目,只需:

log_df |
    filter(in_window) |
    group_by(case_number) |
    summarise(n = n())

您的示例数据中没有间隔超过91天的情况,但在您的真实数据中,这应该提供您所需的结果。我使用了dplyr语法,因为您在这里使用了它。然而,在基本R中做类似的事情,您可以使用以下方式:

# 这将产生与上面相同的数据框
log_df <- within(log_df, {
    interval_days <- as.Date(raised_at) - as.Date(onboarding_date)
    in_window <- interval_days > 91
})

# 这将为每个`case_number`生成一个带名称的计数数组,表示窗口外的计数
with(log_df, tapply(in_window, case_number, sum))

# 或者如果您想要一个数据框
aggregate(in_window ~ case_number, data = log_df, sum)
英文:

You can measure the interval in days like this. There's no need to group at this stage, by the look of your data, as the onboarding_date appears to be constant by case_number. It's quicker not to group.

log_df  &lt;- log_df |&gt;
    mutate(
        interval_days = as.Date(raised_at) - onboarding_date,
        in_window = interval_days &gt; 91
    )

log_df
#   case_number onboarding_date           raised_at interval_days in_window
# 1   1  000920      2020-10-27 2020-10-27 02:52:31        0 days     FALSE
# 2   2  000920      2020-10-27 2020-10-27 05:40:51        0 days     FALSE
# 3   3  000920      2020-10-27 2020-10-27 05:46:25        0 days     FALSE
# 4   4  000920      2020-10-27 2020-10-28 09:05:21        1 days     FALSE
# 5   5  000921      2020-11-08 2020-11-08 23:39:15        0 days     FALSE
# 6   6  000921      2020-11-08 2020-11-11 00:19:16        3 days     FALSE

Then to count how many are in each group simply:

log_df |&gt;
    filter(in_window) |&gt;
    group_by(case_number) |&gt;
    summarise(n = n())

Your sample data doesn't have any cases where the interval is more than 91 days, but in your real data this should provide what you are asking for. I've used dplyr syntax as you're using it here. However, to do something similar in base R, you can do:

# This produces the same data.frame as above
log_df &lt;- within(log_df, {
    interval_days &lt;- as.Date(raised_at) - as.Date(onboarding_date)
    in_window &lt;- interval_days &gt; 91
})

# This produces a named array of counts outside the window for each `case_number` 
with(log_df, tapply(in_window, case_number, sum))

# Alternatively if you want a data.frame
aggregate(in_window ~ case_number, data = log_df, sum)

答案2

得分: 0

这应该可以工作!

log_df %>%
  mutate(events = map_dbl(onboarding_date, ~ sum(between(log_df$raised_at, .x, .x + days(91)))))
英文:

This should work!

log_df %&gt;%
  mutate(events = map_dbl(onboarding_date, ~ sum(between(log_df$raised_at, .x, .x + days(91)))))

huangapple
  • 本文由 发表于 2023年7月17日 15:46:59
  • 转载请务必保留本文链接:https://go.coder-hub.com/76702420.html
匿名

发表评论

匿名网友

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen:

确定