2023年7月17日 15:46:59go评论100阅读模式

英文:

Create a variable/column using count of events in a 13-week period from start date

问题

数据框包括案例号（chr）、入职日期[Y-m-d]（POSIXct）和每个登录事件的提出日期[Y-m-d h:m:s]（POSIXct）。我正在寻找一种方法来计算入职日期和入职后13周（或91天）之间发生的登录事件（提出日期）的数量。

log_df <- data.frame(
  case_number = c("000920", "000920", "000920", "000920", "000921", "000921"),
  onboarding_date = as.Date(c("2020-10-27", "2020-10-27", "2020-10-27", "2020-10-27", "2020-11-08", "2020-11-08")),
  raised_at = as.POSIXct(c("2020-10-27 02:52:31", "2020-10-27 05:40:51", "2020-10-27 05:46:25", "2020-10-28 09:05:21", "2020-11-08 23:39:15", "2020-11-11 00:19:16"))
)

我认为我需要创建一个间隔对象，但我对此还不太熟悉：

int <- interval(ymd("2001-01-01"), ymd("2002-01-01"))
interval = int_shift(int, duration(days = 91))

接下来我不太确定该怎么做。

现在我是否可以创建一个表，只包括使用onboarding_date + interval排除在我感兴趣的91天窗口之外的案例的raised_at事件？然后再按case_number分组计数raised_at？

在接下来的部分，我陷入了困境，自己感到困惑：

log_df %>%
  group_by(case_number) %>%
  mutate(interval = onboarding_date - min(onboarding_date) < 91,
         count_events = sum(raised_at) >= 1,
         event_within_interval = if_else(interval & count_events, 
                                         1, 0)) %>%
  ungroup()

它不起作用，我似乎无法使它工作。

mutate()中的错误：

! 在计算 count_events = sum(raised_at) >= 1 时出现问题。i 错误发生在组1中：case_number = "ACU0000920"。由于 Summary.POSIXct() 中的错误而引起：! 'sum' 对于 "POSIXt" 对象未定义

我理解它无法计算POSIXct日期，但我不知道如何修复这个错误，使我的代码工作。我确定我漏掉了一些明显的东西。如果您能帮助我看出我漏掉了什么，我非常感谢您给一个刚刚入门的人的明智建议。

英文:

The dataframe consists of case_number (chr), onboarding_date Y-m-d, and raised_at [Y-m-d h:m:s] (POSIXct) for each login event. I'm searching for a way to count how many login events (raised_at) occured between the onboarding date and 13-weeks (or 91 days) after onboarding.

log_df &lt;- data.frame(
  case_number = c(&quot;000920&quot;, &quot;000920&quot;, &quot;000920&quot;, &quot;000920&quot;, &quot;000921&quot;, &quot;000921&quot;),
  onboarding_date = as.Date(c(&quot;2020-10-27&quot;, &quot;2020-10-27&quot;, &quot;2020-10-27&quot;, &quot;2020-10-27&quot;, &quot;2020-11-08&quot;, &quot;2020-11-08&quot;)),
  raised_at = as.POSIXct(c(&quot;2020-10-27 02:52:31&quot;, &quot;2020-10-27 05:40:51&quot;, &quot;2020-10-27 05:46:25&quot;, &quot;2020-10-28 09:05:21&quot;, &quot;2020-11-08 23:39:15&quot;, &quot;2020-11-11 00:19:16&quot;))
)

I believe I need to create an interval object but I'm pretty new at this:

int &lt;- interval(ymd(&quot;2001-01-01&quot;), ymd(&quot;2002-01-01&quot;))
interval = int_shift(int, duration(days = 91))

I'm not sure what to do next.

Can I now create a table that will only include raised_at events using onboarding_date+interval to exclude cases outside of my 91 day window of interest?
And then follow that up with a count of raised_at with group_by(case_number)?

I am getting stuck and confusing myself at the next part:

log_df %&gt;%
  group_by(case_number) %&gt;%
  mutate(interval = onboarding_date - min(onboarding_date) &lt; 91,
         count_events = sum(raised_at) &gt;= 1,
         event_within_interval = if_else(interval &amp; count_events, 
                                         1, 0)) %&gt;%
  ungroup()

It doesn't work and I can't seem to make it work.

Error in mutate():

> ! Problem while computing count_events = sum(raised_at) >= 1. i The
> error occurred in group 1: case_number = "ACU0000920". Caused by error
> in Summary.POSIXct(): ! 'sum' not defined for "POSIXt" objects

I understand that it cannot count the POSIXct dates but I don't know how to fix this error in my code to make it work. I'm sure I'm missing something obvious. If you are able to help me see what I am missing, I really appreciate your sage advice for someone just starting out.

答案1

得分: 1

您可以这样测量日期间隔。根据您的数据，目前无需分组，因为onboarding_date在case_number上似乎是恒定的。不分组速度更快。

log_df  <- log_df |
    mutate(
        interval_days = as.Date(raised_at) - onboarding_date,
        in_window = interval_days > 91
    )
log_df
#   case_number onboarding_date           raised_at interval_days in_window
# 1   1  000920      2020-10-27 2020-10-27 02:52:31        0 days     FALSE
# 2   2  000920      2020-10-27 2020-10-27 05:40:51        0 days     FALSE
# 3   3  000920      2020-10-27 2020-10-27 05:46:25        0 days     FALSE
# 4   4  000920      2020-10-27 2020-10-28 09:05:21        1 days     FALSE
# 5   5  000921      2020-11-08 2020-11-08 23:39:15        0 days     FALSE
# 6   6  000921      2020-11-08 2020-11-11 00:19:16        3 days     FALSE

然后，要计算每个组中有多少个条目，只需：

log_df |
    filter(in_window) |
    group_by(case_number) |
    summarise(n = n())

您的示例数据中没有间隔超过91天的情况，但在您的真实数据中，这应该提供您所需的结果。我使用了dplyr语法，因为您在这里使用了它。然而，在基本R中做类似的事情，您可以使用以下方式：

# 这将产生与上面相同的数据框
log_df <- within(log_df, {
    interval_days <- as.Date(raised_at) - as.Date(onboarding_date)
    in_window <- interval_days > 91
})
# 这将为每个`case_number`生成一个带名称的计数数组，表示窗口外的计数
with(log_df, tapply(in_window, case_number, sum))
# 或者如果您想要一个数据框
aggregate(in_window ~ case_number, data = log_df, sum)

英文:

You can measure the interval in days like this. There's no need to group at this stage, by the look of your data, as the onboarding_date appears to be constant by case_number. It's quicker not to group.

log_df  &lt;- log_df |&gt;
    mutate(
        interval_days = as.Date(raised_at) - onboarding_date,
        in_window = interval_days &gt; 91
    )
log_df
#   case_number onboarding_date           raised_at interval_days in_window
# 1   1  000920      2020-10-27 2020-10-27 02:52:31        0 days     FALSE
# 2   2  000920      2020-10-27 2020-10-27 05:40:51        0 days     FALSE
# 3   3  000920      2020-10-27 2020-10-27 05:46:25        0 days     FALSE
# 4   4  000920      2020-10-27 2020-10-28 09:05:21        1 days     FALSE
# 5   5  000921      2020-11-08 2020-11-08 23:39:15        0 days     FALSE
# 6   6  000921      2020-11-08 2020-11-11 00:19:16        3 days     FALSE

Then to count how many are in each group simply:

log_df |&gt;
    filter(in_window) |&gt;
    group_by(case_number) |&gt;
    summarise(n = n())

Your sample data doesn't have any cases where the interval is more than 91 days, but in your real data this should provide what you are asking for. I've used dplyr syntax as you're using it here. However, to do something similar in base R, you can do:

# This produces the same data.frame as above
log_df &lt;- within(log_df, {
    interval_days &lt;- as.Date(raised_at) - as.Date(onboarding_date)
    in_window &lt;- interval_days &gt; 91
})
# This produces a named array of counts outside the window for each `case_number` 
with(log_df, tapply(in_window, case_number, sum))
# Alternatively if you want a data.frame
aggregate(in_window ~ case_number, data = log_df, sum)

答案2

得分: 0

这应该可以工作！

log_df %>%
  mutate(events = map_dbl(onboarding_date, ~ sum(between(log_df$raised_at, .x, .x + days(91)))))

英文:

This should work!

log_df %&gt;%
  mutate(events = map_dbl(onboarding_date, ~ sum(between(log_df$raised_at, .x, .x + days(91)))))

通过集体智慧和协作来改善编程学习和解决问题的方式。致力于成为全球开发者共同参与的知识库，让每个人都能够通过互相帮助和分享经验来进步。

创建一个变量/列，使用从开始日期算起的13周期内事件的计数。

问题

答案1

答案2

在R中SQL查询中传递多个参数

IV回归与聚类标准误

如何插入字符串以使每一行具有相同的模式

增加一个值，如果一行中的数字发生变化。

如何在Playwright视觉比较中屏蔽多个定位器？

在C++中，可以使用可变模板参数来检索类型的内部类型。

selenium.common.exceptions.StaleElementReferenceException: Message: stale element reference: stale element not found

Creating and opening a URL to log in to Website via Basic Auth with Robot Framework/Selenium (Python)

AG Grid 在上下文菜单中以大文本形式打开

What's the correct way to type hint an empty list as a literal in python?

如何在Highcharts Gantt中更改本地化的星期名称

如何在同一个流中使用多个过滤器和映射函数？

如何使用Map/Set来将代码优化到O(n)？

.NET MAUI Android在GitHub Actions上构建失败，错误代码为1。