2023年4月17日 18:34:05go评论69阅读模式

英文:

Filter ID based on three conditions, huge dataframe

问题

以下是您要翻译的内容：

我有以下示例代码

df &lt;- data.frame(id = c(1,1,1,2,2,2,2,3,3,4), 
                 car = c(&quot;subaru&quot;, &quot;audi&quot;, &quot;subaru&quot;, &quot;toyota&quot;, &quot;toyota&quot;, &quot;audi&quot;, &quot;subaru&quot;, &quot;nissan&quot;, &quot;nissan&quot;, &quot;chevrolet&quot;),
                buy_date = c(&quot;01/01/2000&quot;, &quot;01/01/2001&quot;, &quot;01/02/2001&quot;, &quot;01/01/2000&quot;, &quot;01/12/2004&quot;, &quot;01/01/2005&quot;, &quot;01/03/2005&quot;, &quot;01/01/2000&quot;, &quot;02/01/2000&quot;, &quot;01/01/2010&quot;))
df$buy_date &lt;- as.Date(df$buy_date, format=&quot;%d/%m/%Y&quot;)

我想要筛选以下内容：
对于每个ID，如果他/她在180天内购买了两种不同类型的汽车，我想保留该ID的行。因此，它应该返回类似于以下列表：

| id | car | buy_date |
|:---|:----:|-------:|		
|1	|audi	|2001-01-01|		
|1	|subaru	|2001-02-01|		
|2	|toyota	|2004-12-01|		
|2	|audi	|2005-01-01|		
|2	|subaru	|2005-03-01|

但是，由于我的原始数据框有1200万行，通过创建新行（平方）来执行此操作的inner_join，这使得数据框对R Studio来说太大而无法处理。

我在想是否可能通过按行进行比较，例如，一个新列，该列提供了以天为单位的时间差异（每个ID的第一行为NA（而不是0）），然后进行一些筛选，比较第1行和第2行中的汽车名称以及天数差异 - 因此，如果汽车不同且天数不超过180天，它将保留行。

提前感谢

如果您需要进一步的帮助或解释，请随时告诉我。

英文:

I have the following sample code

df &lt;- data.frame(id = c(1,1,1,2,2,2,2,3,3,4), 
                 car = c(&quot;subaru&quot;, &quot;audi&quot;, &quot;subaru&quot;, &quot;toyota&quot;, &quot;toyota&quot;, &quot;audi&quot;, &quot;subaru&quot;, &quot;nissan&quot;, &quot;nissan&quot;, &quot;chevrolet&quot;),
                buy_date = c(&quot;01/01/2000&quot;, &quot;01/01/2001&quot;, &quot;01/02/2001&quot;, &quot;01/01/2000&quot;, &quot;01/12/2004&quot;, &quot;01/01/2005&quot;, &quot;01/03/2005&quot;, &quot;01/01/2000&quot;, &quot;02/01/2000&quot;, &quot;01/01/2010&quot;))
df$buy_date &lt;- as.Date(df$buy_date, format=&quot;%d/%m/%Y&quot;)

And I want filter for the following:
For each ID, I wanna keep the rows from the ID if he/she has bought two different type of cars within 180 days. So it should return a list something like this:

id	car	buy_date
1	audi	2001-01-01
1	subaru	2001-02-01
2	toyota	2004-12-01
2	audi	2005-01-01
2	subaru	2005-03-01

However, since my original dataframe have 12 million rows to start with, doing this through inner_join which creates new rows (squared), which makes the dataframe too big for R studio to handle.

I was thinking if it was possible to do a comparison by row, e.g., a new column which gives the time difference in days (with the first row of each ID being NA (and not 0)), for then to do some filtering that compares car name in row 1 and 2 and the difference in days - so if cars are different and the amount of days are 180 or below = it would keep the row.

Thanks in advance

答案1

得分: 1

你可能想要使用 data.table 来处理这个问题，因为在数据相对较大时，它通常比其他方法更快。

基本的逻辑如下：

计算每个ID的自上次购买以来的天数。
将每一行及其前一行的型号捕获到列表中，并计算有多少个唯一型号。
保留那些具有超过一个唯一型号且距离上次购买不超过180天的行。

我还将你的 Date 对象转换为 IDate，这是一种用于快速排序和分组的整数日期存储方式。正如文档所述：

使用整数存储允许将日期和/或时间用作数据表键。对于范围小于100,000的正整数，分组和排序速度很快，因为可以使用基数排序。

总的来说，这应该相当快。就性能而言，由于这些计算是按组而不是按行执行的，最重要的因素将是你的12m行中有多少个不同的 id。

英文:

You probably want to use data.table for this, as it tends to be faster than alternatives when the data is fairly large.

Essentially the logic is as you set out:

Calculate the number of days since the previous purchase, by ID.
Capture the model of each row and the preceding row in a list, and count how many unique models they're are.
Keep rows where there's more than one unique model type and the time since the previous purchase is less than or equal to 180 days.

library(data.table)
setDT(df)
df[
    ,
    buy_date := as.IDate(buy_date)
][, `:=`(
    days_since_last = buy_date - shift(buy_date),
    types = split(
        c(car, shift(car)),
        ceiling(seq(.N))
    )
),
by = .(id)
][
    ,
    num_types := sapply(types, \(x) uniqueN(na.omit(unlist(x))))
][
    ,
    keep := (
        days_since_last &lt;= 180 |
            shift(days_since_last, type = &quot;lead&quot;) &lt;= 180
    ) &amp; (
        num_types &gt; 1 | shift(num_types &gt; 1, type = &quot;lead&quot;)
    ),
    by = .(id)
]


df[(keep), .(id, car, buy_date)] 

#       id    car   buy_date
#    &lt;num&gt; &lt;char&gt;     &lt;IDat&gt;
# 1:     1   audi 2001-01-01
# 2:     1 subaru 2001-02-01
# 3:     2 toyota 2004-12-01
# 4:     2   audi 2005-01-01
# 5:     2 subaru 2005-03-01

I also converted your Date object to IDate, integer [date] storage for fast sorting and grouping. As the docs state:

> Using integer storage allows dates and/or times to be used as data table keys. With positive integers with a range less than 100,000, grouping and sorting is fast because radix sorting can be used.

Overall, this should be fairly fast. In terms of performance, as these calculations are done by group rather than by row, the biggest factor will be how many ids are in your 12m rows.

答案2

得分: 0

The Tidyverse way:

library(dplyr)

df %>%
    group_by(id) %>%
    arrange(id, buy_date) %>%
    mutate(days_between = buy_date - lag(buy_date)) %>%
    mutate(morethanOneCar = +(n_distinct(car) > 1)) %>%
    mutate(lessthan180 = +(days_between <= 180)) %>%
    filter(sum(morethanOneCar) > 1 &
           sum(lessthan180, na.rm = TRUE) > 0) %>%
    filter(!is.na(days_between)) %>%
    ungroup()

我已将代码部分翻译好，没有其他内容。

英文:

The Tidyverse way:

library(dplyr)

df %&gt;% 
    group_by(id) %&gt;% 
    arrange(id, buy_date) %&gt;% 
    mutate(days_between = buy_date - lag(buy_date)) %&gt;%
    mutate(morethanOneCar = +(n_distinct(car) &gt; 1)) %&gt;% 
    mutate(lessthan180 = +(days_between &lt;= 180)) %&gt;% 
    filter(sum(morethanOneCar) &gt; 1 &amp; 
               sum(lessthan180, na.rm = TRUE) &gt; 0) %&gt;% 
    filter(!is.na(days_between)) %&gt;% 
    ungroup()

I have here broken up the conditions into separate columns and tests to make it a bit clearer. I include all steps as separate columns, but it would be easy-peasy too exclude them.

group by id and arrange by buy date
calculate days since last purchase
check if each id has bought more than one unique car brand
check if there is less than or equal to 180 days since last purchase
filter id on where each id has more than one car and that has purchased a car within 180 days from the last purchase
filter away the first purchase of each of those id's

I am sure it is possible to make it more concise. For example you can make the calculations within the first filter and then filter on buy_date != min(buy_date).

通过集体智慧和协作来改善编程学习和解决问题的方式。致力于成为全球开发者共同参与的知识库，让每个人都能够通过互相帮助和分享经验来进步。

基于三个条件筛选 ID，大型数据框。

问题

答案1

答案2

如何根据省略/包含的变量模式在R中以编程方式创建公式。

如何从SPSS文件中的数据框中使用标签替换值？

在Rmarkdown中如何在千位数之间添加空格？

使用CASE语句根据同一列中的非空单元格来更新空单元格如何操作？

What's the correct way to type hint an empty list as a literal in python?

如何在Highcharts Gantt中更改本地化的星期名称

如何在同一个流中使用多个过滤器和映射函数？

如何使用Map/Set来将代码优化到O(n)？

.NET MAUI Android在GitHub Actions上构建失败，错误代码为1。

如何在Playwright视觉比较中屏蔽多个定位器？

在C++中，可以使用可变模板参数来检索类型的内部类型。

selenium.common.exceptions.StaleElementReferenceException: Message: stale element reference: stale element not found

Creating and opening a URL to log in to Website via Basic Auth with Robot Framework/Selenium (Python)

AG Grid 在上下文菜单中以大文本形式打开

发表评论