英文:
Filter ID based on three conditions, huge dataframe
问题
以下是您要翻译的内容:
我有以下示例代码
df <- data.frame(id = c(1,1,1,2,2,2,2,3,3,4),
car = c("subaru", "audi", "subaru", "toyota", "toyota", "audi", "subaru", "nissan", "nissan", "chevrolet"),
buy_date = c("01/01/2000", "01/01/2001", "01/02/2001", "01/01/2000", "01/12/2004", "01/01/2005", "01/03/2005", "01/01/2000", "02/01/2000", "01/01/2010"))
df$buy_date <- as.Date(df$buy_date, format="%d/%m/%Y")
我想要筛选以下内容:
对于每个ID,如果他/她在180天内购买了两种不同类型的汽车,我想保留该ID的行。因此,它应该返回类似于以下列表:
| id | car | buy_date |
|:---|:----:|-------:|
|1 |audi |2001-01-01|
|1 |subaru |2001-02-01|
|2 |toyota |2004-12-01|
|2 |audi |2005-01-01|
|2 |subaru |2005-03-01|
但是,由于我的原始数据框有1200万行,通过创建新行(平方)来执行此操作的inner_join,这使得数据框对R Studio来说太大而无法处理。
我在想是否可能通过按行进行比较,例如,一个新列,该列提供了以天为单位的时间差异(每个ID的第一行为NA(而不是0)),然后进行一些筛选,比较第1行和第2行中的汽车名称以及天数差异 - 因此,如果汽车不同且天数不超过180天,它将保留行。
提前感谢
如果您需要进一步的帮助或解释,请随时告诉我。
英文:
I have the following sample code
df <- data.frame(id = c(1,1,1,2,2,2,2,3,3,4),
car = c("subaru", "audi", "subaru", "toyota", "toyota", "audi", "subaru", "nissan", "nissan", "chevrolet"),
buy_date = c("01/01/2000", "01/01/2001", "01/02/2001", "01/01/2000", "01/12/2004", "01/01/2005", "01/03/2005", "01/01/2000", "02/01/2000", "01/01/2010"))
df$buy_date <- as.Date(df$buy_date, format="%d/%m/%Y")
And I want filter for the following:
For each ID, I wanna keep the rows from the ID if he/she has bought two different type of cars within 180 days. So it should return a list something like this:
id | car | buy_date |
---|---|---|
1 | audi | 2001-01-01 |
1 | subaru | 2001-02-01 |
2 | toyota | 2004-12-01 |
2 | audi | 2005-01-01 |
2 | subaru | 2005-03-01 |
However, since my original dataframe have 12 million rows to start with, doing this through inner_join which creates new rows (squared), which makes the dataframe too big for R studio to handle.
I was thinking if it was possible to do a comparison by row, e.g., a new column which gives the time difference in days (with the first row of each ID being NA (and not 0)), for then to do some filtering that compares car name in row 1 and 2 and the difference in days - so if cars are different and the amount of days are 180 or below = it would keep the row.
Thanks in advance
答案1
得分: 1
你可能想要使用 data.table
来处理这个问题,因为在数据相对较大时,它通常比其他方法更快。
基本的逻辑如下:
- 计算每个ID的自上次购买以来的天数。
- 将每一行及其前一行的型号捕获到列表中,并计算有多少个唯一型号。
- 保留那些具有超过一个唯一型号且距离上次购买不超过180天的行。
我还将你的 Date
对象转换为 IDate
,这是一种用于快速排序和分组的整数日期存储方式。正如文档所述:
使用整数存储允许将日期和/或时间用作数据表键。对于范围小于100,000的正整数,分组和排序速度很快,因为可以使用基数排序。
总的来说,这应该相当快。就性能而言,由于这些计算是按组而不是按行执行的,最重要的因素将是你的12m行中有多少个不同的 id
。
英文:
You probably want to use data.table
for this, as it tends to be faster than alternatives when the data is fairly large.
Essentially the logic is as you set out:
- Calculate the number of days since the previous purchase, by ID.
- Capture the model of each row and the preceding row in a list, and count how many unique models they're are.
- Keep rows where there's more than one unique model type and the time since the previous purchase is less than or equal to 180 days.
library(data.table)
setDT(df)
df[
,
buy_date := as.IDate(buy_date)
][, `:=`(
days_since_last = buy_date - shift(buy_date),
types = split(
c(car, shift(car)),
ceiling(seq(.N))
)
),
by = .(id)
][
,
num_types := sapply(types, \(x) uniqueN(na.omit(unlist(x))))
][
,
keep := (
days_since_last <= 180 |
shift(days_since_last, type = "lead") <= 180
) & (
num_types > 1 | shift(num_types > 1, type = "lead")
),
by = .(id)
]
df[(keep), .(id, car, buy_date)]
# id car buy_date
# <num> <char> <IDat>
# 1: 1 audi 2001-01-01
# 2: 1 subaru 2001-02-01
# 3: 2 toyota 2004-12-01
# 4: 2 audi 2005-01-01
# 5: 2 subaru 2005-03-01
I also converted your Date
object to IDate
, integer [date] storage for fast sorting and grouping. As the docs state:
> Using integer storage allows dates and/or times to be used as data table keys. With positive integers with a range less than 100,000, grouping and sorting is fast because radix sorting can be used.
Overall, this should be fairly fast. In terms of performance, as these calculations are done by group rather than by row, the biggest factor will be how many id
s are in your 12m rows.
答案2
得分: 0
The Tidyverse way:
library(dplyr)
df %>%
group_by(id) %>%
arrange(id, buy_date) %>%
mutate(days_between = buy_date - lag(buy_date)) %>%
mutate(morethanOneCar = +(n_distinct(car) > 1)) %>%
mutate(lessthan180 = +(days_between <= 180)) %>%
filter(sum(morethanOneCar) > 1 &
sum(lessthan180, na.rm = TRUE) > 0) %>%
filter(!is.na(days_between)) %>%
ungroup()
我已将代码部分翻译好,没有其他内容。
英文:
The Tidyverse way:
library(dplyr)
df %>%
group_by(id) %>%
arrange(id, buy_date) %>%
mutate(days_between = buy_date - lag(buy_date)) %>%
mutate(morethanOneCar = +(n_distinct(car) > 1)) %>%
mutate(lessthan180 = +(days_between <= 180)) %>%
filter(sum(morethanOneCar) > 1 &
sum(lessthan180, na.rm = TRUE) > 0) %>%
filter(!is.na(days_between)) %>%
ungroup()
I have here broken up the conditions into separate columns and tests to make it a bit clearer. I include all steps as separate columns, but it would be easy-peasy too exclude them.
- group by id and arrange by buy date
- calculate days since last purchase
- check if each id has bought more than one unique car brand
- check if there is less than or equal to 180 days since last purchase
- filter id on where each id has more than one car and that has purchased a car within 180 days from the last purchase
- filter away the first purchase of each of those id's
I am sure it is possible to make it more concise. For example you can make the calculations within the first filter and then filter on buy_date != min(buy_date)
.
通过集体智慧和协作来改善编程学习和解决问题的方式。致力于成为全球开发者共同参与的知识库,让每个人都能够通过互相帮助和分享经验来进步。
评论