R data.table rolling join etc to add dynamics to previous period
Question
To add the new columns on the right, holding the ratio "param current year / param previous year" for each parameter, you can use the following code:
example_data[, c("param1_ratio", "param2_ratio", "param3_ratio") :=
  .(param1/shift(param1), param2/shift(param2), param3/shift(param3)), by = id]
> example_data
    id year param1 param2 param3 param1_ratio param2_ratio param3_ratio
1: id1 2010     10    100    100           NA           NA           NA
2: id1 2011     10    150    100            1          1.5            1
3: id1 2012     20    200      0            2     1.333333            0
4: id2 2010     10     50    500           NA           NA           NA
5: id2 2011      5     50      0          0.5            1            0
6: id2 2012      7     75    400          1.4          1.5          Inf
7: id2 2014     15     15    500     2.142857          0.2         1.25
8: id2 2015     15     10   1000            1    0.6666667            2
The code above computes, within each id, the ratio of each parameter to its value in the previous row and adds the results as new columns (param1_ratio, param2_ratio and param3_ratio).
I have a data.table with yearly data for several ids:
library(data.table)

example_data <- data.table(
  id = c('id1', 'id1', 'id1', 'id2', 'id2', 'id2', 'id2', 'id2'),
  year = c(2010, 2011, 2012, 2010, 2011, 2012, 2014, 2015),
  param1 = c(10, 10, 20, 10, 5, 7, 15, 15),
  param2 = c(100, 150, 200, 50, 50, 75, 15, 10),
  param3 = c(100, 100, 0, 500, 0, 400, 500, 1000)
)
> example_data
    id year param1 param2 param3
1: id1 2010     10    100    100
2: id1 2011     10    150    100
3: id1 2012     20    200      0
4: id2 2010     10     50    500
5: id2 2011      5     50      0
6: id2 2012      7     75    400
7: id2 2014     15     15    500
8: id2 2015     15     10   1000
I need to add three columns so that, for each id, I get the ratio of each param to its value in the previous year.
So far I have at least managed to create three new columns that hold just the previous year's param values:
try01 <- example_data[example_data, on = .(id, year < year), mult = "last"]
> try01
    id year param1 param2 param3 i.param1 i.param2 i.param3
1: id1 2010     NA     NA     NA       10      100      100
2: id1 2011     10    100    100       10      150      100
3: id1 2012     10    150    100       20      200        0
4: id2 2010     NA     NA     NA       10       50      500
5: id2 2011     10     50    500        5       50        0
6: id2 2012      5     50      0        7       75      400
7: id2 2014      7     75    400       15       15      500
8: id2 2015     15     15    500       15       10     1000
This is OK, but here the original data ends up on the right side of the table (the i.* columns) and the previous values on the left. It would be better to add the new columns on the right instead.
The actual question is: how can I directly add not the values from the previous year, but the ratio "param current year / param previous year"?
Maybe I could use shift or a similar command?
Answer 1
Score: 4
You're on the right track with the suggestion of using shift(). That will be a lot quicker than doing a (cartesian) non-equi join and then subsetting the last value.
# Pick up every column whose name starts with "param"
param_cols <- grep("^param", names(example_data), value = TRUE)
new_cols <- paste0(param_cols, "_ratio")
# Within each id, divide each value by the value in the previous row
# (the \(x) lambda shorthand needs R >= 4.1)
example_data[, (new_cols) := lapply(
    .SD,
    \(x) x / shift(x)
  ),
  by = id,
  .SDcols = param_cols
]
#        id  year param1 param2 param3 param1_ratio param2_ratio param3_ratio
#    <char> <num>  <num>  <num>  <num>        <num>        <num>        <num>
# 1:    id1  2010     10    100    100           NA           NA           NA
# 2:    id1  2011     10    150    100     1.000000    1.5000000         1.00
# 3:    id1  2012     20    200      0     2.000000    1.3333333         0.00
# 4:    id2  2010     10     50    500           NA           NA           NA
# 5:    id2  2011      5     50      0     0.500000    1.0000000         0.00
# 6:    id2  2012      7     75    400     1.400000    1.5000000          Inf
# 7:    id2  2014     15     15    500     2.142857    0.2000000         1.25
# 8:    id2  2015     15     10   1000     1.000000    0.6666667         2.00
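One caveat worth checking: shift() takes the previous row within each id, not the previous calendar year. The rows therefore need to be sorted by year, and for id2 the 2014 ratio above is computed against 2012 because 2013 is missing. If a ratio across such a gap is not wanted, a minimal sketch could look like the following. The toy table dt and the rule "only consecutive years count" are assumptions for illustration, not something the question asks for.

```r
library(data.table)

# Toy data with a missing year (2013), mirroring id2 in the question
dt <- data.table(
  id     = c("id2", "id2", "id2"),
  year   = c(2012, 2014, 2015),
  param1 = c(7, 15, 15)
)

# shift() depends on row order, so sort by id and year first
setorder(dt, id, year)

# Keep the ratio only where the previous row is exactly one year earlier;
# the first row of each id has no previous row and stays NA
dt[, param1_ratio := fifelse(
  year - shift(year) == 1,
  param1 / shift(param1),
  NA_real_
), by = id]

dt
```

Here only the 2015 row gets a ratio (15 / 15 = 1). The 2014 row becomes NA because its predecessor is two years back.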
I've assumed you want the first value to be NA, as there is no previous value. If not, you can change the fill parameter in your shift() call. For example, if you want the first ratio in each group to always be 1, divide by the group's first value: shift(x, fill = x[1]).
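As a small illustration of the fill idea, the sketch below uses a cut-down toy table (dt, with a single param1 column, is an assumption made to keep the example short) and fills the missing "previous" value with the group's own first value, so each id starts at a ratio of 1.

```r
library(data.table)

# Cut-down toy data: two ids, two years each
dt <- data.table(
  id     = c("id1", "id1", "id2", "id2"),
  year   = c(2010, 2011, 2010, 2011),
  param1 = c(10, 20, 10, 5)
)

# fill = param1[1] replaces the missing previous value with the first
# value of the group, so the first ratio is param1[1] / param1[1] = 1
dt[, param1_ratio := param1 / shift(param1, fill = param1[1]), by = id]

dt
```

With this data the ratios come out as 1, 2, 1 and 0.5. Note that a first value of 0 would give 0 / 0 = NaN rather than 1.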
Similarly, you will get Inf rather than NA if the previous value is 0. If you don't want this, you can replace all the infinite values with NA, e.g. after the previous step:
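# NA_real_ keeps both fifelse() branches numeric; older data.table
# versions reject a plain logical NA here
example_data[,
  (new_cols) := lapply(.SD, \(x) fifelse(is.infinite(x), NA_real_, x)),
  .SDcols = new_cols
]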
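An alternative is to guard the denominator directly, in a single pass. This also covers 0 / 0, which yields NaN and is therefore not caught by is.infinite(). The following is a minimal sketch on a toy table (dt and its single param3 column are assumptions for illustration):

```r
library(data.table)

# Toy data containing a zero followed by another zero and a non-zero value
dt <- data.table(
  id     = "id2",
  year   = c(2010, 2011, 2012, 2013),
  param3 = c(500, 0, 0, 400)
)

dt[, param3_ratio := {
  prev <- shift(param3)                        # previous row's value, NA for the first row
  fifelse(prev == 0, NA_real_, param3 / prev)  # NA wherever the denominator is 0
}, by = id]

dt
```

The first row stays NA because there is no previous value, 0 / 500 gives 0, and both rows whose previous value is 0 become NA instead of NaN or Inf.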