R data.table rolling join etc to add dynamics to previous period

In short: to add new columns on the right that hold each parameter's "param current year / param previous year" ratio, you can use shift() grouped by id (example_data is the table defined in the question below):

example_data[, c("param1_ratio", "param2_ratio", "param3_ratio") :=
                .(param1/shift(param1), param2/shift(param2), param3/shift(param3)), by = id]

> example_data
    id year param1 param2 param3 param1_ratio param2_ratio param3_ratio
1: id1 2010     10    100    100           NA           NA           NA
2: id1 2011     10    150    100            1          1.5            1
3: id1 2012     20    200      0            2     1.333333            0
4: id2 2010     10     50    500           NA           NA           NA
5: id2 2011      5     50      0          0.5            1            0
6: id2 2012      7     75    400          1.4          1.5           Inf
7: id2 2014     15     15    500     2.142857          0.2          1.25
8: id2 2015     15     10   1000            1    0.6666667            2

This computes the "param current year / param previous year" ratio for each parameter within each id and adds the results as the new columns param1_ratio, param2_ratio and param3_ratio.

Question

I have a data.table with yearly data for several ids:

library(data.table)

example_data <- data.table(
  id = c('id1', 'id1', 'id1', 'id2', 'id2', 'id2', 'id2', 'id2'),
  year = c(2010, 2011, 2012, 2010, 2011, 2012, 2014, 2015),
  param1 = c(10, 10, 20, 10, 5, 7, 15, 15),
  param2 = c(100, 150, 200, 50, 50, 75, 15, 10),
  param3 = c(100, 100, 0, 500, 0, 400, 500, 1000)
)
> example_data
    id year param1 param2 param3
1: id1 2010     10    100    100
2: id1 2011     10    150    100
3: id1 2012     20    200      0
4: id2 2010     10     50    500
5: id2 2011      5     50      0
6: id2 2012      7     75    400
7: id2 2014     15     15    500
8: id2 2015     15     10   1000

I need to add three columns so that, for each id, I get the ratio of each param to its value in the previous year.
So far I have managed to add three columns that hold just the previous year's param values:

try01 <- example_data[example_data, on = .(id, year < year), mult = "last"]

> try01
    id year param1 param2 param3 i.param1 i.param2 i.param3
1: id1 2010     NA     NA     NA       10      100      100
2: id1 2011     10    100    100       10      150      100
3: id1 2012     10    150    100       20      200        0
4: id2 2010     NA     NA     NA       10       50      500
5: id2 2011     10     50    500        5       50        0
6: id2 2012      5     50      0        7       75      400
7: id2 2014      7     75    400       15       15      500
8: id2 2015     15     15    500       15       10     1000

This is OK, but the original data ends up on the right side of the table and the previous values on the left; it would be better to add the new columns on the right instead.

The actual question is: how can I directly add not the previous year's values but the ratio "param current year / param previous year"?
Maybe I could use shift() or a similar command?
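
For reference, the join from the question can be made to put the previous values on the right by selecting only the matched columns in j and binding them to the original table. This is a minimal sketch, not the approach the answer recommends; the names prev, try02 and the prev_* columns are made up for illustration.

```r
library(data.table)

example_data <- data.table(
  id = c("id1", "id1", "id1", "id2", "id2", "id2", "id2", "id2"),
  year = c(2010, 2011, 2012, 2010, 2011, 2012, 2014, 2015),
  param1 = c(10, 10, 20, 10, 5, 7, 15, 15),
  param2 = c(100, 150, 200, 50, 50, 75, 15, 10),
  param3 = c(100, 100, 0, 500, 0, 400, 500, 1000)
)

# Same non-equi self-join as try01, but j keeps only the values of the
# matched (earlier) rows. The x. prefix refers to the table being searched.
# The prev_* column names are illustrative.
prev <- example_data[
  example_data,
  on = .(id, year < year),
  mult = "last",
  .(prev_param1 = x.param1, prev_param2 = x.param2, prev_param3 = x.param3)
]

# With mult = "last" the join returns one row per row of example_data,
# in the same order, so the columns can be bound on the right.
try02 <- cbind(example_data, prev)

# Ratios can then be computed from the paired columns.
try02[, param1_ratio := param1 / prev_param1]
```

Rows without an earlier year for the same id (the first year of each id) get NA in the prev_* columns, and therefore NA in the ratio.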

Answer 1

Score: 4

You're on the right track with the suggestion of using shift(). That will be a lot quicker than doing a (cartesian) non-equi join and then subsetting the last value.

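# Anchored pattern: matches param1, param2, ... but not the *_ratio columns,
# so the code can be re-run without creating *_ratio_ratio columns.
param_cols <- grep("^param[0-9]+$", names(example_data), value = TRUE)
new_cols <- paste0(param_cols, "_ratio")
example_data[,
    (new_cols) := lapply(.SD, \(x) x / shift(x)),
    by = id,
    .SDcols = param_cols
]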

#        id  year param1 param2 param3 param1_ratio param2_ratio param3_ratio
#    <char> <num>  <num>  <num>  <num>        <num>        <num>        <num>
# 1:    id1  2010     10    100    100           NA           NA           NA
# 2:    id1  2011     10    150    100     1.000000    1.5000000         1.00
# 3:    id1  2012     20    200      0     2.000000    1.3333333         0.00
# 4:    id2  2010     10     50    500           NA           NA           NA
# 5:    id2  2011      5     50      0     0.500000    1.0000000         0.00
# 6:    id2  2012      7     75    400     1.400000    1.5000000          Inf
# 7:    id2  2014     15     15    500     2.142857    0.2000000         1.25
# 8:    id2  2015     15     10   1000     1.000000    0.6666667         2.00
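
Note that shift() takes the previous row within each id, not the previous calendar year. The result therefore depends on the rows being sorted by year, and for id2 the 2014 ratios above are relative to 2012. The sketch below assumes you would rather get NA when the preceding year is missing; the column name param1_ratio_strict is made up for illustration.

```r
library(data.table)

example_data <- data.table(
  id = c("id1", "id1", "id1", "id2", "id2", "id2", "id2", "id2"),
  year = c(2010, 2011, 2012, 2010, 2011, 2012, 2014, 2015),
  param1 = c(10, 10, 20, 10, 5, 7, 15, 15)
)

# shift() is positional, so make sure rows are ordered by year within id.
setorder(example_data, id, year)

example_data[,
    param1_ratio_strict := fifelse(
        year - shift(year) == 1,   # TRUE only if the previous row is exactly one year earlier
        param1 / shift(param1),
        NA_real_                   # gap year: no ratio
    ),
    by = id
]
```

Here the 2014 row of id2 gets NA because 2013 is absent, while all other rows match the plain shift() result.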

I've assumed you want the first value in each group to be NA, as there is no previous value. If not, you can change the fill argument in your shift() call. For example, if you want the first ratio to always be 1, divide by the group's first value: x / shift(x, fill = x[1]).
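
As a small illustration of the fill variant, here it is applied to the param1 values of id1 only. The table name dt is a placeholder.

```r
library(data.table)

# param1 values of id1 from the question
dt <- data.table(id = c("id1", "id1", "id1"), param1 = c(10, 10, 20))

# fill = param1[1] makes the first row divide by itself, giving 1 instead of NA.
dt[, param1_ratio := param1 / shift(param1, fill = param1[1]), by = id]

print(dt$param1_ratio)
```

If the first value of a group is 0, the first ratio is 0 / 0, which is NaN rather than 1.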

Similarly, you will get Inf rather than NA if the previous value is 0. If you don't want this, you can replace all the infinite values with NA, e.g. after the previous step:

example_data[,
    (new_cols) := lapply(.SD, \(x) fifelse(is.infinite(x), NA, x)),
    .SDcols = new_cols
]
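
One case is.infinite() does not cover is 0 / 0, which gives NaN. If you also want those as NA, a possible variant is to keep only finite values. This is a sketch on a small hypothetical table (dt and its values are invented to show two consecutive zeros).

```r
library(data.table)

# Hypothetical data with two consecutive zeros
dt <- data.table(id = "id2", param3 = c(500, 0, 0, 400))

dt[, param3_ratio := param3 / shift(param3), by = id]
# param3_ratio is now NA, 0, NaN (0 / 0), Inf (400 / 0)

ratio_cols <- "param3_ratio"
dt[,
    (ratio_cols) := lapply(.SD, function(x) fifelse(is.finite(x), x, NA_real_)),
    .SDcols = ratio_cols
]
```

is.finite() is FALSE for NA, NaN, Inf and -Inf, so all of them end up as NA and only ordinary numbers are kept.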

huangapple
  • Posted on May 29, 2023, 23:02:02
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76358397.html