2023-05-29 23:02:02

R data.table rolling join etc to add dynamics to previous period

Question

I have a data.table with yearly data for several ids:

library(data.table)

example_data <- data.table(
  id = c('id1', 'id1', 'id1', 'id2', 'id2', 'id2', 'id2', 'id2'),
  year = c(2010, 2011, 2012, 2010, 2011, 2012, 2014, 2015),
  param1 = c(10, 10, 20, 10, 5, 7, 15, 15),
  param2 = c(100, 150, 200, 50, 50, 75, 15, 10),
  param3 = c(100, 100, 0, 500, 0, 400, 500, 1000)
)
> example_data
    id year param1 param2 param3
1: id1 2010     10    100    100
2: id1 2011     10    150    100
3: id1 2012     20    200      0
4: id2 2010     10     50    500
5: id2 2011      5     50      0
6: id2 2012      7     75    400
7: id2 2014     15     15    500
8: id2 2015     15     10   1000

I need to add three columns so that, for each id, I get the ratio of each param to its value in the previous year.
So far I have only managed to create three new columns holding the previous year's values:

try01 <- example_data[example_data, on = .(id, year < year), mult="last"]

> try01
    id year param1 param2 param3 i.param1 i.param2 i.param3
1: id1 2010     NA     NA     NA       10      100      100
2: id1 2011     10    100    100       10      150      100
3: id1 2012     10    150    100       20      200        0
4: id2 2010     NA     NA     NA       10       50      500
5: id2 2011     10     50    500        5       50        0
6: id2 2012      5     50      0        7       75      400
7: id2 2014      7     75    400       15       15      500
8: id2 2015     15     15    500       15       10     1000

This works, but the original data ends up on the right side of the table (the i.* columns) and the previous values on the left. It would be better to add the new columns on the right instead.

The actual question is: how can I directly add not the previous year's values but the ratio "param current year / param previous year"?
Maybe I could use shift() or a similar function?

Answer 1

Score: 4

You're on the right track with the suggestion of using shift(). That will be a lot quicker than doing a (cartesian) non-equi join and then subsetting the last value.

param_cols <- grep("^param", names(example_data), value = TRUE)
new_cols <- paste0(param_cols, "_ratio")
example_data[,
    (new_cols) := lapply(.SD, \(x) x / shift(x)),
    by = id,
    .SDcols = param_cols
]

#        id  year param1 param2 param3 param1_ratio param2_ratio param3_ratio
#    <char> <num>  <num>  <num>  <num>        <num>        <num>        <num>
# 1:    id1  2010     10    100    100           NA           NA           NA
# 2:    id1  2011     10    150    100     1.000000    1.5000000         1.00
# 3:    id1  2012     20    200      0     2.000000    1.3333333         0.00
# 4:    id2  2010     10     50    500           NA           NA           NA
# 5:    id2  2011      5     50      0     0.500000    1.0000000         0.00
# 6:    id2  2012      7     75    400     1.400000    1.5000000          Inf
# 7:    id2  2014     15     15    500     2.142857    0.2000000         1.25
# 8:    id2  2015     15     10   1000     1.000000    0.6666667         2.00

I've assumed you want the first value in each group to be NA, as there is no previous value. If not, you can change the fill argument in your shift() call. For example, if you want the first ratio to always be 1, fill with the group's first value so that the first row is divided by itself: shift(x, fill = x[1]).
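As a rough illustration of the fill variant just described, here is a minimal sketch on a single column. It rebuilds a cut-down copy of example_data (only param1) so that it runs on its own; the column name param1_ratio matches the naming used above.

```r
library(data.table)

# Cut-down copy of the question's data (param1 only), for illustration
example_data <- data.table(
  id = c("id1", "id1", "id1", "id2", "id2", "id2", "id2", "id2"),
  year = c(2010, 2011, 2012, 2010, 2011, 2012, 2014, 2015),
  param1 = c(10, 10, 20, 10, 5, 7, 15, 15)
)

# fill = param1[1] is evaluated per group, so the first row of each id
# is divided by itself and gets a ratio of 1 instead of NA
example_data[, param1_ratio := param1 / shift(param1, fill = param1[1]), by = id]

print(example_data)
```

Note that if the first value in a group is 0, this gives 0 / 0, which is NaN rather than 1.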

Similarly, you will get Inf rather than NA if the previous value is 0. If you don't want this, you can replace all the infinite values with NA, e.g. after the previous stage:

example_data[,
    # NA_real_ keeps both branches of fifelse() the same type (double)
    (new_cols) := lapply(.SD, \(x) fifelse(is.infinite(x), NA_real_, x)),
    .SDcols = new_cols
]
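
One thing worth keeping in mind: shift() works on the previous row within each id, not on the previous calendar year. In the example data, id2 has no 2013 row, so the 2014 ratio above is computed against 2012. It also relies on the rows being sorted by year within each id. If a strict "year minus one" comparison is what you need (that is an assumption about the requirement, not something the question states), a possible sketch is below. The column name param1_ratio_strict is made up for illustration.

```r
library(data.table)

# Cut-down copy of the question's data (param1 only), for illustration
example_data <- data.table(
  id = c("id1", "id1", "id1", "id2", "id2", "id2", "id2", "id2"),
  year = c(2010, 2011, 2012, 2010, 2011, 2012, 2014, 2015),
  param1 = c(10, 10, 20, 10, 5, 7, 15, 15)
)

# shift() depends on row order, so sort by id and year first
setorder(example_data, id, year)

# Keep the ratio only when the previous row is exactly one year earlier;
# the first row of each id and rows following a gap become NA
example_data[, param1_ratio_strict := fifelse(
    year - shift(year) == 1,
    param1 / shift(param1),
    NA_real_
), by = id]

print(example_data)
```

With this variant the 2014 row for id2 is NA, while the 2015 row is still compared against 2014.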
