2023-02-27 10:41:52

Subsetting a long-data.table using values of a column within the data.table and casting the other values

Question

I have a data.table with 22 million rows of the following form:

`dt <- data.table(
  firm_id = c(1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2),
  metric = c("AN_BILANT", "OPEX", "CAPEX", "AN_BILANT", "OPEX", "CAPEX", "AN_BILANT", "OPEX", "CAPEX", "AN_BILANT", "OPEX", "CAPEX"),
  value = c(2013, 10, 3, 2014, 11, 5, 2007, 25, 10, 2009, 23, 7)
)`

I would like to generate the following output using data.table:

`output_dt <- data.table(
  firm_id = c(1, 1, 1, 1, 2, 2, 2, 2),
  metric = c("OPEX", "CAPEX", "OPEX", "CAPEX", "OPEX", "CAPEX", "OPEX", "CAPEX"),
  AN_BILANT = c(2013, 2013, 2014, 2014, 2007, 2007, 2009, 2009),
  value = c(10, 3, 11, 5, 25, 10, 23, 7)
)`

I initially tried the following:

dcast(dt[metric == "AN_BILANT"], firm_id ~ metric, value.var = "value", fun.aggregate = function(x) x)

but I get the following error:

> Error: Aggregating function(s) should take vector inputs and return a single value (length=1). However, function(s) returns length!=1. This value will have to be used to fill any missing combinations, and therefore must be length=1. Either override by setting the 'fill' argument explicitly or modify your function to handle this case appropriately.

I also tried:

dcast.data.table(dt[, N:=1:.N, metric], firm_id~metric, subset = (metric=="AN_BILANT") )

Here I get the warning:

> Aggregate function missing, defaulting to 'length'

Answer 1

Score: 5

I like akrun's approach, but if the data forces you to choose an alternative (for example, if the cumsum approach is too sensitive to the ordering of the data), you can try a dcast/melt approach like the one below. Note that since each firm_id has more than one row per metric, we need to add another variable temporarily so that we don't over-reduce during the initial dcast.

library(data.table)
dcast(dt[, grp := seq_len(.N), by = .(firm_id, metric)],
      firm_id + grp ~ metric, value.var = "value")[, grp := NULL] |>
  melt(c("firm_id", "AN_BILANT"), variable.name = "metric")
#    firm_id AN_BILANT metric value
#      <num>     <num> <fctr> <num>
# 1:       1      2013  CAPEX     3
# 2:       1      2014  CAPEX     5
# 3:       2      2007  CAPEX    10
# 4:       2      2009  CAPEX     7
# 5:       1      2013   OPEX    10
# 6:       1      2014   OPEX    11
# 7:       2      2007   OPEX    25
# 8:       2      2009   OPEX    23

Granted, the order of the rows is not the same, but double-reshaping typically does not guarantee that.
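
If row order matters, one option is to keep the temporary `grp` column through the melt and sort on it before dropping it. This is only a minimal sketch, assuming the question's `dt` and assuming the k-th AN_BILANT row within a firm belongs with the k-th OPEX and CAPEX rows; `wide` and `long` are illustrative names, and `rowid()` is used here in place of `seq_len(.N)` with `by`.

```r
library(data.table)

# The question's example data
dt <- data.table(
  firm_id = c(1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2),
  metric  = rep(c("AN_BILANT", "OPEX", "CAPEX"), times = 4),
  value   = c(2013, 10, 3, 2014, 11, 5, 2007, 25, 10, 2009, 23, 7)
)

# Number repeated (firm_id, metric) combinations 1, 2, ... so that each
# cell of the wide table receives exactly one value (no aggregation needed).
dt[, grp := rowid(firm_id, metric)]
wide <- dcast(dt, firm_id + grp ~ metric, value.var = "value")

# Keep grp as an id column during melt, sort on it, then drop it.
long <- melt(wide, id.vars = c("firm_id", "grp", "AN_BILANT"),
             variable.name = "metric")
setorder(long, firm_id, grp)
long[, grp := NULL]
```

With this, rows are grouped by firm and then by occurrence, although CAPEX still precedes OPEX within each group because melt stacks the measure columns in column order.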

Answer 2

Score: 4

We may create the 'AN_BILANT' column by assigning (:=) the first 'value' within each group, where the groups are defined by the cumulative sum of a logical vector, and then remove the 'AN_BILANT' rows:

library(data.table)
dt[, AN_BILANT := value[1], cumsum(metric == 'AN_BILANT')][
    metric != 'AN_BILANT']

-output

    firm_id metric value AN_BILANT
1:       1   OPEX    10      2013
2:       1  CAPEX     3      2013
3:       1   OPEX    11      2014
4:       1  CAPEX     5      2014
5:       2   OPEX    25      2007
6:       2  CAPEX    10      2007
7:       2   OPEX    23      2009
8:       2  CAPEX     7      2009
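
To make the grouping key visible, the same idea can be spelled out step by step. This is a rough sketch, assuming the question's `dt` and assuming every AN_BILANT row comes immediately before the OPEX/CAPEX rows it belongs to; `tmp`, `block` and `res` are illustrative names, and `copy()` is used so that the original `dt` is not modified by reference.

```r
library(data.table)

# The question's example data
dt <- data.table(
  firm_id = c(1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2),
  metric  = rep(c("AN_BILANT", "OPEX", "CAPEX"), times = 4),
  value   = c(2013, 10, 3, 2014, 11, 5, 2007, 25, 10, 2009, 23, 7)
)

tmp <- copy(dt)  # leave the original dt untouched

# Every AN_BILANT row starts a new block: the running count of
# AN_BILANT rows seen so far serves as the block id.
tmp[, block := cumsum(metric == "AN_BILANT")]

# Within each block the first value is the AN_BILANT year.
tmp[, AN_BILANT := value[1], by = block]

# Drop the AN_BILANT rows and the helper column.
res <- tmp[metric != "AN_BILANT"][, block := NULL]
```

Because the block id depends purely on row position, the result is only meaningful if the rows are in the order described above.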
