Subsetting a long-data.table using values of a column within the data.table and casting the other values

Question

I have a data.table with 22 million rows of the following form:

`dt <- data.table(
  firm_id = c(1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2),
  metric = c("AN_BILANT", "OPEX", "CAPEX", "AN_BILANT", "OPEX", "CAPEX", "AN_BILANT", "OPEX", "CAPEX", "AN_BILANT", "OPEX", "CAPEX"),
  value = c(2013, 10, 3, 2014, 11, 5, 2007, 25, 10, 2009, 23, 7)
)`

I would like to generate the following output using data.table:

`output_dt <- data.table(
  firm_id = c(1, 1, 1, 1, 2, 2, 2, 2),
  metric = c("OPEX", "CAPEX", "OPEX", "CAPEX", "OPEX", "CAPEX", "OPEX", "CAPEX"),
  AN_BILANT = c(2013, 2013, 2014, 2014, 2007, 2007, 2009, 2009),
  value = c(10, 3, 11, 5, 25, 10, 23, 7)
)`

I initially tried the following:

dcast(dt[metric == "AN_BILANT"], firm_id ~ metric, value.var = "value", fun.aggregate = function(x) x)

but I get the following error:

> Error: Aggregating function(s) should take vector inputs and return a single value (length=1). However, function(s) returns length!=1. This value will have to be used to fill any missing combinations, and therefore must be length=1. Either override by setting the 'fill' argument explicitly or modify your function to handle this case appropriately.

I also tried:

dcast.data.table(dt[, N:=1:.N, metric], firm_id~metric, subset = (metric=="AN_BILANT") )

Here I get the warning:

> Aggregate function missing, defaulting to 'length'

Answer 1

Score: 5

I like akrun's approach, but if the data forces you to choose an alternative (for example, if the cumsum grouping is too sensitive to the ordering of the data), you can try a dcast/melt approach like below. Note that since each firm_id has more than one row per metric, we need to add another variable temporarily so that we don't over-reduce during the initial dcast.

library(data.table)
# dt is the table from the question; grp numbers the repeated rows per firm_id/metric
# (note that := adds grp to dt by reference)
dcast(dt[, grp := seq_len(.N), by = .(firm_id, metric)],
      firm_id + grp ~ metric, value.var = "value")[, grp := NULL] |>
  melt(c("firm_id", "AN_BILANT"), variable.name = "metric")
#    firm_id AN_BILANT metric value
#      <num>     <num> <fctr> <num>
# 1:       1      2013  CAPEX     3
# 2:       1      2014  CAPEX     5
# 3:       2      2007  CAPEX    10
# 4:       2      2009  CAPEX     7
# 5:       1      2013   OPEX    10
# 6:       1      2014   OPEX    11
# 7:       2      2007   OPEX    25
# 8:       2      2009   OPEX    23

Granted, the order of the rows is not the same, but double-reshaping typically does not guarantee that.
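
If the row order matters to you, one option is to sort after reshaping. This is only a minimal sketch on the sample data from the question: the names `tmp`, `wide` and `res` and the use of `copy()` are my own additions, and it assumes that grouping the rows by firm_id and AN_BILANT is the order you want (it will not reproduce an arbitrary original order).

```r
library(data.table)

# sample data from the question
dt <- data.table(
  firm_id = c(1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2),
  metric = c("AN_BILANT", "OPEX", "CAPEX", "AN_BILANT", "OPEX", "CAPEX",
             "AN_BILANT", "OPEX", "CAPEX", "AN_BILANT", "OPEX", "CAPEX"),
  value = c(2013, 10, 3, 2014, 11, 5, 2007, 25, 10, 2009, 23, 7)
)

# copy() keeps the helper column 'grp' out of the original dt
tmp <- copy(dt)[, grp := seq_len(.N), by = .(firm_id, metric)]

# wide: one row per firm_id/grp, one column per metric
wide <- dcast(tmp, firm_id + grp ~ metric, value.var = "value")[, grp := NULL]

# back to long, keeping AN_BILANT as an id column
res <- melt(wide, id.vars = c("firm_id", "AN_BILANT"), variable.name = "metric")

# put the rows of each firm/year next to each other (sorts res by reference)
setorder(res, firm_id, AN_BILANT)
res
```

With 22 million rows the extra `copy()` costs memory, so you may prefer to add and later drop `grp` on the original table instead.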

Answer 2

Score: 4

We can create the 'AN_BILANT' column by assigning (:=) the first 'value' within each group, where the groups are defined by the cumulative sum of the logical vector metric == 'AN_BILANT', and then remove the 'AN_BILANT' rows.

library(data.table)
dt[, AN_BILANT :=  value[1], cumsum(metric == 'AN_BILANT')][
    metric != 'AN_BILANT']

-output

    firm_id metric value AN_BILANT
1:       1   OPEX    10      2013
2:       1  CAPEX     3      2013
3:       1   OPEX    11      2014
4:       1  CAPEX     5      2014
5:       2   OPEX    25      2007
6:       2  CAPEX    10      2007
7:       2   OPEX    23      2009
8:       2  CAPEX     7      2009
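
To see why the grouping works, it may help to look at the grouping vector on its own. The sketch below uses the sample data from the question; the column name `block` is only for illustration, and the approach assumes that every block of rows starts with its 'AN_BILANT' row.

```r
library(data.table)

# sample data from the question
dt <- data.table(
  firm_id = c(1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2),
  metric = c("AN_BILANT", "OPEX", "CAPEX", "AN_BILANT", "OPEX", "CAPEX",
             "AN_BILANT", "OPEX", "CAPEX", "AN_BILANT", "OPEX", "CAPEX"),
  value = c(2013, 10, 3, 2014, 11, 5, 2007, 25, 10, 2009, 23, 7)
)

# TRUE on every AN_BILANT row; cumsum turns that into a running block number
dt[, block := cumsum(metric == "AN_BILANT")]
dt$block
# 1 1 1 2 2 2 3 3 3 4 4 4

# first value of each block is the AN_BILANT year; then drop those rows
res <- dt[, AN_BILANT := value[1], by = block][metric != "AN_BILANT"][, block := NULL]
res
```

If the rows were shuffled so that an 'AN_BILANT' row no longer comes first in its block, the block numbers would be wrong, which is the ordering sensitivity to keep in mind.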

huangapple
  • Published on 2023-02-27 10:41:52
  • When reposting, please keep the link to this article: https://go.coder-hub.com/75576365.html