Subsetting a long-data.table using values of a column within the data.table and casting the other values

Question

I have a data.table with 22 million rows of the following form:

    library(data.table)

    dt <- data.table(
      firm_id = c(1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2),
      metric = c("AN_BILANT", "OPEX", "CAPEX", "AN_BILANT", "OPEX", "CAPEX", "AN_BILANT", "OPEX", "CAPEX", "AN_BILANT", "OPEX", "CAPEX"),
      value = c(2013, 10, 3, 2014, 11, 5, 2007, 25, 10, 2009, 23, 7)
    )

I would like to generate the following output using data.table:

    output_dt <- data.table(
      firm_id = c(1, 1, 1, 1, 2, 2, 2, 2),
      metric = c("OPEX", "CAPEX", "OPEX", "CAPEX", "OPEX", "CAPEX", "OPEX", "CAPEX"),
      AN_BILANT = c(2013, 2013, 2014, 2014, 2007, 2007, 2009, 2009),
      value = c(10, 3, 11, 5, 25, 10, 23, 7)
    )

I initially tried the following:

    dcast(dt[metric == "AN_BILANT"], firm_id ~ metric, value.var = "value", fun.aggregate = function(x) x)

but I get the following error:

> Error: Aggregating function(s) should take vector inputs and return a single value (length=1). However, function(s) returns length!=1. This value will have to be used to fill any missing combinations, and therefore must be length=1. Either override by setting the 'fill' argument explicitly or modify your function to handle this case appropriately.

I also tried:

    dcast.data.table(dt[, N := 1:.N, metric], firm_id ~ metric, subset = (metric == "AN_BILANT"))

Here I get the warning:

> Aggregate function missing, defaulting to 'length'
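
Both messages seem to trace back to the same cause: `firm_id ~ metric` does not identify a single row, so dcast has to aggregate. A minimal sketch that counts the rows per cell of that formula, using only the sample data from the question (the name `cells` is just for illustration):

```r
library(data.table)

dt <- data.table(
  firm_id = c(1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2),
  metric  = rep(c("AN_BILANT", "OPEX", "CAPEX"), times = 4),
  value   = c(2013, 10, 3, 2014, 11, 5, 2007, 25, 10, 2009, 23, 7)
)

# Count how many rows fall into each (firm_id, metric) cell.
# Any count above 1 means dcast must collapse several values into one.
cells <- dt[, .N, by = .(firm_id, metric)]
print(cells)
```

On the sample data every cell holds two rows, which is why an identity function returns a length-2 result (the error) and why dcast otherwise falls back to `length` (the warning).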

Answer 1

Score: 5


I like akrun's approach, but if the data forces you to choose an alternative (for instance, if the cumsum grouping is too sensitive to the row order of the data), you can try a dcast/melt approach like the one below. Note that since each firm_id has more than one row per metric, we need to add another variable temporarily so that we don't over-reduce during the initial dcast.

    library(data.table)

    # grp numbers the repeated (firm_id, metric) rows so dcast keeps one row per period
    dcast(dt[, grp := seq_len(.N), by = .(firm_id, metric)],
          firm_id + grp ~ metric, value.var = "value")[, grp := NULL] |>
      melt(c("firm_id", "AN_BILANT"), variable.name = "metric")
    #    firm_id AN_BILANT metric value
    #      <num>     <num> <fctr> <num>
    # 1:       1      2013  CAPEX     3
    # 2:       1      2014  CAPEX     5
    # 3:       2      2007  CAPEX    10
    # 4:       2      2009  CAPEX     7
    # 5:       1      2013   OPEX    10
    # 6:       1      2014   OPEX    11
    # 7:       2      2007   OPEX    25
    # 8:       2      2009   OPEX    23

Granted, the order of the rows is not the same, but double-reshaping typically does not guarantee that.
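
If the row order matters, one option is to keep the temporary counter through the melt and sort on it afterwards. This is only a sketch under a few assumptions: the names `tmp`, `wide` and `long` are made up for illustration, `rowid()` stands in for the `seq_len(.N)` grouping above, and `copy()` is used so that the original `dt` is left untouched.

```r
library(data.table)

dt <- data.table(
  firm_id = c(1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2),
  metric  = rep(c("AN_BILANT", "OPEX", "CAPEX"), times = 4),
  value   = c(2013, 10, 3, 2014, 11, 5, 2007, 25, 10, 2009, 23, 7)
)

# Work on a copy so dt itself does not gain a grp column
tmp <- copy(dt)[, grp := rowid(firm_id, metric)]

# One row per (firm_id, grp), one column per metric
wide <- dcast(tmp, firm_id + grp ~ metric, value.var = "value")

# Back to long form, keeping grp as an id column for now
long <- melt(wide, id.vars = c("firm_id", "grp", "AN_BILANT"),
             variable.name = "metric", value.name = "value")

# Sort by firm and by occurrence within the firm, then drop the helper
setorder(long, firm_id, grp)
long[, grp := NULL]
print(long)
```

Within each firm and period the metrics still come out in the column order produced by dcast (CAPEX before OPEX), not in the order of the original rows.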

Answer 2

Score: 4


We may create the 'AN_BILANT' column by assigning (:=) the first 'value' after grouping by the cumulative sum of a logical vector, and then remove those 'AN_BILANT' rows.

    library(data.table)

    # cumsum() starts a new group at every AN_BILANT row
    dt[, AN_BILANT := value[1], by = cumsum(metric == 'AN_BILANT')][
      metric != 'AN_BILANT']

Output:

       firm_id metric value AN_BILANT
    1:       1   OPEX    10      2013
    2:       1  CAPEX     3      2013
    3:       1   OPEX    11      2014
    4:       1  CAPEX     5      2014
    5:       2   OPEX    25      2007
    6:       2  CAPEX    10      2007
    7:       2   OPEX    23      2009
    8:       2  CAPEX     7      2009
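
The cumsum grouping relies on each AN_BILANT row coming directly before the OPEX/CAPEX rows it belongs to. A hedged sketch that checks this assumption first and works on a copy, so `dt` is not modified by reference; the names `res` and `block` are hypothetical helpers, not part of the answer above:

```r
library(data.table)

dt <- data.table(
  firm_id = c(1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2),
  metric  = rep(c("AN_BILANT", "OPEX", "CAPEX"), times = 4),
  value   = c(2013, 10, 3, 2014, 11, 5, 2007, 25, 10, 2009, 23, 7)
)

# Number the blocks: the counter increases at every AN_BILANT row
res <- copy(dt)[, block := cumsum(metric == "AN_BILANT")]

# Stop early if any block does not begin with its AN_BILANT row
stopifnot(res[, metric[1] == "AN_BILANT", by = block]$V1)

# Spread the block's first value (the AN_BILANT year) over the whole block
res[, AN_BILANT := value[1], by = block]

# Drop the AN_BILANT rows and the helper column
res <- res[metric != "AN_BILANT"][, block := NULL]
print(res)
```

If the rows can arrive in a different order, the check fails and the dcast/melt approach is probably the safer route.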

huangapple
  • Posted on February 27, 2023 at 10:41:52
  • When reposting, please keep the link to this article: https://go.coder-hub.com/75576365.html