Subsetting a long-data.table using values of a column within the data.table and casting the other values
Question
I have a data.table with 22 million rows of the following form:
`dt <- data.table(
firm_id = c(1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2),
metric = c("AN_BILANT", "OPEX", "CAPEX","AN_BILANT","OPEX", "CAPEX", "AN_BILANT", "OPEX", "CAPEX", "AN_BILANT","OPEX", "CAPEX"),
value = c(2013, 10, 3,2014, 11, 5, 2007, 25, 10, 2009, 23, 7)
)`
I would like to generate the following output using data.table:
`output_dt <- data.table(
firm_id = c(1, 1, 1, 1, 2, 2, 2, 2),
metric = c("OPEX", "CAPEX", "OPEX", "CAPEX", "OPEX", "CAPEX", "OPEX", "CAPEX"),
AN_BILANT = c(2013, 2013, 2014, 2014, 2007, 2007, 2009, 2009),
value = c(10, 3, 11, 5, 25, 10, 23, 7)
)`
I initially tried the following:
dcast(dt[metric == "AN_BILANT"], firm_id ~ metric, value.var = "value", fun.aggregate = function(x) x)
but I get the following error:
> Error: Aggregating function(s) should take vector inputs and return a single value (length=1). However, function(s) returns length!=1. This value will have to be used to fill any missing combinations, and therefore must be length=1. Either override by setting the 'fill' argument explicitly or modify your function to handle this case appropriately.
I also tried:
dcast.data.table(dt[, N:=1:.N, metric], firm_id~metric, subset = (metric=="AN_BILANT") )
Here I get the warning:
> Aggregate function missing, defaulting to 'length'
Answer 1
Score: 5
I like akrun's approach, but if your data force you to choose an alternative (for example, if cumsum is too sensitive to the ordering of the data), you can try a dcast/melt approach like the one below. Note that since each firm_id contains more than one of each metric, we need to add another variable temporarily so that we don't over-reduce during the initial dcast.
library(data.table)
dcast(dt[, grp := seq_len(.N), by = .(firm_id, metric)],
firm_id + grp ~ metric, value.var = "value")[, grp := NULL] |>
melt(c("firm_id", "AN_BILANT"), variable.name = "metric")
# firm_id AN_BILANT metric value
# <num> <num> <fctr> <num>
# 1: 1 2013 CAPEX 3
# 2: 1 2014 CAPEX 5
# 3: 2 2007 CAPEX 10
# 4: 2 2009 CAPEX 7
# 5: 1 2013 OPEX 10
# 6: 1 2014 OPEX 11
# 7: 2 2007 OPEX 25
# 8: 2 2009 OPEX 23
Granted, the order of the rows is not the same, but double-reshaping typically does not guarantee that.
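To see why the temporary grp column matters, it can help to look at the intermediate tables. The following is a minimal sketch that rebuilds the sample data from the question; the names counts and wide are made up for illustration, and copy() is used only so that the question's dt is not modified by reference.

```r
library(data.table)

# Sample data from the question
dt <- data.table(
  firm_id = c(1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2),
  metric  = rep(c("AN_BILANT", "OPEX", "CAPEX"), times = 4),
  value   = c(2013, 10, 3, 2014, 11, 5, 2007, 25, 10, 2009, 23, 7)
)

# Without a helper index, every (firm_id, metric) cell holds two values,
# so dcast falls back to counting them (this is the "defaulting to
# 'length'" notice quoted in the question).
counts <- suppressMessages(suppressWarnings(
  dcast(dt, firm_id ~ metric, value.var = "value")
))
print(counts)

# Numbering the repeats within each (firm_id, metric) pair makes every
# cell unique, so no aggregation is needed.
wide <- dcast(copy(dt)[, grp := seq_len(.N), by = .(firm_id, metric)],
              firm_id + grp ~ metric, value.var = "value")
print(wide)
```

In wide, each row pairs one AN_BILANT year with its OPEX and CAPEX values, which is the shape the subsequent melt call expects.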
Answer 2
Score: 4
We can create the 'AN_BILANT' column by assigning (:=) the first 'value' within each group defined by the cumulative sum of a logical vector, and then remove the 'AN_BILANT' rows.
library(data.table)
dt[, AN_BILANT := value[1], cumsum(metric == 'AN_BILANT')][
metric != 'AN_BILANT']
-output
firm_id metric value AN_BILANT
1: 1 OPEX 10 2013
2: 1 CAPEX 3 2013
3: 1 OPEX 11 2014
4: 1 CAPEX 5 2014
5: 2 OPEX 25 2007
6: 2 CAPEX 10 2007
7: 2 OPEX 23 2009
8: 2 CAPEX 7 2009
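The grouping trick may be easier to follow with the group index printed on its own. This is a small sketch on the question's sample data; block_id and res are illustrative names, and copy() is an addition here so that dt itself is left untouched (the code above adds the column to dt by reference). It assumes, as in the sample, that every block of rows starts with its AN_BILANT row.

```r
library(data.table)

# Sample data from the question
dt <- data.table(
  firm_id = c(1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2),
  metric  = rep(c("AN_BILANT", "OPEX", "CAPEX"), times = 4),
  value   = c(2013, 10, 3, 2014, 11, 5, 2007, 25, 10, 2009, 23, 7)
)

# metric == "AN_BILANT" is TRUE at the start of each block; the running
# sum increases by one there, so each block of rows shares one id.
block_id <- cumsum(dt$metric == "AN_BILANT")
print(block_id)

# Same steps as above, applied to a copy: take the first value of each
# block (the year) as AN_BILANT, then drop the AN_BILANT rows.
res <- copy(dt)[, AN_BILANT := value[1], by = cumsum(metric == "AN_BILANT")][
  metric != "AN_BILANT"]
print(res)
```

Because the block id depends purely on row order, this approach relies on the rows staying in their original sequence, which is the ordering sensitivity the other answer mentions.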