unique and unlist function alternative R
Question
I have a huge dataset (more than 10 million rows), and I need to count how often each value occurs in each column, as shown below.
First, I collect all the unique values across the columns:
lev <- unique(unlist(mtcars[, 8:11]))
Then I count them with the table function:
as.data.frame(sapply(mtcars[, 8:11], function(x) table(factor(x, levels = lev))))
This works only for small datasets. Most of the time, the command gets killed when I run it on a large dataset.
Is there a faster way to do this for large datasets, for example with dplyr?
Answer 1
Score: 2
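Perhaps a data.table approach might work for you.
It first melts the data to long format and then casts it back to wide. Each unique value becomes one row, and each cell holds the number of times that value occurs in the given column (i.e. variable). length is the default fun.aggregate of dcast for data.tables when a cell has more than one entry; it is passed explicitly here to avoid the message about the default.
library(data.table)
DT <- as.data.table(mtcars) # or setDT(mydata) to convert in place without copying
# long format: one row per (variable, value) pair, then count per value and column
dcast(melt(DT[, 8:11], measure.vars = names(DT)[8:11]),
      value ~ variable, fun.aggregate = length)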
#    value vs am gear carb
# 1:     0 18 19    0    0
# 2:     1 14 13    0    7
# 3:     2  0  0    0   10
# 4:     3  0  0   15    3
# 5:     4  0  0   12   10
# 6:     5  0  0    5    0
# 7:     6  0  0    0    1
# 8:     8  0  0    0    1
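For comparison, the same melt-then-count idea can be sketched in base R with a single table() call on the stacked values, which avoids building one factor per column. This is only an illustration on mtcars: the helper name count_by_column is made up here and is not part of the original answer, and it has not been benchmarked on 10 million rows.

```r
# Hypothetical helper (not from the original answer): cross-tabulate every
# value against the column it came from in one table() call.
count_by_column <- function(df) {
  values   <- unlist(df, use.names = FALSE)         # stack all columns into one vector
  variable <- rep(names(df), each = nrow(df))       # source column of each stacked value
  variable <- factor(variable, levels = names(df))  # keep the original column order
  tab <- table(value = values, variable = variable) # counts per (value, column) pair
  as.data.frame.matrix(tab)                         # wide data frame, one row per value
}

res <- count_by_column(mtcars[, 8:11])
res
```

The row names of res are the distinct values in sorted order, and each column of res sums to nrow(mtcars), which is a quick sanity check that no value was dropped.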