Question


I have a huge dataset (more than 10 million rows), and I need to get counts of each value per column, as shown below.

To do this, I first collect all the unique values across the columns:

 lev <- unique(unlist(mtcars[, 8:11]))

Then I count them with the table function:

 as.data.frame(sapply(mtcars[, 8:11], function(x) table(factor(x, levels = lev))))

This only works for small datasets. On a large dataset, R usually kills the command before it finishes.

Is there a faster way to do this on large datasets, for example with dplyr?

Answer 1

Score: 2


Perhaps a data.table approach might work for you.

It first melts the data to long format, then casts it back to wide. This gives one row per unique value and, for each column (i.e. variable), the number of times that value occurs. The count comes from length, which dcast falls back to by default when no fun.aggregate is given; it is written out explicitly here.

library(data.table)

DT <- as.data.table(mtcars)  # or setDT(mydata) to convert in place without copying
# melt stacks columns 8:11 into (variable, value) pairs;
# dcast then counts each value per original column
dcast(melt(DT[, 8:11], measure.vars = names(DT)[8:11]),
      value ~ variable, fun.aggregate = length)
#    value vs am gear carb
# 1:     0 18 19    0    0
# 2:     1 14 13    0    7
# 3:     2  0  0    0   10
# 4:     3  0  0   15    3
# 5:     4  0  0   12   10
# 6:     5  0  0    5    0
# 7:     6  0  0    0    1
# 8:     8  0  0    0    1
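
If adding a package is not an option, the same counts can be sketched in base R. This is only a sketch under a few assumptions: the helper name count_levels is made up for illustration, the columns are assumed to share one type (all numeric here), and it has not been benchmarked on 10 million rows. It passes use.names = FALSE to unlist, because by default unlist builds a name for every element of a data frame, which is expensive on large inputs, and it replaces table(factor(...)) with match plus tabulate.

```r
# Hypothetical helper (not from the original post): count how often each
# distinct value occurs in every column of a data frame.
count_levels <- function(df) {
  # All distinct values across the columns; use.names = FALSE skips
  # building a name for every single element.
  lev <- sort(unique(unlist(df, use.names = FALSE)))

  # For each column, map values to their position in lev and count the
  # positions. tabulate() ignores the NA positions produced by NA values.
  counts <- lapply(df, function(x) tabulate(match(x, lev), nbins = length(lev)))

  # One row per distinct value, one count column per original column.
  data.frame(value = lev, counts, check.names = FALSE)
}

res <- count_levels(mtcars[, 8:11])
print(res)
```

On mtcars this should reproduce the table shown above, with the columns in their original order (vs, am, gear, carb).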
