Why is as.factor() in R so slow and can it be improved?
Question
"我最近发现,as.factor() 函数的运行速度非常慢,特别是在具有长字符字符串的字符向量上。这似乎在dplyr::mutate()语句内部是一个特别的问题;在mutate()语句之外的向量上的操作似乎要快得多。是否有一种方法可以提高这个函数的性能或替代一个更快的函数?"
英文:
I've recently discovered that as.factor() runs very slowly, particularly on character vectors with long character strings. This seems to be a particular problem inside dplyr::mutate() statements; operating on vectors outside of mutate() statements seems to be much faster. Is there some way to speed up this function or to substitute a faster one?
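For reference, a minimal sketch of the comparison being described; the data, sizes, and object names below are illustrative and not from the original post:

# Compare as.factor() on a bare vector vs. the same call inside mutate()
library(dplyr)
long_strings = sprintf("%022.0f", runif(1e6) * 1e22)    # long numeric-character strings
df = data.frame(x = long_strings)
system.time(f1 <- as.factor(long_strings))              # on the vector directly
system.time(df2 <- mutate(df, x = as.factor(x)))        # inside mutate()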
Answer 1
Score: 1
Using the factor() function with the "levels" argument is much faster. The "levels" argument is key; without it, factor() is just as slow as as.factor(). An example is below:
require(microbenchmark)
require(tidyverse)

# Generate a random vector of 22-character-long strings consisting of numeric characters
random_char_vec = sprintf("%022.0f", runif(1e7) * 1e22)

# Put it into a tibble
random_num_tibble = tibble(random_char_vec = random_char_vec)

# The problem seems to be when the character strings are very long;
# if each element of random_char_vec is only five characters,
# this takes no time at all; at 22 digits it takes over two minutes.
microbenchmark(
  {
    factor_random_num = as.factor(random_char_vec)
  },
  times = 1)
Unit: seconds
                                                expr      min       lq     mean   median       uq      max neval
 { factor_random_num = as.factor(random_char_vec) } 146.2098 146.2098 146.2098 146.2098 146.2098 146.2098     1
# This takes two seconds.
microbenchmark(
  {
    factor_random_num = factor(random_char_vec, levels = unique(random_char_vec))
  },
  times = 1)
Unit: seconds
                                                                               expr      min       lq     mean   median       uq      max neval
 { factor_random_num = factor(random_char_vec, levels = unique(random_char_vec)) } 1.796813 1.796813 1.796813 1.796813 1.796813 1.796813     1
# The key to the speedup is precomputing the levels; without setting levels, no speedup.
microbenchmark(
  {
    factor_random_num = factor(random_char_vec)
  },
  times = 1)
Unit: seconds
                                            expr      min       lq     mean   median       uq      max neval
 { factor_random_num = factor(random_char_vec) } 123.8821 123.8821 123.8821 123.8821 123.8821 123.8821     1
I don't know why putting the call to factor() inside the mutate() statement sometimes leads to such slow performance.
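If the conversion has to happen inside mutate(), the same precomputed-levels trick can be applied there as well. A minimal sketch using the random_num_tibble built above (the object name factored_tibble is illustrative; timings will vary by machine):

# Apply the precomputed-levels trick inside mutate() on the tibble
microbenchmark(
  {
    factored_tibble = random_num_tibble %>%
      mutate(factor_random_num = factor(random_char_vec,
                                        levels = unique(random_char_vec)))
  },
  times = 1)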
Hope this helps someone encountering the same issue!
Answer 2
Score: 0
tl;dr: base functions can often be sped up by doing less, but it is often not worth the effort. data.table provides an alternative to as.factor.
Generally, if the function body (your_func without parentheses in the console) does not say something like .Primitive or .Internal, then significant speed-ups are possible. The question is whether it is worth the effort.
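As a quick illustration of that check (a sketch, not part of the original answer), typing the function name at the console shows whether it is plain R code or already dispatches to compiled internals:

# Typed at the console without parentheses:
as.factor   # prints a plain R function body, so there may be room for a user-level speedup
sum         # prints .Primitive("sum"): already implemented in C, so little to gain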
As a base case, take a 200-level factor of length 100 million: base as.factor takes ~3 seconds on my machine.
library(data.table)
library(stringi)

s <- stri_rand_strings(200, 20)   # 200 unique 20-character strings (the candidate levels)
set.seed(1)
chr <- sample(s, 1e8, TRUE)       # 100 million draws from those 200 strings
dt <- data.table(chr = chr)

bench::mark(
  transform(dt, chr = as.factor(chr)),
  dplyr::mutate(dt, chr = as.factor(chr)),
  check = FALSE
)
Switching to a data.table approach, using a package-internal function (which is inadvisable), roughly cuts that time in half.
start = proc.time()
dt[, chr := data.table:::as_factor(chr)]
timetaken(start)   # timetaken() is exported by data.table
Unless it is frequently run, the conversion is not worth the time it takes to type it.
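For completeness, here is a user-level sketch of the "doing less" idea from the tl;dr: skip the level sorting that factor() performs and build the factor directly. The helper fast_factor below is hypothetical (it keeps levels in order of first appearance and does no NA handling):

fast_factor <- function(x) {
  lev <- unique(x)                                          # unsorted levels, first-appearance order
  structure(match(x, lev), levels = lev, class = "factor")  # integer codes + levels attribute
}
str(fast_factor(chr))   # quick check on the chr vector from the benchmark above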