Why is as.factor() in R so slow and can it be improved?

Question

I've recently discovered that as.factor() operates very slowly, particularly on character vectors with long character strings. This seems to be a particular problem inside dplyr::mutate() statements; operation on vectors outside of mutate() statements seems to be much faster. Is there some way to speed up performance of this function or to substitute a faster one?

Answer 1

Score: 1

Calling factor() with the levels argument supplied is much faster. Supplying levels is the key; without it, factor() is just as slow as as.factor(). An example is below:

library(microbenchmark)
library(tidyverse)

# Generate a vector of random 22-character strings made up of digits
random_char_vec = sprintf("%022.0f", runif(1e7) * 1e22)

# Put it into a tibble
random_num_tibble = tibble(random_char_vec = random_char_vec)

# The problem seems to arise when the character strings are very long:
# if each element of random_char_vec is only five characters long,
# this takes almost no time at all;
# at 22 digits it takes over two minutes.

microbenchmark(
  {
    factor_random_num = as.factor(random_char_vec)
  },
  times=1)
Unit: seconds
                                                   expr      min       lq     mean
 {     factor_random_num = as.factor(random_char_vec) } 146.2098 146.2098 146.2098
   median       uq      max neval
 146.2098 146.2098 146.2098     1

# This takes about two seconds.
microbenchmark(
  {
    factor_random_num = factor(random_char_vec, levels = unique(random_char_vec))
  },
  times=1)
Unit: seconds
                                                                                  expr
 {     factor_random_num = factor(random_char_vec, levels = unique(random_char_vec)) }
      min       lq     mean   median       uq      max neval
 1.796813 1.796813 1.796813 1.796813 1.796813 1.796813     1

# The key to the speedup is precomputing the levels; without the levels argument there is no speedup.
microbenchmark(
  {
    factor_random_num = factor(random_char_vec)
  },
  times=1)
Unit: seconds
                                                expr      min       lq     mean   median
 {     factor_random_num = factor(random_char_vec) } 123.8821 123.8821 123.8821 123.8821
       uq      max neval
 123.8821 123.8821     1
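
One thing worth checking on a small input before swapping this in: when levels is not supplied, factor() sorts the unique values to build the levels, whereas levels = unique(x) keeps them in order of first appearance. The labels come out the same, but the level order and the underlying integer codes can differ. A minimal sketch, assuming a made-up toy vector x (not the benchmark data):

```r
# Toy vector for illustration only (not the benchmark data)
x <- c("pear", "apple", "pear", "fig")

f_default <- factor(x)                      # levels are sorted
f_unique  <- factor(x, levels = unique(x))  # levels in order of first appearance

levels(f_default)   # "apple" "fig" "pear"
levels(f_unique)    # "pear" "apple" "fig"

# The labels are identical either way; only the level order
# (and therefore the integer codes) differs.
identical(as.character(f_default), as.character(f_unique))  # TRUE
as.integer(f_default)   # 3 1 3 2
as.integer(f_unique)    # 1 2 1 3
```

If downstream code depends on sorted levels (for plotting order or model contrasts, say), that difference matters; if the factor is only used for grouping, it usually does not.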

I don't know why putting the call to factor() inside the mutate() statement sometimes leads to such slow performance.

Hope this helps someone encountering the same issue!

Answer 2

Score: 0

tl;dr: Base functions can often be sped up by doing less, but it is often not worth the effort. data.table provides an alternative to as.factor.

Generally, if the function body (printed by typing your_func without parentheses in the console) does not contain something like .Primitive or .Internal, then significant speed-ups are possible. The question is whether it is worth the effort.
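
That check can also be done programmatically instead of reading the printed body. The following is a rough sketch using only base R; calls_internal is a hypothetical helper name made up for this example, and it only looks at the function's own body, not at the functions it calls in turn:

```r
# calls_internal() is a hypothetical helper name used only for this sketch.
# It reports whether a function's own body contains a call to .Internal().
calls_internal <- function(fn) {
  any(grepl(".Internal(", deparse(body(fn)), fixed = TRUE))
}

is.primitive(sum)          # TRUE: sum is implemented directly in C
is.primitive(as.factor)    # FALSE: as.factor is an ordinary R closure

calls_internal(unique.default)  # TRUE: the work is handed to C code
calls_internal(as.factor)
```

A closure that is plain R all the way through is the kind of function where doing less work in R can pay off.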

As a baseline, for a 200-level factor of length 100 million, base as.factor takes ~3 seconds on my machine.

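library(data.table)
library(stringi)

# Set the seed before any random draw so the strings are reproducible too
set.seed(1)
s <- stri_rand_strings(200, 20)   # 200 random strings, 20 characters each
chr <- sample(s, 1e8, TRUE)       # 100 million elements
dt <- data.table(chr = chr)

# Requires the bench and dplyr packages to be installed
bench::mark(
  transform(dt, chr = as.factor(chr)),
  dplyr::mutate(dt, chr = as.factor(chr)),
  check = FALSE
)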

Switching to a data.table approach that uses a package-internal function (inadvisable), this can be roughly cut in half.

start = proc.time()
dt[,chr := data.table:::as_factor(chr)]
timetaken(start)

Unless it is frequently run, the conversion is not worth the time it takes to type it.
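
If relying on a ::: internal is a concern, the same "do less" idea can be sketched in base R alone. The helper below, fast_factor, is a hypothetical name invented for this example and has not been benchmarked here; it assumes character input and builds the factor by hand from sorted unique values and match(). Note that method = "radix" sorts strings in C-locale order, so the level order may differ from factor(x) in other locales.

```r
# fast_factor() is a hypothetical helper, not part of base R or data.table.
fast_factor <- function(x) {
  stopifnot(is.character(x))
  # Radix sort uses C-locale ordering; sort() also drops NA by default,
  # so NA never becomes a level.
  lev <- sort(unique(x), method = "radix")
  # match() returns integer codes, with NA for missing values
  structure(match(x, lev), levels = lev, class = "factor")
}

x <- c("b", "a", "c", "a", NA)
f <- fast_factor(x)
levels(f)                      # "a" "b" "c"
identical(as.character(f), x)  # TRUE
```

Whether this is actually faster than as.factor() on a given data set would need to be measured, for example with bench::mark as above.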

huangapple
  • Published on June 12, 2023, 12:12:47
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76453609.html