Why is as.factor() in R so slow and can it be improved?
Question
"我最近发现,as.factor() 函数的运行速度非常慢,特别是在具有长字符字符串的字符向量上。这似乎在dplyr::mutate()语句内部是一个特别的问题;在mutate()语句之外的向量上的操作似乎要快得多。是否有一种方法可以提高这个函数的性能或替代一个更快的函数?"
英文:
I've recently discovered that as.factor() runs very slowly, particularly on character vectors with long character strings. This seems to be a particular problem inside dplyr::mutate() statements; operating on vectors outside of mutate() statements seems to be much faster. Is there some way to speed up this function or to substitute a faster one?
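For reference, a minimal sketch of the comparison being described; the data, sizes, and object names below are illustrative and not from the original post:

# Compare as.factor() on a bare vector vs. the same call inside mutate()
library(dplyr)
long_strings = sprintf("%022.0f", runif(1e6) * 1e22)    # long numeric-character strings
df = data.frame(x = long_strings)
system.time(f1 <- as.factor(long_strings))              # on the vector directly
system.time(df2 <- mutate(df, x = as.factor(x)))        # inside mutate()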
Answer 1
Score: 1
Using the factor() function with the "levels" argument is much faster. The "levels" argument is key; without it, factor() is just as slow as as.factor(). An example is below:
require(microbenchmark)
require(tidyverse)

# Generate a random vector of 22-character-long strings consisting of numeric characters
random_char_vec = sprintf("%022.0f", runif(1e7) * 1e22)

# Put it into a tibble
random_num_tibble = tibble(random_char_vec = random_char_vec)

# The problem seems to be when the character strings are very long;
# if each element of random_char_vec is only five characters,
# this takes no time at all; at 22 digits it takes over two minutes.
microbenchmark(
  {
    factor_random_num = as.factor(random_char_vec)
  },
  times = 1)
Unit: seconds
                                                expr      min       lq     mean   median       uq      max neval
 { factor_random_num = as.factor(random_char_vec) } 146.2098 146.2098 146.2098 146.2098 146.2098 146.2098     1
# This takes two seconds.
microbenchmark(
  {
    factor_random_num = factor(random_char_vec, levels = unique(random_char_vec))
  },
  times = 1)
Unit: seconds
                                                                               expr      min       lq     mean   median       uq      max neval
 { factor_random_num = factor(random_char_vec, levels = unique(random_char_vec)) } 1.796813 1.796813 1.796813 1.796813 1.796813 1.796813     1
# The key to the speedup is precomputing the levels; without setting levels, no speedup.
microbenchmark(
  {
    factor_random_num = factor(random_char_vec)
  },
  times = 1)
Unit: seconds
                                            expr      min       lq     mean   median       uq      max neval
 { factor_random_num = factor(random_char_vec) } 123.8821 123.8821 123.8821 123.8821 123.8821 123.8821     1
I don't know why putting the call to factor() inside the mutate() statement sometimes leads to such slow performance.
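If the conversion has to happen inside mutate(), the same precomputed-levels trick can be applied there as well. A minimal sketch using the random_num_tibble built above (the object name factored_tibble is illustrative; timings will vary by machine):

# Apply the precomputed-levels trick inside mutate() on the tibble
microbenchmark(
  {
    factored_tibble = random_num_tibble %>%
      mutate(factor_random_num = factor(random_char_vec,
                                        levels = unique(random_char_vec)))
  },
  times = 1)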
Hope this helps someone encountering the same issue!
Answer 2
Score: 0
tl;dr: base functions can often be sped up by doing less, but it is often not worth the effort. data.table provides an alternative to as.factor.
Generally, if the function body (your_func without parentheses in the console) does not say something like .Primitive or .Internal, then significant speed-ups are possible. The question is whether it is worth the effort.
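As a quick illustration of that check (a sketch, not part of the original answer), typing the function name at the console shows whether it is plain R code or already dispatches to compiled internals:

# Typed at the console without parentheses:
as.factor   # prints a plain R function body, so there may be room for a user-level speedup
sum         # prints .Primitive("sum"): already implemented in C, so little to gain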
As a base case, take a 200-level factor of length 100 million: base as.factor takes ~3 seconds on my machine.
library(data.table)
library(stringi)

s <- stri_rand_strings(200, 20)   # 200 unique 20-character strings (the candidate levels)
set.seed(1)
chr <- sample(s, 1e8, TRUE)       # 100 million draws from those 200 strings
dt <- data.table(chr = chr)

bench::mark(
  transform(dt, chr = as.factor(chr)),
  dplyr::mutate(dt, chr = as.factor(chr)),
  check = FALSE
)
Switching to a data.table approach, using a package-internal function (which is inadvisable), roughly cuts that time in half.
start = proc.time()
dt[, chr := data.table:::as_factor(chr)]
timetaken(start)   # timetaken() is exported by data.table
Unless it is frequently run, the conversion is not worth the time it takes to type it.
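For completeness, here is a user-level sketch of the "doing less" idea from the tl;dr: skip the level sorting that factor() performs and build the factor directly. The helper fast_factor below is hypothetical (it keeps levels in order of first appearance and does no NA handling):

fast_factor <- function(x) {
  lev <- unique(x)                                          # unsorted levels, first-appearance order
  structure(match(x, lev), levels = lev, class = "factor")  # integer codes + levels attribute
}
str(fast_factor(chr))   # quick check on the chr vector from the benchmark above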