Memory issues when obtaining TF-IDF data

Question

Intro

I am struggling with text classification of a big dataset of tweets, and I would be grateful if someone could point me in the right direction.

The big picture is that I need to train a classifier to distinguish between two classes on a huge dataset (up to 6 million texts). I have been doing this in the recipes framework and then running a glmnet lasso through tidymodels. The specific problem is that I run out of memory when calculating tf-idf.

Question

Which way should I direct my efforts to resolve this? I could do it more or less manually in batches to obtain all the tf-idf values and then manually combine them into a sparse matrix object. That sounds tedious, and surely someone has run into and solved this problem before. Another option is Spark, but it is far beyond my abilities at the moment and is probably overkill for a one-time task. Or maybe I am missing something, and the existing tools are capable of this?

Specifically, I am running into two kinds of problems when running the following (the variables should be self-explanatory, but I will provide full reproducible code later):

recipe <-
  recipe(Class ~ text, data = corpus) %>%
  step_tokenize(text) %>%
  step_stopwords(text) %>%
  step_tokenfilter(text, max_tokens = m) %>%
  step_tfidf(text) %>%
  prep()

If corpus is too big or m is too large, RStudio crashes. If they are moderately large, it throws a warning:

In asMethod(object) :
  sparse->dense coercion: allocating vector of size 1.2 GiB

I can't find anything about this online, and I don't understand it. Why is it trying to coerce something from sparse to dense? That surely spells trouble for any large dataset. Am I doing something wrong? If this is preventable, maybe I will have better luck with my full dataset?

Or is there no hope of step_tfidf coping with 6 million observations and no limit on max tokens?

P.S. tm and tidytext can't even begin to approach the issue.

Full Code

I'll give a reproducible example of what I am trying to do. This code sets up a corpus of 5 million+ tweet-length texts made of random words:

library(tidymodels)
library(dplyr)
library(stringr)
library(textrecipes)
library(hardhat)

url <- "https://gutenberg.org/cache/epub/2701/pg2701-images.html"
words <- readLines(url, encoding = "UTF-8") %>% str_extract_all('\\w+\\b') %>% unlist()
x <- rnorm(n = 6000000, mean = 18, sd = 14)
x <- x[x > 0]

corpus <- 
  lapply(x, function(i) {
    c('text' = paste(sample(words, size = i, replace = TRUE), collapse = ' '))
  }) %>% 
  bind_rows() %>% 
  mutate(ID = 1:n(), Class = factor(sample(c(0, 1), n(), replace = TRUE)))

So corpus looks something like this:

> corpus
# A tibble: 5,402,638 × 3
   text                                                                      ID Class
   <chr>                                                                  <int> <fct>
 1 included Fast at can aghast me some as article and ship things is          1 1    
 2 him to quantity while became man was childhood it that Who in on his the is 2 1    
 3 no There a pass are it in evangelical rather in direst the in a even r... 3 0    
 4 this would against his You disappeared have summit the vagrant in fine... 4 1    
 5 slippery the Judge ever life Moby But i will after sounding ship like p... 5 1    
 6 at can hope running                                                      6 1    
 7 Jeroboam even there slow though thought though I flukes yarn swore cal... 7 1    
 8 not if rocks ever lantern go last though at you white his that remains... 8 1    
 9 Nostril as p full the furnish are nor made towards except bivouacks p ... 9 1    
10 and p multitudinously body Archive fifty was of Greenland                10 0    
# ℹ 5,402,628 more rows
# ℹ Use `print(n = ...)` to see more rows

The corpus itself takes up around 1 GB of RAM.

I follow the standard modeling workflow, which I present here in full just for completeness:

# prep
corpus_split <- initial_split(corpus, strata = Class) # split
corpus_train <- training(corpus_split)
corpus_test <- testing(corpus_split)
folds <- vfold_cv(corpus_train) # k-fold CV prep
sparse_bp <- hardhat::default_recipe_blueprint(composition = "dgCMatrix") # use sparse matrices
smaller_lambda <- grid_regular(penalty(range = c(-5, 0)), levels = 20) # penalty grid for hyperparameter tuning

# recipe
recipe <-
  recipe(Class ~ text, data = corpus_train) %>%
  step_tokenize(text) %>%
  step_stopwords(text, custom_stopword_source = 'twclid') %>%
  step_tokenfilter(text, max_tokens = 10000) %>%
  step_tfidf(text)

# lasso model (tuning the penalty hyperparameter)
lasso_spec <- logistic_reg(penalty = tune(), mixture = 1) %>%
  set_mode("classification") %>%
  set_engine("glmnet")

# workflow
sparse_wf <- workflow() %>%
  add_recipe(recipe, blueprint = sparse_bp) %>%
  add_model(lasso_spec)

# fit
sparse_rs <- tune_grid(
  sparse_wf,
  folds,
  grid = smaller_lambda
)

Answer 1

Score: 5

Sadly, there isn't much you can do right now within tidymodels to solve your task. The {tidymodels} set of packages revolves around using {tibble}s as their common data container. This works great in many situations, except for sparse data.

When a recipe is used in a workflow, it has to hand the data off to parsnip as a tibble. That requires the data to be non-sparse, which in your case will blow up the data size wildly: if you have 6,000,000 observations and just 2,000 different tokens, you end up with roughly 96 GB of data...
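
As a back-of-the-envelope check of that figure (plain arithmetic assuming a dense matrix of 8-byte doubles, not tidymodels internals):

n_obs    <- 6e6
n_tokens <- 2000
n_obs * n_tokens * 8 / 1e9 # ~96 GB as a dense double matrix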

This is something I (I'm the author of {textrecipes} and one of the developers on the tidymodels team) want to happen at some point, but it is currently outside my control, as we first need to find a way to store sparse data in tibbles.

Answer 2

Score: 2

In case anybody needs it, I'll summarize my findings.

There are two problems: (i) creating a tf-idf matrix requires a lot of memory, and (ii) tidymodels currently only accepts tibbles as incoming data, as kindly pointed out by Emil Hvitfeldt. The solution is to generate the tf-idf data in a more memory-friendly way, sparsify it by the usual means, and then work directly with models that support sparse data.

The biggest trouble was that the existing solutions for calculating tf-idf (I tried tm and tidytext) are memory-inefficient. What I did was the following:

  1. Caveat: I have enough memory to load all the texts into memory in the first place.
  2. Store the texts as an arrow dataset with no grouping and max_rows_per_file = 1000000 (this number can be tailored to your memory requirements).
  3. Compute and store, as separate arrow datasets, the variables needed for calculating tf-idf: word counts, text lengths, and word-in-document counts.
  4. Loop through the files of one of the datasets, left-joining the data from the other two (this happens in memory, but because each file holds only a portion of the total observations, that is not a problem).
  5. Manually save each result as a parquet file within a dataset.
  6. Open the result as a dataset, collect, and cast it into a sparse matrix with tidytext::cast_sparse (see the code below).
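
The code below assumes corpus has already been unnested into one row per (TextID, word) occurrence. Here is a minimal sketch of that preprocessing step, assuming tidytext::unnest_tokens; original_corpus stands for the tibble with ID, text and Class from the question and is not defined anywhere above:

library(arrow)
library(dplyr)
library(stringr)
library(tidytext)
library(parallel)

# hypothetical preprocessing: one row per (TextID, word) occurrence
corpus <- original_corpus %>%
  rename(TextID = ID) %>%
  unnest_tokens(word, text)
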
corpus %>% 
  write_dataset('tokenized_texts', max_rows_per_file = 1000000)

ds <- open_dataset('tokenized_texts')

# N is the total number of texts
N <- ds %>%
  summarize(N = max(TextID)) %>%
  collect() %>%
  pull(N)

# number of times each word appears within a given text
ds.n <- 
  ds %>%
  group_by(TextID, word) %>%
  count() %>%
  collect()

ds.n %>%
  ungroup() %>%
  write_dataset('tokenized_arrow/ds.n', max_rows_per_file = 1000000)
rm(ds.n)
gc()

# total number of words in each text (the tf denominator)
ds.total <- 
  ds %>%   
  group_by(TextID) %>%
  count(name = 'TotalWords') %>%
  collect()
ds.total %>%
  ungroup() %>%
  write_dataset('tokenized_arrow/ds.total', max_rows_per_file = 1000000)
rm(ds.total)
gc()

# number of texts in which each word appears at least once
ds.docs <- 
  ds %>%
  group_by(TextID, word) %>%
  summarize() %>%
  group_by(word) %>%
  count(name = 'Documents') %>%
  collect()
ds.docs %>%
  ungroup() %>%
  write_dataset('tokenized_arrow/ds.docs', max_rows_per_file = 1000000)
rm(ds.docs)
gc()

# load the prepared datasets (paths match where they were written above)
ds.n <- open_dataset('tokenized_arrow/ds.n')
ds.total <- open_dataset('tokenized_arrow/ds.total')
ds.docs <- open_dataset('tokenized_arrow/ds.docs')

# loop through the ds.n files (mclapply is overkill here, this is a very fast step); assumes the directory "final" exists

files <- list.files('tokenized_arrow/ds.n', full.names = TRUE)
mclapply(files, mc.cores = parallel::detectCores() - 2, FUN = function(file) {
  outfile <- str_replace(file, 'ds\\.n', 'final')
  
  df <- read_parquet(file)
  ids <- unique(df$TextID)
  words <- unique(df$word)
  df %>% 
    left_join(
      ds.total %>% 
        filter(TextID %in% ids) %>% 
        collect()) %>%
    left_join(
      ds.docs %>%
        filter(word %in% words) %>%
        collect()
    ) %>%
    mutate(tf = n / TotalWords,
           idf = log(N / Documents),
           tf_idf = tf * idf) %>%
    write_parquet(outfile)
  return(NULL)
}) %>% invisible()

# sparsify
m <- 
  open_dataset('tokenized_arrow/final/') %>%
  collect() %>%
  cast_sparse(TextID, word, tf_idf)
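
From here the model can be fit outside tidymodels, directly on the sparse matrix. A minimal sketch with cv.glmnet, which accepts dgCMatrix input; labels here is a hypothetical one-row-per-text lookup of the Class outcome, not something defined above:

library(glmnet)

# hypothetical lookup: one Class label per text, matched to the row order of m
labels <- distinct(original_corpus, ID, Class)
y <- labels$Class[match(rownames(m), labels$ID)]

# cv.glmnet handles sparse input, so the tf-idf data never has to be densified
cv_fit <- cv.glmnet(x = m, y = y, family = "binomial", alpha = 1) # alpha = 1 -> lasso
cv_fit$lambda.min # penalty selected by cross-validation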

Hope this helps.
