Using quanteda to tokenize large datasets with limited RAM


Question

I have a dataset of roughly 2.5 million rows of text, and I run into memory problems when I try to tokenize the whole dataset at once with quanteda. My initial approach was to split the dataset into smaller subsets, tokenize each one, and then combine the results into a list of lists. However, I'm having trouble getting the desired outcome: when I use purrr::flatten, I end up with a serialized list of integers corresponding to a vector of types, rather than the actual tokens.

I would greatly appreciate any suggestions or ideas on how to solve this. Here's the code I've implemented so far:

  # Tokenization function
  tokenize_subset <- function(subset_corpus) {
    tokens(
      subset_corpus,
      remove_numbers = TRUE,
      remove_punct = TRUE,
      remove_symbols = TRUE,
      remove_url = TRUE,
      remove_separators = TRUE,
      split_hyphens = TRUE
    ) %>%
      tokens_split(separator = "[[:digit:]]", valuetype = "regex") %>%
      tokens_tolower()
  }

  # Apply the tokenization function to each group of "ind"
  token_list <- lapply(unique(docvars(key_corpus, "ind")), function(i) {
    subset_corpus <- corpus_subset(key_corpus, subset = ind == i)
    tokenize_subset(subset_corpus)
  })
  token_list <- purrr::flatten(token_list)

Any suggestions on how to modify the code, or alternative approaches, would be highly appreciated. Thank you!


Answer 1

Score: 1

It's hard to know how to work around this without your dataset, or without knowing more about the length of the 2.5 million documents or your system limits (RAM).

But you could try this: split the input file into subsets (say, 500k documents each), then load each subset as a corpus, tokenize it, and save the tokens object to disk. Clear the memory, then process the next slice. At the end, clear the memory and use c() to combine the saved tokens objects into a single tokens object; a sketch follows below.
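
A minimal sketch of that workflow, reusing the tokenize_subset() helper from the question; the chunk size, the .rds file names, and the saveRDS()/readRDS() round trip are illustrative assumptions rather than part of the original answer:

  library(quanteda)

  # Split document indices into slices of ~500k documents (adjust to your RAM)
  chunk_size <- 500000
  idx <- seq_len(ndoc(key_corpus))
  chunks <- split(idx, ceiling(idx / chunk_size))

  # Tokenize each slice, save it to disk, and free memory before the next one
  for (i in seq_along(chunks)) {
    toks_i <- tokenize_subset(key_corpus[chunks[[i]]])
    saveRDS(toks_i, sprintf("tokens_chunk_%02d.rds", i))
    rm(toks_i)
    gc()
  }

  # Recombine with c(), which keeps the result a proper tokens object
  # (purrr::flatten() drops the tokens class and exposes the integer encoding)
  token_files <- sprintf("tokens_chunk_%02d.rds", seq_along(chunks))
  toks_all <- do.call(c, lapply(token_files, readRDS))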

Alternatively, if you can load the entire tokens object into memory, try setting:

  quanteda_options(tokens_block_size = 2000)

or a lower number, since this effectively processes the documents in batches and internally recompiles the integer table that tokens uses. The default is 100,000, but a lower value may keep you from hitting memory limits.
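
As a usage sketch (assuming the lower block size alone is enough and no chunking is needed; tokenize_subset() is the helper defined in the question):

  # Lower the internal batch size, then tokenize the full corpus in one call
  quanteda_options(tokens_block_size = 2000)
  toks_all <- tokenize_subset(key_corpus)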

