Using quanteda to tokenize large datasets with limited RAM


Question

I have a dataset of roughly 2.5 million rows of text, and I run into memory problems when I try to tokenize the whole dataset at once with quanteda. My initial approach was to split the dataset into smaller subsets, tokenize each one, and then combine the results into a list of lists. However, I'm having trouble getting the desired outcome: when I use purrr::flatten, I end up with a serialized list of integers corresponding to a vector of types, rather than the actual tokens.

I would greatly appreciate any suggestions or ideas on how to solve this. Here's the code I've implemented so far:

  # Tokenization function
  tokenize_subset <- function(subset_corpus) {
    tokens(
      subset_corpus,
      remove_numbers = TRUE,
      remove_punct = TRUE,
      remove_symbols = TRUE,
      remove_url = TRUE,
      remove_separators = TRUE,
      split_hyphens = TRUE
    ) %>%
      tokens_split(separator = "[[:digit:]]", valuetype = "regex") %>%
      tokens_tolower()
  }

  # Apply the tokenization function to each group of "ind"
  token_list <- lapply(unique(docvars(key_corpus, "ind")), function(i) {
    subset_corpus <- corpus_subset(key_corpus, subset = ind == i)
    tokenize_subset(subset_corpus)
  })
  token_list <- purrr::flatten(token_list)

Any suggestions on how to modify the code, or alternative approaches, would be highly appreciated. Thank you!


Answer 1

Score: 1

It's hard to know how to work around this without your dataset, or without knowing more about the length of the 2.5 million documents or your system limits (RAM).

But you could try this: split the input file into subsets (say, 500k documents each), then load each subset as a corpus, tokenize it, and save the tokens object to disk. Clear the memory, then process the next slice. At the end, clear the memory and use c() to combine the saved tokens objects into a single tokens object; a sketch follows below.
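
A minimal sketch of that workflow, reusing the tokenize_subset() helper from the question; the chunk size, the .rds file names, and the saveRDS()/readRDS() round trip are illustrative assumptions rather than part of the original answer:

  library(quanteda)

  # Split document indices into slices of ~500k documents (adjust to your RAM)
  chunk_size <- 500000
  idx <- seq_len(ndoc(key_corpus))
  chunks <- split(idx, ceiling(idx / chunk_size))

  # Tokenize each slice, save it to disk, and free memory before the next one
  for (i in seq_along(chunks)) {
    toks_i <- tokenize_subset(key_corpus[chunks[[i]]])
    saveRDS(toks_i, sprintf("tokens_chunk_%02d.rds", i))
    rm(toks_i)
    gc()
  }

  # Recombine with c(), which keeps the result a proper tokens object
  # (purrr::flatten() drops the tokens class and exposes the integer encoding)
  token_files <- sprintf("tokens_chunk_%02d.rds", seq_along(chunks))
  toks_all <- do.call(c, lapply(token_files, readRDS))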

Alternatively, if you can load the entire tokens object into memory, try setting:

  quanteda_options(tokens_block_size = 2000)

or a lower number, since this effectively processes the documents in batches and internally recompiles the integer table that tokens uses. The default is 100,000, but a lower value may keep you from hitting memory limits.
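
As a usage sketch (assuming the lower block size alone is enough and no chunking is needed; tokenize_subset() is the helper defined in the question):

  # Lower the internal batch size, then tokenize the full corpus in one call
  quanteda_options(tokens_block_size = 2000)
  toks_all <- tokenize_subset(key_corpus)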

