data.table::fread fails for larger file (long vectors not supported yet)
Question
fread() fails when reading a large file (~335 GB) with the error below; I'd appreciate any suggestions on how to resolve it.
```r
opt$input_file <- "sample-009_T/per_read_modified_base_calls.txt"
Error in data.table::fread(opt$input_file, nThread = 16) : 
  long vectors not supported yet: ../../src/include/Rinlinedfuns.h:537
Execution halted
```
File size and a snippet of the file:
```
(base) bash-4.2$ ls -thl per_read_modified_base_calls.txt
-rw-r--r-- 1 lih7 user 335G May 31 15:24 per_read_modified_base_calls.txt
(base) bash-4.2$ head per_read_modified_base_calls.txt 
read_id chrm    strand  pos     mod_log_prob    can_log_prob    mod_base
d1c2a9e7-8655-4393-8ab1-c1fa47b0dc5c    chr12   +       94372964        -8.814943313598633      -8.695793370588385      h
d1c2a9e7-8655-4393-8ab1-c1fa47b0dc5c    chr12   +       94372964        -0.00031583529198542237 -8.695793370588385      m
2109b127-c835-47f3-b215-c238438829b6    chr10   -       118929450       -3.0660934448242188     -5.948376270726361      h
2109b127-c835-47f3-b215-c238438829b6    chr10   -       118929450       -0.05046514421701431    -5.948376270726361      m
2109b127-c835-47f3-b215-c238438829b6    chr10   -       118929897       -8.683683395385742      -9.392607152489518      h
2109b127-c835-47f3-b215-c238438829b6    chr10   -       118929897       -0.00025269604520872235 -9.392607152489518      m
2109b127-c835-47f3-b215-c238438829b6    chr10   -       118929959       -8.341853141784668      -8.957908916643804      h
2109b127-c835-47f3-b215-c238438829b6    chr10   -       118929959       -0.0003671127778943628  -8.957908916643804      m
2109b127-c835-47f3-b215-c238438829b6    chr10   -       118929670       -3.8058860301971436     -9.161674497706297      h
```
Answer 1
Score: 7
It seems unlikely that you have enough RAM on your system to load a file of size 335GB. I suggest you find a "lazy" way of reading your data.
Up front: I'm assuming the file is really tab-delimited. If not, then I don't know that any lazy way is going to work well ...
Since you've tagged [tag:data.table], I'll assume that you'd like to resume with `data.table` syntax (unless you were attempting to use `data.table` solely for its alleged memory efficiency, which is certainly possible ... and it is efficient). That syntax is not immediately supported by either of the arrow/duckdb approaches listed below. However, once you `collect()` the data, you can easily `as.data.table()` it, at which point you go back to using `data.table` syntax.
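As a minimal sketch of that round trip (using a small stand-in tibble, since the real `collect()` result depends on the 335 GB file), the conversion back to `data.table` syntax might look like:

```r
library(dplyr)
library(data.table)

# Stand-in for a lazily collected result; with arrow this would be
# something like: arr %>% filter(...) %>% collect()
collected <- tibble::tibble(
  chrm     = c("chr10", "chr10", "chr12"),
  mod_base = c("h", "m", "h")
)

dt <- as.data.table(collected)  # convert once, after collecting
dt[, .N, by = chrm]             # ordinary data.table syntax works again
```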
# arrow
One (of many) benefits of using the `arrow` package is that it allows "lazy" filtering when used with `dplyr`.
```r
arr <- arrow::read_delim_arrow("calls.txt", delim = "\t", as_data_frame = FALSE)
arr
# Table
# 9 rows x 7 columns
# $read_id <string>
# $chrm <string>
# $strand <string>
# $pos <int64>
# $mod_log_prob <double>
# $can_log_prob <double>
# $mod_base <string>
```
This by itself does not impress, but we can build a complete sequence of (limited) dplyr expressions and then, when ready, call collect(), at which point the data is finally pulled from disk into memory.
```r
library(dplyr)
arr %>%
  filter(grepl("d1c2", read_id)) %>%
  collect()
# # A tibble: 2 × 7
#   read_id                              chrm  strand      pos mod_log_prob can_log_prob mod_base
#   <chr>                                <chr> <chr>     <int>        <dbl>        <dbl> <chr>   
# 1 d1c2a9e7-8655-4393-8ab1-c1fa47b0dc5c chr12 +      94372964    -8.81            -8.70 h       
# 2 d1c2a9e7-8655-4393-8ab1-c1fa47b0dc5c chr12 +      94372964    -0.000316        -8.70 m       
arr %>%
  count(chrm) %>%
  collect()
# # A tibble: 2 × 2
#   chrm      n
#   <chr> <int>
# 1 chr12     2
# 2 chr10     7
arr %>%
  group_by(chrm) %>%
  summarize(across(c(mod_log_prob, can_log_prob), ~ max(.))) %>%
  collect()
# # A tibble: 2 × 3
#   chrm  mod_log_prob can_log_prob
#   <chr>        <dbl>        <dbl>
# 1 chr12    -0.000316        -8.70
# 2 chr10    -0.000253        -5.95
```
In each of those examples, the data on disk is not read into memory until collect(), so the data read into R can be small enough. (Note that summaries that result in too-big objects are still going to fail; this does not magically give you more apparent RAM.)
(A full or near-full list of supported dplyr actions can be found here: https://arrow.apache.org/docs/dev/r/reference/acero.html).
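A hedged aside: in recent arrow versions you can also open the file as a lazy Dataset rather than an in-memory Table, so that even the initial open only sniffs the schema. The `format = "tsv"` argument and the tiny stand-in file below are my assumptions for illustration, not part of the original answer:

```r
library(dplyr)

# Tiny stand-in for the real 335 GB file, just to make the sketch runnable
tmp <- tempfile(fileext = ".txt")
writeLines(c("chrm\tmod_base",
             "chr10\th", "chr10\tm", "chr12\th"), tmp)

# Open lazily: only the schema is inferred here, not the whole file
ds <- arrow::open_dataset(tmp, format = "tsv")

ds %>%
  filter(chrm == "chr10") %>%
  count(mod_base) %>%
  collect()
```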
# duckdb
(This can also be done as easily with RSQLite; both have similar functionality.)
```r
library(duckdb)
db <- dbConnect(duckdb::duckdb(), dbdir = "calls.db")
duckdb_read_csv(db, name = "calls", files = "calls.txt", delim = "\t")
dbListFields(db, "calls")
# [1] "read_id"      "chrm"         "strand"       "pos"          "mod_log_prob" "can_log_prob" "mod_base"    
dbGetQuery(db, "select read_id, chrm, mod_log_prob from calls where read_id like 'd1c2%'")
#                                read_id  chrm  mod_log_prob
# 1 d1c2a9e7-8655-4393-8ab1-c1fa47b0dc5c chr12 -8.8149433136
# 2 d1c2a9e7-8655-4393-8ab1-c1fa47b0dc5c chr12 -0.0003158353
```
If you're already familiar with SQL, then this approach may be a good fit.
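Since RSQLite is mentioned as an equally workable option: here is a hypothetical sketch of the same import. RSQLite has no bulk TSV loader like duckdb_read_csv(), so one option is streaming the file in chunks; the use of `readr::read_tsv_chunked()` and the tiny stand-in file are my assumptions, not part of the original answer:

```r
library(DBI)

# Tiny stand-in file so the sketch runs end to end
calls_txt <- tempfile(fileext = ".txt")
writeLines(c("chrm\tmod_base", "chr10\th", "chr10\tm", "chr12\th"), calls_txt)

db <- dbConnect(RSQLite::SQLite(), tempfile(fileext = ".sqlite"))

# Stream the file in chunks, appending each chunk to the table; with the
# real 335 GB file, only one chunk is ever held in memory at a time
readr::read_tsv_chunked(
  calls_txt,
  callback = function(chunk, pos) {
    dbWriteTable(db, "calls", as.data.frame(chunk), append = TRUE)
  },
  chunk_size = 1e6
)

dbGetQuery(db, "select count(*) as n from calls")
```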
Note that you can still use dplyr with this approach as well:
```r
library(dplyr)
calls_table <- tbl(db, "calls")
calls_table
# # Source:   table<calls> [9 x 7]
# # Database: DuckDB 0.7.1 [r2@Linux 6.2.0-20-generic:R 4.2.3/calls.db]
#   read_id                              chrm  strand       pos mod_log_prob can_log_prob mod_base
#   <chr>                                <chr> <chr>      <int>        <dbl>        <dbl> <chr>   
# 1 d1c2a9e7-8655-4393-8ab1-c1fa47b0dc5c chr12 +       94372964    -8.81            -8.70 h       
# 2 d1c2a9e7-8655-4393-8ab1-c1fa47b0dc5c chr12 +       94372964    -0.000316        -8.70 m       
# 3 2109b127-c835-47f3-b215-c238438829b6 chr10 -      118929450    -3.07            -5.95 h       
# 4 2109b127-c835-47f3-b215-c238438829b6 chr10 -      118929450    -0.0505          -5.95 m       
# 5 2109b127-c835-47f3-b215-c238438829b6 chr10 -      118929897    -8.68            -9.39 h       
# 6 2109b127-c835-47f3-b215-c238438829b6 chr10 -      118929897    -0.000253        -9.39 m       
# 7 2109b127-c835-47f3-b215-c238438829b6 chr10 -      118929959    -8.34            -8.96 h       
# 8 2109b127-c835-47f3-b215-c238438829b6 chr10 -      118929959    -0.000367        -8.96 m       
# 9 2109b127-c835-47f3-b215-c238438829b6 chr10 -      118929670    -3.81            -9.16 h       
```
Note that although it looks as if all of the data has been read into memory here, this is just a preview: when you have many rows, only a few are loaded to show what the result could be, and you still eventually need to collect(). Mimicking the arrow examples above:
```r
calls_table %>%
  filter(grepl("d1c2", read_id)) %>%
  collect()
# # A tibble: 2 × 7
#   read_id                              chrm  strand      pos mod_log_prob can_log_prob mod_base
#   <chr>                                <chr> <chr>     <int>        <dbl>        <dbl> <chr>   
# 1 d1c2a9e7-8655-4393-8ab1-c1fa47b0dc5c chr12 +      94372964    -8.81            -8.70 h       
# 2 d1c2a9e7-8655-4393-8ab1-c1fa47b0dc5c chr12 +      94372964    -0.000316        -8.70 m       
```
# Others
There are several other packages that might also be useful here. I don't have experience with them.
(I'll add to this list as others make suggestions. I'm neither endorsing nor shaming any of these packages; I'm limited by my experience and the time available to research this question.)