How to make data.table omit an extra column


Question

I am trying to load a very large tab-separated file with data.table::fread. My input data may have an extra tab at the end of each line, like this:

    # The whole table is tab-separated; only the final tab is shown as \t
    col1 col2 col3\t # a tab here; in fact I have many columns (>3000)
    1 2 9\t
    1 3 3\t
    3 9 6\t

I noticed that fread infers a column name when there is an extra column, but the result is not correct:

    # fread inferred
    V1 col1 col2 col3

I don't want V1 to appear as the first column.

I have tried:

  1. Shifting and resetting the column names. This works, but I now plan to use select to pick columns while reading the file, so I can't do that.

  2. Using data.table::fread(sep='\t', cmd=paste0('pigz -d -c ', input, ' | sed \'s/\t$//\''), select = select_col). This works, but it seems much slower than letting data.table::fread read the gz file directly.
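For anyone wanting to reproduce the problem, a minimal sketch that writes the sample table to a temporary file could look like the following. The variable names and the temporary path are assumptions for illustration and are not part of the original post.

```r
# Hypothetical reproduction of the sample input: every line, including the
# header, ends with an extra tab.
sample_lines <- c(
  "col1\tcol2\tcol3\t",
  "1\t2\t9\t",
  "1\t3\t3\t",
  "3\t9\t6\t"
)
sample_path <- tempfile(fileext = ".tab")  # assumed throwaway location
writeLines(sample_lines, sample_path)

# Confirm that each line on disk really ends with a tab.
ends_with_tab <- grepl("\t$", readLines(sample_path))
print(ends_with_tab)
```

The resulting file can then be passed to data.table::fread(sample_path, sep = "\t") to observe how the trailing tab is handled.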

Answer 1

Score: 1

Starting with this result:

    fread("input.tab", sep="\t")
    # col1 col2 col3 V4
    # <int> <int> <int> <lgcl>
    # 1: 1 2 9 NA
    # 2: 1 3 3 NA
    # 3: 3 9 6 NA

If you know ahead of time how many columns exist, you can simply select= them:

    fread("input.tab", select=1:3)
    # col1 col2 col3
    # <int> <int> <int>
    # 1: 1 2 9
    # 2: 1 3 3
    # 3: 3 9 6

Alternatively, if it's a large file and you want to be flexible, you can do something like this:

    # read in just enough rows that you are confident enough
    # that the last column is all NA, a symptom of the trailing tab
    tmp <- fread("input.tab", sep="\t", nrows=10)
    tmp
    # col1 col2 col3 V4
    # <int> <int> <int> <lgcl>
    # 1: 1 2 9 NA
    # 2: 1 3 3 NA
    # 3: 3 9 6 NA
    # subtract 1 (TRUE) when the last column is entirely NA
    ncols <- ncol(tmp) - all(is.na(tmp[[ncol(tmp)]]))
    ncols
    # [1] 3
    fread("input.tab", select=seq(ncols))
    # col1 col2 col3
    # <int> <int> <int>
    # 1: 1 2 9
    # 2: 1 3 3
    # 3: 3 9 6

If you have predetermined column indices in select_col, then you can use:

    # ncols is the number of real columns, so index ncols itself is still valid
    fread(..., select = select_col[select_col <= ncols])

This naive heuristic can be fooled if the trailing-tab problem does not exist but the last column is legitimately empty/null in the first n rows of the table.
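One way to reduce that risk, sketched here under the assumption that the header line carries the extra tab too (as in the question's sample), is to look at the raw header text instead of relying only on an all-NA column. The helper name `header_has_trailing_tab` is made up for this example.

```r
# Hypothetical helper: TRUE if the header line ends with a tab, i.e. the
# last "column" has no name. gzfile() reads both gzip-compressed and
# uncompressed files.
header_has_trailing_tab <- function(path) {
  con <- gzfile(path, open = "rt")
  on.exit(close(con))
  header <- readLines(con, n = 1L)
  length(header) == 1L && grepl("\t$", header)
}

# A file with the trailing-tab problem.
with_tab <- tempfile()
writeLines(c("col1\tcol2\tcol3\t", "1\t2\t9\t"), with_tab)

# A file without the problem whose last real column happens to be empty.
without_tab <- tempfile()
writeLines(c("col1\tcol2\tcol3", "1\t2\t"), without_tab)

print(header_has_trailing_tab(with_tab))
print(header_has_trailing_tab(without_tab))
```

Combining this check with the all-NA test means a genuinely empty last column under a properly named header is not mistaken for a phantom column.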

Answer 2

Score: 1

Using drop from fread

Find the number of wanted columns through system(), e.g. with awk, assuming the unwanted column is the last one.

    # awk's default whitespace splitting ignores the trailing tab, so NF is
    # the number of real header fields (assuming names contain no spaces)
    NUM <- as.numeric(system("awk '{print NF; exit}' file", intern = TRUE))
    data.table::fread("file", header = TRUE, drop = NUM + 1)
    # col1 col2 col3
    # 1: 1 2 9
    # 2: 1 3 3
    # 3: 3 9 6
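Where awk is not available (for example on Windows), a similar count could be obtained in base R. This is a sketch, not a drop-in replacement: `count_header_fields` is a hypothetical helper, and it assumes a tab-delimited header line.

```r
# Hypothetical helper: number of named fields in the header line.
# strsplit() does not return a trailing empty string, so a header that ends
# with the separator is counted without the phantom last column.
count_header_fields <- function(path, sep = "\t") {
  con <- gzfile(path, open = "rt")  # also reads uncompressed files
  on.exit(close(con))
  header <- readLines(con, n = 1L)
  length(strsplit(header, sep, fixed = TRUE)[[1]])
}

# Throwaway file mimicking the question's sample.
demo_path <- tempfile()
writeLines(c("col1\tcol2\tcol3\t", "1\t2\t9\t"), demo_path)

NUM <- count_header_fields(demo_path)
print(NUM)
```

NUM + 1 could then be passed to drop= in the same way as the awk-based count.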

huangapple
  • Posted on June 12, 2023, 22:11:03
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76457504.html