How to make data.table omit an extra column

Question

I am trying to load a very large tab-separated file with data.table::fread. My input data may have an extra tab after the last column, like this:

# The entire table is tab-separated; only the final tab is shown here as \t
col1    col2    col3\t # a trailing tab here; in fact I have many columns (>3000)
1    2    9\t
1    3    3\t
3    9    6\t

I noticed that fread infers a column name when there is an extra column, but it does so incorrectly:

# fread inferred
V1   col1    col2   col3

I don't want V1 to appear as the first column.

I have tried:

  1. Shifting and resetting the column names. This works, but I now plan to use select to choose columns while reading the file, so I can't do that.

  2. Using data.table::fread(sep='\t', cmd=paste0('pigz -d -c ', input, ' | sed \'s/\t$//\''), select = select_col). This works, but it seems much slower than letting data.table::fread read the gz file directly.

Answer 1

Score: 1

Starting with this result:

fread("input.tab", sep="\t")
#     col1  col2  col3     V4
#    <int> <int> <int> <lgcl>
# 1:     1     2     9     NA
# 2:     1     3     3     NA
# 3:     3     9     6     NA

If you know ahead of time how many columns exist, you can simply pass them to select=:

fread("input.tab", select=1:3)
#     col1  col2  col3
#    <int> <int> <int>
# 1:     1     2     9
# 2:     1     3     3
# 3:     3     9     6

Alternatively, if it's a large file and you want to be flexible, then you can do something like this:

# read in just enough rows that you are confident enough
# that the last column is all NA, a symptom of the trailing tab
tmp <- fread("input.tab", sep="\t", nrows=10)
tmp
#     col1  col2  col3     V4
#    <int> <int> <int> <lgcl>
# 1:     1     2     9     NA
# 2:     1     3     3     NA
# 3:     3     9     6     NA
ncols <- ncol(tmp) - all(is.na(tmp[[ncol(tmp)]]))
ncols
# [1] 3
fread("input.tab", select=seq(ncols))
#     col1  col2  col3
#    <int> <int> <int>
# 1:     1     2     9
# 2:     1     3     3
# 3:     3     9     6

If you have predetermined column indices in select_col, then you can use:

fread(..., select = select_col[select_col <= ncols])

This naive heuristic can be fooled if there is no trailing-tab problem but the last column is legitimately empty/NA in the first n rows that were read.
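
The peek-then-select steps above can be wrapped in a small helper. This is only a sketch under assumptions: count_real_cols and n_peek are hypothetical names that do not appear in the answer, the demo file is made up, and the index filter uses <= so that the last real column can still be selected.

```r
library(data.table)

# Hypothetical helper: count the real columns by peeking at the first rows.
count_real_cols <- function(path, n_peek = 100L) {
  peek <- fread(path, sep = "\t", nrows = n_peek)
  last <- peek[[ncol(peek)]]
  # An all-NA last column is treated as the trailing-tab artifact.
  ncol(peek) - as.integer(all(is.na(last)))
}

# Small demo file with a trailing tab on every line (made-up data).
path <- tempfile(fileext = ".tab")
writeLines(c("col1\tcol2\tcol3\t", "1\t2\t9\t", "1\t3\t3\t", "3\t9\t6\t"), path)

ncols <- count_real_cols(path)
select_col <- c(1L, 3L, 4L)              # index 4 points at the artifact column
keep <- select_col[select_col <= ncols]  # discard indices past the real columns
dt <- fread(path, sep = "\t", select = keep)
print(dt)
```

A larger n_peek lowers the risk that a last column which is genuinely empty only at the top of the file is mistaken for the artifact, at the cost of a slightly slower peek.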

Answer 2

Score: 1

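Using drop from fread

Find the number of real columns through system(), e.g. with awk, assuming the unwanted column is the last one. With awk's default whitespace splitting, the trailing tab does not count as a field, so NF is the number of real columns and the empty extra column sits at position NF + 1 (this assumes that no field contains spaces).

# number of fields in the first line of the file
NUM <- as.numeric(system("awk '{print NF; exit}' file", intern = TRUE))
data.table::fread("file", header = TRUE, drop = NUM + 1)
#    col1 col2 col3
# 1:    1    2    9
# 2:    1    3    3
# 3:    3    9    6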
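
If calling awk is not an option, the header can also be inspected from R itself. The following is a rough sketch, not the answerer's method: it reads only the first line with readLines, relies on strsplit not returning a trailing empty field, and then selects columns by name instead of dropping by position. The demo file is made up, and the variable name header is an assumption for illustration.

```r
library(data.table)

# Small demo file with a trailing tab on every line (made-up data).
path <- tempfile(fileext = ".tab")
writeLines(c("col1\tcol2\tcol3\t", "1\t2\t9\t", "1\t3\t3\t", "3\t9\t6\t"), path)

# Read only the header line and split it on tabs. strsplit() does not
# return a trailing empty string, so the trailing tab adds no element.
header <- strsplit(readLines(path, n = 1L), "\t", fixed = TRUE)[[1]]
header <- header[nzchar(header)]  # guard against any other empty names

# Select the named columns; the unnamed artifact column is never requested.
dt <- fread(path, sep = "\t", header = TRUE, select = header)
print(dt)
```

Selecting by name also fits the case where only a subset is wanted: intersecting header with the desired names before passing it to select keeps the call valid whether or not the trailing tab is present.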

huangapple
  • Published on 2023-06-12 22:11:03
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76457504.html