How to make data.table::fread omit an extra trailing column
Question
I am trying to load a very large tab-separated file with data.table::fread. My input data may have an extra tab at the end of each line, after the last column, like this:
# The entire table is tab-separated; only the final tab is shown as \t
col1 col2 col3\t # a tab here; in fact I have many columns (>3000)
1 2 9\t
1 3 3\t
3 9 6\t
I noticed that fread infers a column name when there is an extra column, but the result is not correct:
# fread inferred
V1 col1 col2 col3
I don't want V1 to appear as the first column.
I have tried:
- Shifting and resetting the column names. This works, but I now plan to use select to choose columns while reading the file, so I can't do that.
- Using data.table::fread(sep='\t', cmd=paste0('pigz -d -c ', input, ' | sed \'s/\t$//\''), select = select_col). This works, but it seems much slower than letting data.table::fread process the gz file directly.
Answer 1
Score: 1
Starting with this result:
fread("input.tab", sep="\t")
# col1 col2 col3 V4
# <int> <int> <int> <lgcl>
# 1: 1 2 9 NA
# 2: 1 3 3 NA
# 3: 3 9 6 NA
If you know ahead of time how many columns exist, you can simply select= them:
fread("input.tab", select=1:3)
# col1 col2 col3
# <int> <int> <int>
# 1: 1 2 9
# 2: 1 3 3
# 3: 3 9 6
Alternatively, if it's a large file and you want to be flexible, then you can do something like this:
# read in just enough rows so that you are confident-enough
# that the last column is all NA, a symptom of the trailing-tab
tmp <- fread("input.tab", sep="\t", nrows=10)
tmp
# col1 col2 col3 V4
# <int> <int> <int> <lgcl>
# 1: 1 2 9 NA
# 2: 1 3 3 NA
# 3: 3 9 6 NA
ncols <- ncol(tmp) - all(is.na(tmp[[ncol(tmp)]]))
ncols
# [1] 3
fread("input.tab", select=seq(ncols))
# col1 col2 col3
# <int> <int> <int>
# 1: 1 2 9
# 2: 1 3 3
# 3: 3 9 6
If you have predetermined column indices in select_col, then you can use
fread(..., select = select_col[select_col <= ncols])
This naive heuristic can be fooled if the trailing-tab problem does not exist but the last column is legitimately empty/null in the first n rows of the table.
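The probe-then-select idea above can be wrapped in a small function. This is only a sketch: fread_drop_trailing, probe_rows and the demo file are hypothetical names introduced for illustration and are not part of data.table. The filter uses <= so that the last real column stays selectable.

```r
library(data.table)

# Hypothetical helper (not part of data.table): read a tab-separated file and
# leave out the last column when it looks like a trailing-tab artifact.
fread_drop_trailing <- function(path, select_col = NULL, probe_rows = 100) {
  # Read only the first few rows; this stays cheap even for very large files.
  probe <- fread(path, sep = "\t", nrows = probe_rows)
  # An all-NA last column in the probe is treated as the trailing-tab symptom.
  last_is_na <- all(is.na(probe[[ncol(probe)]]))
  ncols <- ncol(probe) - last_is_na
  # Keep every real column, or only the requested indices that are in range.
  keep <- if (is.null(select_col)) seq_len(ncols) else select_col[select_col <= ncols]
  fread(path, sep = "\t", select = keep)
}

# Small demo file with a trailing tab on every line, as in the question.
demo <- tempfile(fileext = ".tab")
writeLines(c("col1\tcol2\tcol3\t", "1\t2\t9\t", "1\t3\t3\t", "3\t9\t6\t"), demo)

res <- fread_drop_trailing(demo)
sub <- fread_drop_trailing(demo, select_col = c(1, 3, 4))
```

Here res should contain only col1, col2 and col3, and sub only col1 and col3, since index 4 falls outside the real columns. The same caveat applies as for the heuristic itself: a last column that is legitimately empty in the probed rows would also be dropped.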
Answer 2
Score: 1
Use the drop argument of fread.
Find the desired number of columns through system() with, e.g., awk, assuming the unwanted column is the last one.
# awk splits on whitespace by default, so the trailing tab does not count as a field
NUM <- as.numeric(system('awk \'{print NF; exit}\' file', intern=TRUE))
data.table::fread("file", header=TRUE, drop=NUM + 1)
col1 col2 col3
1: 1 2 9
2: 1 3 3
3: 3 9 6
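If awk is not available, the header can also be counted from within R. The following is a minimal sketch, assuming that no real column has an empty name; count_header_cols and the demo file are hypothetical names used only for illustration. It passes select = seq_len(n) instead of drop, so it does not refer to a column that may not exist when the file has no trailing tab.

```r
library(data.table)

# Hypothetical helper: count the non-empty header fields of a tab-separated file.
count_header_cols <- function(path) {
  con <- gzfile(path, open = "rt")  # gzfile() can also read uncompressed files
  on.exit(close(con))
  header <- readLines(con, n = 1)
  fields <- strsplit(header, "\t", fixed = TRUE)[[1]]
  # Assumption: real columns always have a non-empty name.
  sum(nzchar(fields))
}

# Small demo file with a trailing tab on every line, as in the question.
demo <- tempfile(fileext = ".tab")
writeLines(c("col1\tcol2\tcol3\t", "1\t2\t9\t", "1\t3\t3\t", "3\t9\t6\t"), demo)

n_real <- count_header_cols(demo)
res <- fread(demo, sep = "\t", header = TRUE, select = seq_len(n_real))
```

Only the first line of the file is read to get the count, so the extra pass should stay cheap for large inputs.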