2023-06-12 22:11:03

How to make data.table omit an extra column

Question

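I am trying to load a very large tab-separated file with data.table::fread. My input data may have an extra tab after the last column, like this:

# The whole table is tab-separated; only the final tab is shown as \t
col1    col2    col3\t # a tab here; in fact I have many columns (>3000)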
1    2    9\t
1    3    3\t
3    9    6\t

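I noticed that fread infers a column name when there is an extra column, but the result is not correct:

# fread infers
V1   col1    col2   col3

I do not want V1 as the first column.

What I have tried:

Shifting and resetting the column names. This works, but I now plan to use select to choose columns while reading the file, so I cannot do that.
Using data.table::fread(sep='\t', cmd=paste0('pigz -d -c ', input, ' | sed \'s/\t$//\''), select = select_col). This works, but it seems much slower than letting data.table::fread read the gz file directly.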

Answer 1

Score: 1

Starting with this result:

fread("input.tab", sep="\t")
#     col1  col2  col3     V4
#    <int> <int> <int> <lgcl>
# 1:     1     2     9     NA
# 2:     1     3     3     NA
# 3:     3     9     6     NA

If you know ahead of time how many columns exist, you can simply pass them to select=:

fread("input.tab", select=1:3)
#     col1  col2  col3
#    <int> <int> <int>
# 1:     1     2     9
# 2:     1     3     3
# 3:     3     9     6

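Alternatively, if it is a large file and you want to be more flexible, you can do something like this:

# read just enough rows to be reasonably confident that the last
# column is all NA, which is a symptom of the trailing tab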
tmp <- fread("input.tab", sep="\t", nrows=10)
tmp
#     col1  col2  col3     V4
#    <int> <int> <int> <lgcl>
# 1:     1     2     9     NA
# 2:     1     3     3     NA
# 3:     3     9     6     NA
ncols <- ncol(tmp) - all(is.na(tmp[[ncol(tmp)]]))
ncols
# [1] 3
fread("input.tab", select=seq(ncols))
#     col1  col2  col3
#    <int> <int> <int>
# 1:     1     2     9
# 2:     1     3     3
# 3:     3     9     6

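If you have predetermined column indices in select_col, then you can use:

fread(..., select = select_col[select_col <= ncols])

This naive heuristic can be fooled if the trailing-tab problem does not exist but the last column is legitimately empty/NA in the first n rows.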
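The heuristic above can be wrapped in a small helper. The following is only a sketch under stated assumptions: `count_real_cols` and `n_peek` are hypothetical names (not part of data.table), and the extra name check assumes that fread labels the unnamed trailing column `V<k>`, as in the `V4` output shown earlier. Requiring both an all-NA column and an auto-generated name makes the check slightly harder to fool by a genuinely empty, properly named last column.

```r
library(data.table)

# Hypothetical helper (not part of data.table): estimate the number of
# real columns by peeking at the first `n_peek` rows only.
count_real_cols <- function(path, n_peek = 100L, sep = "\t") {
  peek <- fread(path, sep = sep, nrows = n_peek)
  last <- ncol(peek)
  # Treat the last column as an artifact of a trailing separator only if
  # it is all NA AND carries an auto-generated name such as "V4".
  is_phantom <- all(is.na(peek[[last]])) &&
    identical(names(peek)[last], paste0("V", last))
  last - is_phantom
}

# Demo on a small temporary file with a trailing tab on every line.
tmp_file <- tempfile(fileext = ".tab")
writeLines(c("col1\tcol2\tcol3\t", "1\t2\t9\t", "1\t3\t3\t", "3\t9\t6\t"),
           tmp_file)

n_real <- count_real_cols(tmp_file)

# Assumed user-supplied indices; index 4 points at the phantom column.
select_col <- c(1L, 3L, 4L)
keep <- select_col[select_col <= n_real]

dt <- fread(tmp_file, sep = "\t", select = keep)
print(dt)
```

Only the small peek is read twice, so the extra cost on a large file should be minor compared with piping the whole file through sed.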

Answer 2

Score: 1

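Use the drop argument of fread.

Find the number of wanted columns with a system() call, e.g. via awk, assuming the unwanted column is the last one. awk splits on whitespace by default, so the trailing tab does not count as a field.

# number of fields in the header line; awk exits after the first line
NUM <- as.numeric(system("awk '{print NF; exit}' file", intern = TRUE))

data.table::fread("file", header = TRUE, drop = NUM + 1)
#    col1 col2 col3
# 1:    1    2    9
# 2:    1    3    3
# 3:    3    9    6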
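Where awk is not available, the same header inspection can be approximated in base R. This is a minimal sketch, not the answerer's method: `header_info` is a hypothetical helper, and it assumes the first line of the file is the header and that no column name contains a tab. It relies on `strsplit` not returning a trailing empty field, and it only passes `drop` when the header really ends with the separator.

```r
library(data.table)

# Hypothetical helper: look at the header line only.
header_info <- function(path, sep = "\t") {
  header <- readLines(path, n = 1L)
  # strsplit() does not return a trailing empty string, so a trailing
  # separator does not inflate the field count.
  fields <- strsplit(header, sep, fixed = TRUE)[[1]]
  list(n_fields = length(fields), trailing_sep = endsWith(header, sep))
}

# Demo on a small temporary file with a trailing tab on every line.
tmp_file <- tempfile(fileext = ".tab")
writeLines(c("col1\tcol2\tcol3\t", "1\t2\t9\t", "1\t3\t3\t", "3\t9\t6\t"),
           tmp_file)

info <- header_info(tmp_file)

# Drop the phantom column only if the header ends with the separator.
extra <- if (info$trailing_sep) info$n_fields + 1L else NULL

dt <- fread(tmp_file, sep = "\t", header = TRUE, drop = extra)
print(dt)
```

Because `drop` stays `NULL` for files without a trailing tab, the same call can be used for both well-formed and malformed inputs.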
