Multiple skips when reading a csv file

Question

I have a very large file from which I only need the first element of rows 1, 100001, and 200001. I currently extract them like this:

x1 <- read.csv(filename, nrows = 1, header = FALSE)[1, 1]
x2 <- read.csv(filename, skip = 100000, nrows = 1, header = FALSE)[1, 1]
x3 <- read.csv(filename, skip = 200000, nrows = 1, header = FALSE)[1, 1]

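I don't know how reading works internally, but I assume this forces some unnecessary reading and skipping.

I wonder whether I could continue skipping after reading x2 instead of starting at the beginning of the file again. That would save some time.

If I can avoid it, I do not want to hold the whole file (or the whole first column) in memory at any point.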

Answer 1

Score: 3

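Here is a way with scan. It assumes you are reading numeric data; if not, include

what = character()

in the calls to scan. The test file is at the end.

Note that I'm skipping 10 lines, not 100000 as in the question.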

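fl <- "~/Temp/so.csv"

sep <- ","
skip <- 10L  # number of lines between the rows to keep

vec <- NULL  # collected first elements
skp <- 0L    # lines to skip before the next read
# read the first element of the first line
x <- scan(fl, sep = sep, n = 1L, nlines = 1L)
# scan returns a zero-length vector once skp runs past the end of the file
while(length(x) > 0L) {
  vec <- c(vec, x)
  skp <- skp + skip
  x <- scan(fl, sep = sep, n = 1L, skip = skp, nlines = 1L)
}
vec
#> [1]  1 11 21 31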

Created on 2023-06-08 with reprex v2.0.2
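
The loop above hands scan a file name, so each call reopens the file and skips from the top again. As a minimal sketch (not part of the original answer), assuming plain base R and that it is acceptable to hold one block of skipped lines in memory briefly, the file can instead be opened once as a connection so that every read continues where the previous one stopped. The function name first_field_every and its arguments are hypothetical.

```r
# Hypothetical helper (not from the original answer): return the first field
# of lines 1, 1 + step, 1 + 2 * step, ... reading the file in a single pass.
first_field_every <- function(path, step, sep = ",") {
  con <- file(path, open = "r")        # open once; the read position persists
  on.exit(close(con), add = TRUE)
  out <- character(0)
  repeat {
    line <- readLines(con, n = 1L)     # the line we want to keep
    if (length(line) == 0L) break      # end of file reached
    out <- c(out, strsplit(line, sep, fixed = TRUE)[[1]][1])
    # Read and discard the next step - 1 lines; no rewinding to the start.
    if (step > 1L) invisible(readLines(con, n = step - 1L))
  }
  out
}

# Demo on a temporary copy of the 40-row test file.
tmp <- tempfile(fileext = ".csv")
writeLines(sprintf("%d,%s,%d", 1:40, rep(c("a", "b", "c"), length.out = 40), 21:60), tmp)
res <- as.numeric(first_field_every(tmp, step = 10L))
res
#> [1]  1 11 21 31
```

The fields come back as character, hence the as.numeric() call in the demo. For the question's case, step would be 100000L; the discarded block is then 99999 lines at a time rather than the whole file.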


Data

These are the contents of the test file (40 rows).

1,a,21
2,b,22
3,c,23
4,a,24
5,b,25
6,c,26
7,a,27
8,b,28
9,c,29
10,a,30
11,b,31
12,c,32
13,a,33
14,b,34
15,c,35
16,a,36
17,b,37
18,c,38
19,a,39
20,b,40
21,c,41
22,a,42
23,b,43
24,c,44
25,a,45
26,b,46
27,c,47
28,a,48
29,b,49
30,c,50
31,a,51
32,b,52
33,c,53
34,a,54
35,b,55
36,c,56
37,a,57
38,b,58
39,c,59
40,a,60
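
To illustrate the answer's remark about what = character(), here is a small sketch of reading a first field that is not numeric. It uses made-up rows (the ids below are not from the test file) passed through scan's text argument, so no file is needed.

```r
# Made-up rows whose first column is not numeric (illustration only).
txt <- c("id1,a,21", "id2,b,22", "id3,c,23")

# Skip one line, then read a single field from the next line as character.
second_id <- scan(text = txt, what = character(), sep = ",",
                  n = 1L, skip = 1L, nlines = 1L, quiet = TRUE)
second_id
#> [1] "id2"
```

Without what = character(), scan would try to parse the field as a number and stop with an error.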

huangapple
  • Posted on June 8, 2023, 22:45:21
  • Please keep this link when reposting: https://go.coder-hub.com/76433056.html