Multiple skips when reading a csv file
Question
I have a very large file from which I only need the first element of rows 1, 100001, and 200001. I currently extract them like this:
x1 <- read.csv(filename, nrows = 1, header = FALSE)[1, 1]
x2 <- read.csv(filename, skip = 100000, nrows = 1, header = FALSE)[1, 1]
x3 <- read.csv(filename, skip = 200000, nrows = 1, header = FALSE)[1, 1]
I don't know how reading works, but I assume this forces some unnecessary reading and skipping.
I wonder whether I could continue skipping after reading x2 instead of starting at the beginning of the file again. That would save some time.
I do not want to hold the whole file (or the whole first column) in memory at any point if I can avoid it.
Answer 1
Score: 3
Here is a way with scan. It assumes you are reading numeric data; if not, include what = character() in the calls to scan. The test file is at the end.
Note that I'm skipping 10 lines at a time here, not 100000 as in the question.
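fl <- "~/Temp/so.csv"
sep <- ","
skip <- 10L   # read every 10th line
vec <- NULL   # collected first fields
skp <- 0L     # number of lines to skip before the next read

# first field of the first line
x <- scan(fl, sep = sep, n = 1L, nlines = 1L)
# scan returns a zero-length vector once skp reaches the end of the file
while(length(x) > 0L) {
  vec <- c(vec, x)
  skp <- skp + skip
  # each call is given the file name, so it skips skp lines from the start again
  x <- scan(fl, sep = sep, n = 1L, skip = skp, nlines = 1L)
}
vec
#> [1] 1 11 21 31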
Created on 2023-06-08 with reprex v2.0.2
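The loop above passes a file name to scan, so every call opens the file and skips from the first line again. If the goal is to keep skipping from the current position, one possible variant is to open a single connection and read from it repeatedly. The following is only a sketch under that assumption, not part of the original answer: the helper name read_every_nth_first_field and its arguments are hypothetical, it uses readLines rather than scan, and it returns the fields as character strings. Each skipped chunk of step - 1 lines is held in memory briefly and then discarded, so the whole file is never loaded at once.

```r
# Hypothetical helper (an illustration, not from the original answer):
# returns the first field of lines 1, 1 + step, 1 + 2 * step, ...
# while reading the file only once through a single open connection.
read_every_nth_first_field <- function(path, step, sep = ",") {
  con <- file(path, open = "r")
  on.exit(close(con))            # always release the connection
  out <- character()
  repeat {
    # read the next wanted line from the current position
    line <- readLines(con, n = 1L)
    if (length(line) == 0L) break          # end of file reached
    # keep only the first field of that line
    out <- c(out, strsplit(line, sep, fixed = TRUE)[[1L]][1L])
    # read and discard the step - 1 lines in between
    if (step > 1L) {
      skipped <- readLines(con, n = step - 1L)
      if (length(skipped) < step - 1L) break   # file ended while skipping
    }
  }
  out
}
```

With the 40-row test file, as.numeric(read_every_nth_first_field(fl, 10L)) should give the same values as vec. For the question's case the call would use step = 100000L.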
Data
These are the contents of the test file (40 rows).
1,a,21
2,b,22
3,c,23
4,a,24
5,b,25
6,c,26
7,a,27
8,b,28
9,c,29
10,a,30
11,b,31
12,c,32
13,a,33
14,b,34
15,c,35
16,a,36
17,b,37
18,c,38
19,a,39
20,b,40
21,c,41
22,a,42
23,b,43
24,c,44
25,a,45
26,b,46
27,c,47
28,a,48
29,b,49
30,c,50
31,a,51
32,b,52
33,c,53
34,a,54
35,b,55
36,c,56
37,a,57
38,b,58
39,c,59
40,a,60