2023-06-08 22:45:21

Multiple skips when reading a csv file

Question

I have a very large file of which I only need the first element of rows 1, 100001, 200001, which I extract like this:

x1 <- read.csv(filename, nrows = 1, header = FALSE)[1, 1]
x2 <- read.csv(filename, skip = 100000, nrows = 1, header = FALSE)[1, 1]
x3 <- read.csv(filename, skip = 200000, nrows = 1, header = FALSE)[1, 1]

I don't know how reading works internally, but I assume this forces some unnecessary reading and skipping, because every call starts again at the top of the file.

I wonder if I could continue skipping after reading x2 instead of starting at the beginning of the file again. That would save some time.

If I can avoid it, I do not want to have the whole file (or the whole first column) in memory at any point.

Answer 1

Score: 3

Here is a way with scan. It assumes you are reading numeric data; if not, include

what = character()

in the calls to scan. The test file is at the end.
Note that I'm skipping 10 lines, not 100000 as in the question.

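fl <- "~/Temp/so.csv"
sep <- ","
skip <- 10L   # read every 10th line: lines 1, 11, 21, ...
vec <- NULL   # accumulates the first field of each sampled line
skp <- 0L     # number of lines to skip before the next read
# first field of line 1
x <- scan(fl, sep = sep, n = 1L, nlines = 1L)
# scan returns a zero-length vector once skp runs past the end of the file
while (length(x) > 0L) {
  vec <- c(vec, x)
  skp <- skp + skip
  x <- scan(fl, sep = sep, n = 1L, skip = skp, nlines = 1L)
}
vec
#> [1]  1 11 21 31

Created on 2023-06-08 with reprex v2.0.2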
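
Because scan is given a file name here, each call opens the file afresh and skips from the top, which is the repeated work the question asks about. A possible variation, not part of the original answer, is to keep one connection open so every read continues from the current position. The sketch below is only an illustration under stated assumptions: the helper name read_every_nth_first and its chunk argument are hypothetical, it swaps scan for readLines, and skipped lines are read and discarded in bounded chunks so the whole file is never held in memory.

```r
# Hypothetical helper (not from the original answer): first field of
# lines 1, 1 + step, 1 + 2 * step, ... using a single open connection.
read_every_nth_first <- function(path, step, sep = ",", chunk = 10000L) {
  con <- file(path, open = "r")
  on.exit(close(con))          # always release the connection
  out <- character()
  repeat {
    line <- readLines(con, n = 1L)   # continues from the current position
    if (length(line) == 0L) break    # end of file
    out <- c(out, strsplit(line, sep, fixed = TRUE)[[1L]][1L])
    # Discard the next step - 1 lines, at most `chunk` lines at a time.
    remaining <- step - 1L
    while (remaining > 0L) {
      got <- length(readLines(con, n = min(remaining, chunk)))
      if (got == 0L) break           # ran out of lines while skipping
      remaining <- remaining - got
    }
  }
  out
}

# Usage on a throwaway 40-line file shaped like the test data
tmp <- tempfile(fileext = ".csv")
writeLines(paste(1:40, rep_len(c("a", "b", "c"), 40L), 21:60, sep = ","), tmp)
res <- as.numeric(read_every_nth_first(tmp, step = 10L))
res
#> [1]  1 11 21 31
```

The fields come back as character strings, so the conversion with as.numeric is left to the caller. For the file in the question, step would be 100000.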

Data

This is the contents of the test file (40 rows).

1,a,21
2,b,22
3,c,23
4,a,24
5,b,25
6,c,26
7,a,27
8,b,28
9,c,29
10,a,30
11,b,31
12,c,32
13,a,33
14,b,34
15,c,35
16,a,36
17,b,37
18,c,38
19,a,39
20,b,40
21,c,41
22,a,42
23,b,43
24,c,44
25,a,45
26,b,46
27,c,47
28,a,48
29,b,49
30,c,50
31,a,51
32,b,52
33,c,53
34,a,54
35,b,55
36,c,56
37,a,57
38,b,58
39,c,59
40,a,60
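
The listing above follows a regular pattern, so it can probably be regenerated instead of pasted by hand. This is a small sketch, assuming a temporary file is acceptable in place of the "~/Temp/so.csv" path used in the answer; the variable name rows is just for illustration.

```r
# Rebuild the 40-row test file: an index, a cycling a/b/c label, index + 20.
fl <- tempfile(fileext = ".csv")   # assumption: any writable path works
rows <- paste(1:40, rep_len(c("a", "b", "c"), 40L), 21:60, sep = ",")
writeLines(rows, fl)
```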
