September 16, 2021, 16:15:39 · go

Restart reading csv file from a defined position

Question

I need to process a big file in Go, so I don't want to load all the rows of my CSV file at once; instead I process them in groups.

To resume from where I left off, I currently use a for loop that skips the rows already read:

for idx := 0; idx < startAt; idx++ {
	//Read rows and do nothing with the returned value
	if _, readErr := reader.Read(); readErr != nil {
		if readErr == io.EOF {
			//File end -> OK
			isEOF = true
			break
		} else {
			//Read failed
			return nil, errors.New(DATA_READ_ERROR)
		}
	}
}

This is a pretty simple solution; however, it is obviously inefficient. The further into the file I get, the longer it takes to reach the next group, because every pass re-reads all the rows before it.

To reduce this time I tried different alternatives, but none of them worked properly: the reader ends up reading rows from the wrong offset.

For instance, I tried returning the current position of the file pointer (using file.Seek(0, io.SeekCurrent)) and then, on the next iteration, moving the pointer with file.Seek(oldPosition, io.SeekStart), but it didn't work as expected.

Is there a way to avoid the loop above and improve the reading time when restarting from where I left off?

Update

The way I used file.Seek is very simple:

//compute data
func computeData(nrows int, startAt int64) int64 {
	//Open file
	if file, openErr := os.Open(config.DataSrcFile); openErr == nil {
		defer file.Close()
		//Create a reader
		reader := csv.NewReader(file)
		//Position the file pointer to the start point
		file.Seek(startAt, io.SeekStart)
		//Read n rows
		for idx := 0; idx < nrows; idx++ {
			if csvLine, readErr := reader.Read(); readErr == nil {
				//Do stuff...
			} else {
				//Error registered reading csv
				if readErr == io.EOF {
					//File end -> OK
					break
				} else {
					//Return error
				}
			}
		}
		//Return bytes read (actually simplified, in real case error is not
		// ignored)
		bytesRead, _ := file.Seek(0, io.SeekCurrent)
		return bytesRead
	}
	//Open failed
	return 0
}

func main() {
	var startAt int64 = 0
	nrows := 1000
	for !isMyConditionMatched {
		bytesRead := computeData(nrows, startAt)
		startAt += bytesRead
	}
}

Answer 1

Score: 1

The problem here is that encoding/csv internally uses a buffered reader, so when you call file.Seek(0, io.SeekCurrent) you get the position of the underlying file, which is already past the rows you consumed: some data was read into the buffer but has not been used yet.
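The read-ahead can be observed with a small experiment. The sketch below is only an illustration: positionsAfterFirstRow is a hypothetical helper, the temporary file and its contents are made up, and the exact distance the file pointer runs ahead depends on the buffer size used internally, which is an implementation detail.

```go
package main

import (
	"encoding/csv"
	"fmt"
	"io"
	"os"
	"strings"
)

// positionsAfterFirstRow (hypothetical helper) writes data to a temporary
// file, reads exactly one record through csv.Reader, and returns the byte
// length of the first line together with the offset of the OS file pointer.
func positionsAfterFirstRow(data string) (int64, int64, error) {
	f, err := os.CreateTemp("", "demo-*.csv")
	if err != nil {
		return 0, 0, fmt.Errorf("create temp file: %w", err)
	}
	defer os.Remove(f.Name())
	defer f.Close()

	if _, err := f.WriteString(data); err != nil {
		return 0, 0, fmt.Errorf("write: %w", err)
	}
	if _, err := f.Seek(0, io.SeekStart); err != nil {
		return 0, 0, fmt.Errorf("rewind: %w", err)
	}

	// Consume a single record only.
	if _, err := csv.NewReader(f).Read(); err != nil {
		return 0, 0, fmt.Errorf("read first row: %w", err)
	}

	// Ask the file itself where it is now.
	filePos, err := f.Seek(0, io.SeekCurrent)
	if err != nil {
		return 0, 0, fmt.Errorf("query position: %w", err)
	}
	rowLen := int64(strings.Index(data, "\n") + 1)
	return rowLen, filePos, nil
}

func main() {
	data := strings.Repeat("a,b,c\n", 100)
	rowLen, filePos, err := positionsAfterFirstRow(data)
	if err != nil {
		panic(err)
	}
	fmt.Printf("first row ends at byte %d, file pointer is at byte %d\n", rowLen, filePos)
}
```

The two numbers differ because the reader fetched a whole chunk of the file to parse a single row, which is why seeking back to the reported file position skips rows.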

There are two possible solutions:

One is to use a lower-level implementation that lets you control exactly where you are.
The other is to find out how much data is buffered.

I'll show you an implementation of the second option (note that this relies on some knowledge of the internal workings of the encoding/csv package and may stop working if they change).

First, create a new buffered reader before creating the CSV reader:

		//Position the file pointer to the start point
		file.Seek(startAt, io.SeekStart)
		bReader := bufio.NewReader(file)

		//Create a reader
		reader := csv.NewReader(bReader)

This gives you access to the buffer. You can use this reader as you already do, but at the end you compute the final position in the file with:

		bufSize := bReader.Buffered()
		filePos, err := file.Seek(0, io.SeekCurrent)
		if err != nil {
			//Return error
		}
		return filePos - int64(bufSize)

This takes the current position in the file and subtracts the number of bytes that were read into the buffer but not yet consumed.

Note that the value returned is the position in the file, not the number of bytes read in this call to the function, so the caller should assign it to startAt rather than add it.
