Restart reading csv file from a defined position

Question

I need to process a big file in Go, so I don't want to load all the rows of my csv file at once; instead, I process them in groups.

To resume processing from where I left off, I currently use a for loop to skip the rows already read:

for idx := 0; idx < startAt; idx++ {
	//Read rows and do nothing with the returned value
	if _, readErr := reader.Read(); readErr != nil {
		if readErr == io.EOF {
			//File end -> OK
			isEOF = true
			break
		} else {
			//Read failed
			return nil, errors.New(DATA_READ_ERROR)
		}
	}
}

This is a pretty simple solution; however, it is obviously inefficient. Every call reads all the earlier rows again, so the time needed to reach each new group keeps growing as I move through the file.

To reduce this time I tried different alternatives, but none of them works properly: they all make the reader fail (rows are not read from the right position).

For instance, I tried to return the current position of the file pointer (using file.Seek(0, io.SeekCurrent)) and then, on the next iteration, to move the pointer with file.Seek(oldPosition, io.SeekStart), but it didn't work as expected.

Is there a way to avoid the loop above and improve the reading time when restarting from where I left off?

Update

The way I used file.Seek is very simple:

//compute data

func computeData(nrows int, startAt int64) int64 {
	//Open file
	if file, openErr := os.Open(config.DataSrcFile); openErr == nil {
		defer file.Close()
		//Create a reader
		reader := csv.NewReader(file)
		//Position the file pointer to the start point
		file.Seek(startAt, io.SeekStart)
		//Read n rows
		for idx := 0; idx < nrows && !isEOF; idx++ {
			if csvLine, readErr := reader.Read(); readErr == nil {
				//Do stuff...
			} else {
				//Error registered reading csv
				if readErr == io.EOF {
					//File end -> OK
					break
				} else {
					//Return error
				}
			}
		}
		//Return bytes read (actually simplified, in real case error is not
		// ignored)
		bytesRead, _ := file.Seek(0, io.SeekCurrent)
		return bytesRead
	}
	//Open failed (error handling omitted)
	return 0
}
func main() {
	var startAt int64 = 0
	nrows := 1000
	for !isMyConditionMatched {
		bytesRead := computeData(nrows, startAt)
		startAt += bytesRead
	}
}

Answer 1

Score: 1

The problem here is that encoding/csv internally uses a buffered reader, so when you execute file.Seek(0, io.SeekCurrent) you get the position of the underlying file, which is ahead of the last row you processed: some data has already been read into the buffer but not yet consumed.

There are two possible solutions:

  • one is to use a lower-level implementation that lets you control exactly where you are;
  • the other is to find out how much buffered data there is.

I'll show you an implementation of the second option (note that this relies on some knowledge of the internal workings of the encoding/csv package and may stop working if that changes).

First, create a new buffered reader (bufio.Reader) before creating the csv reader:

		//Position the file pointer to the start point
		file.Seek(startAt, io.SeekStart)
		bReader := bufio.NewReader(file)

		//Create a reader
		reader := csv.NewReader(bReader)

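This gives you access to the buffer: bufio.NewReader returns its argument unchanged when it is already a *bufio.Reader with a large enough buffer, so the csv reader reads through bReader instead of wrapping the file in a hidden buffer of its own. You can use this reader as you already do, but at the end you calculate the final position in the file by doing: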

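		//Bytes read from the file but not yet consumed by the csv reader
		bufSize := bReader.Buffered()
		filePos, err := file.Seek(0, io.SeekCurrent)
		if err != nil {
			//Handle the error (omitted here)
		}
		return filePos - int64(bufSize)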

This takes the current position in the file and subtracts the number of bytes that were read into the buffer but not yet consumed.

Note that the value returned is the absolute position in the file, not the number of bytes read in this call, so the caller should assign it to startAt (startAt = ...) rather than add it (startAt += ...).

huangapple
  • Posted on September 16, 2021, 16:15:39
  • When reposting, please keep the link to this article: https://go.coder-hub.com/69204739.html