Contents of large file getting corrupted while reading records sequentially

Question

I have a file with around 85 million JSON records. The file size is around 110 GB. I want to read from this file sequentially, in batches of 1 million records. I am reading the file line by line using a scanner and appending the records to a slice until the batch holds 1 million. Here is the gist of the code:

var rawBatch []string
batchSize := 1000000

file, err := os.Open(filePath)
if err != nil {
	// error handling
}

scanner := bufio.NewScanner(file)

for scanner.Scan() {
	// string(...) copies the bytes, so rec stays valid after the next Scan call
	rec := string(scanner.Bytes())
	rawBatch = append(rawBatch, rec)

	if len(rawBatch) == batchSize {
		for i := 0; i < batchSize; i++ {
			var tRec parsers.TRecord
			err := json.Unmarshal([]byte(rawBatch[i]), &tRec)
			if err != nil {
				// Error thrown here
			}
		}
		// process
		rawBatch = nil
	}
}
file.Close()

Sample of correct record:

type TRecord struct {
	Key1 string `json:"key1"`
	Key2 string `json:"key2"`
}

{"key1":"15","key2":"21"}

The issue I am facing is that, while reading these records, some of them get corrupted, for example a colon changes to a semicolon, or a double quote changes to #. I get this error:

Unable to load Record: Unable to load record in:
 {"key1":#15","key2":"21"}
invalid character '#' looking for beginning of value

Some observations:

  1. Once we start reading, the contents of the file itself get corrupted.
  2. For every batch of 1 million, I saw 1 (or max 2) records getting corrupted. Out of 84 million records, a total of 95 records were corrupted.
  3. My code works for a file of around 42 GB (23 million records). With a larger data file, it behaves erroneously.
  4. ':' are changing to ';'. Double quotes are changing to '#'. Space is changing to '!'. All these combinations, in their binary representations, have a single bit difference. Any chance that we might have some accidental bit manipulation?
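
The single-bit claim in observation 4 can be checked directly. The following is a minimal sketch, not part of the original program; `flippedBits` is a hypothetical helper name introduced only for illustration. It XORs each original character with its corrupted counterpart and counts the bits that differ.

```go
package main

import (
	"fmt"
	"math/bits"
)

// flippedBits returns how many bit positions differ between two bytes.
// (Hypothetical helper, introduced here for illustration only.)
func flippedBits(a, b byte) int {
	return bits.OnesCount8(a ^ b)
}

func main() {
	// Character pairs reported in the observations: original -> corrupted.
	pairs := [][2]byte{{':', ';'}, {'"', '#'}, {' ', '!'}}
	for _, p := range pairs {
		fmt.Printf("%q (%08b) -> %q (%08b): %d bit(s) differ, mask %08b\n",
			p[0], p[0], p[1], p[1], flippedBits(p[0], p[1]), p[0]^p[1])
	}
}
```

In all three pairs only the lowest bit differs, and it flips from 0 to 1 (0x3A to 0x3B, 0x22 to 0x23, 0x20 to 0x21).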

Any ideas on why this is happening? And how can I fix it?

Details:

  • Go version used: go1.15.6 darwin/amd64
  • Hardware details: Debian GNU/Linux 9.12 (stretch), 224 GB RAM, 896 GB hard disk

Answer 1

Score: 3

As suggested by @icza in the comments,

> That occasional, very rare 1 bit change suggests hardware failure (memory, processor cache, hard disk). I do recommend to test it on another computer.

I tested my code on some other machines, and it runs perfectly fine there. It looks like this occasional, rare bit change, caused by some hardware failure, was the source of the issue.

huangapple
  • Posted on 2021-02-25 21:52:11
  • Link: https://go.coder-hub.com/66369874.html