Buffered golang channel losing data

Question

I am trying to parse a huge Wiktionary dump using a goroutine, and am encountering a strange bug where the channel that the goroutine is reading from seems to be losing and corrupting data every time the channel blocks.

func main() {
	inFile, err := os.Open(*srcFile)
	if err != nil {
		log.LogErrorf("Error opening dump: %v", err)
		return
	}
	defer inFile.Close()

	var wg sync.WaitGroup
	input := make(chan []byte, 51)

	go func() {
		wg.Add(1)
		for line := range input {
			log.Printf("Bytes: %s", line)
			// process the line
		}
		wg.Done()
	}()

	scanner := bufio.NewScanner(inFile)
	count := 0
	for scanner.Scan() {
		count++
		log.Printf("Scanned: %d", count)
		if err := scanner.Err(); err != nil {
			log.LogErrorf("Error scanning: %v", err)
		}
		newestBytes := scanner.Bytes()
		log.Printf("Bytes: %s", newestBytes)
		input <- newestBytes
	}
	close(input)
	wg.Wait()
}

When I run this, I get the correct output. In particular, note lines 51 and 52.

2014/08/03 17:49:25 Scanned: 42
2014/08/03 17:49:25 Bytes:       <namespace key="115" case="case-sensitive">Citations talk</namespace>
2014/08/03 17:49:25 Scanned: 43
2014/08/03 17:49:25 Bytes:       <namespace key="116" case="case-sensitive">Sign gloss</namespace>
2014/08/03 17:49:25 Scanned: 44
2014/08/03 17:49:25 Bytes:       <namespace key="117" case="case-sensitive">Sign gloss talk</namespace>
2014/08/03 17:49:25 Scanned: 45
2014/08/03 17:49:25 Bytes:       <namespace key="828" case="case-sensitive">Module</namespace>
2014/08/03 17:49:25 Scanned: 46
2014/08/03 17:49:25 Bytes:       <namespace key="829" case="case-sensitive">Module talk</namespace>
2014/08/03 17:49:25 Scanned: 47
2014/08/03 17:49:25 Bytes:     </namespaces>
2014/08/03 17:49:25 Scanned: 48
2014/08/03 17:49:25 Bytes:   </siteinfo>
2014/08/03 17:49:25 Scanned: 49
2014/08/03 17:49:25 Bytes:   <page>
2014/08/03 17:49:25 Scanned: 50
2014/08/03 17:49:25 Bytes:     <title>Wiktionary:Welcome, newcomers</title>
2014/08/03 17:49:25 Scanned: 51
2014/08/03 17:49:25 Bytes:     <ns>4</ns>
2014/08/03 17:49:25 Scanned: 52
2014/08/03 17:49:25 Bytes:     <id>6</id>
2014/08/03 17:49:25 Scanned: 53
2014/08/03 17:49:25 Bytes:     <restrictions>edit=autoconfirmed:move=sysop</restrictions>
2014/08/03 17:49:25 Scanned: 54
2014/08/03 17:49:25 Bytes:     <revision>
2014/08/03 17:49:25 Scanned: 55
2014/08/03 17:49:25 Bytes:       <id>24557508</id>
2014/08/03 17:49:25 Scanned: 56
2014/08/03 17:49:25 Bytes:       <parentid>19020708</parentid>
2014/08/03 17:49:25 Scanned: 57
2014/08/03 17:49:25 Bytes:       <timestamp>2013-12-30T13:50:49Z</timestamp>
2014/08/03 17:49:25 Scanned: 58
2014/08/03 17:49:25 Bytes:       <contributor>
2014/08/03 17:49:25 Scanned: 59

Yet when I print line instead (what the goroutine is receiving), I get the output below. After line 51, the channel blocks and main scans and passes 51 more values to the channel. However, the next line that the goroutine reads is incorrect, and more than that, it is clearly malformed.

Bytes:       <namespace key="828" case="case-sensitive">Module</namespace>
2014/08/03 17:40:52 Bytes:       <namespace key="829" case="case-sensitive">Module talk</namespace>
2014/08/03 17:40:52 Bytes:     </namespaces>
2014/08/03 17:40:52 Bytes:   </siteinfo>
2014/08/03 17:40:52 Bytes:   <page>
2014/08/03 17:40:52 Bytes:     <title>Wiktionary:Welcome, newcomers</title>
2014/08/03 17:40:52 Scanned: 52
2014/08/03 17:40:52 Scanned: 53
2014/08/03 17:40:52 Scanned: 54
2014/08/03 17:40:52 Scanned: 55
2014/08/03 17:40:52 Scanned: 56
2014/08/03 17:40:52 Scanned: 57
2014/08/03 17:40:52 Scanned: 58
2014/08/03 17:40:52 Scanned: 59
2014/08/03 17:40:52 Scanned: 60
2014/08/03 17:40:52 Scanned: 61
2014/08/03 17:40:52 Scanned: 62
2014/08/03 17:40:52 Scanned: 63
2014/08/03 17:40:52 Scanned: 64
2014/08/03 17:40:52 Scanned: 65
2014/08/03 17:40:52 Scanned: 66
2014/08/03 17:40:52 Scanned: 67
2014/08/03 17:40:52 Scanned: 68
2014/08/03 17:40:52 Scanned: 69
2014/08/03 17:40:52 Scanned: 70
2014/08/03 17:40:52 Scanned: 71
2014/08/03 17:40:52 Scanned: 72
2014/08/03 17:40:52 Scanned: 73
2014/08/03 17:40:52 Scanned: 74
2014/08/03 17:40:52 Scanned: 75
2014/08/03 17:40:52 Scanned: 76
2014/08/03 17:40:52 Scanned: 77
2014/08/03 17:40:52 Scanned: 78
2014/08/03 17:40:52 Scanned: 79
2014/08/03 17:40:52 Scanned: 80
2014/08/03 17:40:52 Scanned: 81
2014/08/03 17:40:52 Scanned: 82
2014/08/03 17:40:52 Scanned: 83
2014/08/03 17:40:52 Scanned: 84
2014/08/03 17:40:52 Scanned: 85
2014/08/03 17:40:52 Scanned: 86
2014/08/03 17:40:52 Scanned: 87
2014/08/03 17:40:52 Scanned: 88
2014/08/03 17:40:52 Scanned: 89
2014/08/03 17:40:52 Scanned: 90
2014/08/03 17:40:52 Scanned: 91
2014/08/03 17:40:52 Scanned: 92
2014/08/03 17:40:52 Scanned: 93
2014/08/03 17:40:52 Scanned: 94
2014/08/03 17:40:52 Scanned: 95
2014/08/03 17:40:52 Scanned: 96
2014/08/03 17:40:52 Scanned: 97
2014/08/03 17:40:52 Scanned: 98
2014/08/03 17:40:52 Scanned: 99
2014/08/03 17:40:52 Scanned: 100
2014/08/03 17:40:52 Scanned: 101
2014/08/03 17:40:52 Scanned: 102
2014/08/03 17:40:52 Bytes: nd other refer
2014/08/03 17:40:52 Bytes: nce and instru
2014/08/03 17:40:52 Bytes: tional materials. It stipulates that any copy of the material,
2014/08/03 17:40:52 Bytes: even if modifi
2014/08/03 17:40:52 Bytes: d, carry the same licen
2014/08/03 17:40:52 Bytes: e. Those copies may be sold but, if
2014/08/03 17:40:52 Bytes: produced in quantity, have to be made available i
2014/08/03 17:40:52 Bytes:  a format which fac
2014/08/03 17:40:52 Bytes: litates further editing. 

I have tried to reproduce this in the Go playground but I have been unsuccessful - it seems like this is something to do with the way slices are passed in channels.

Answer 1

Score: 8

The function Scanner.Bytes may return the same slice used internally by the scanner.

> func (s *Scanner) Bytes() []byte
>
> Bytes returns the most recent token generated by a call to Scan. The underlying array may point to data that will be overwritten by a subsequent call to Scan. It does no allocation.

As the documentation says, this slice may be overwritten by subsequent calls to Scanner.Scan. Your code does not guarantee that the slice is no longer in use when Scanner.Scan is called again (in fact, it produces lines and consumes them asynchronously), so the slice may contain garbage by the time the goroutine reads it.
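To make the aliasing visible outside the original program, here is a minimal sketch. The helper name `retainAndCopy`, the sample input, and the 8-byte buffer size are assumptions chosen for illustration; the tiny buffer only makes the scanner reuse its storage sooner. Whether the retained slice actually changes depends on the scanner's internal buffering, which may also be why a small playground input does not reproduce the problem.

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// retainAndCopy scans the first two lines of src with a deliberately small
// scanner buffer. It returns the first token twice: once as retained without
// copying (read back only after the second Scan), and once as copied
// immediately after the first Scan.
func retainAndCopy(src string) (retained, copied string) {
	scanner := bufio.NewScanner(strings.NewReader(src))
	// Assumption for the demo: an 8-byte buffer, so storage is reused early.
	scanner.Buffer(make([]byte, 0, 8), 8)

	scanner.Scan()
	alias := scanner.Bytes()              // shares memory with the scanner's buffer
	safe := append([]byte(nil), alias...) // private copy, unaffected by later scans

	scanner.Scan() // may overwrite the memory that alias points to

	return string(alias), string(safe)
}

func main() {
	retained, copied := retainAndCopy("aaaa\nbbbb\ncccc\n")
	fmt.Printf("retained slice after next Scan: %q\n", retained)
	fmt.Printf("copied slice after next Scan:   %q\n", copied)
}
```

The copied value always holds the first line; the retained one is only valid until the next call to Scan.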

Explicitly copy the slice to make sure that the data is not being overwritten by subsequent calls to Scanner.Scan.

input <- append([]byte(nil), newestBytes...)

Posted by huangapple on August 4, 2014 at 02:00:33
Source: https://go.coder-hub.com/25107540.html