Ignore a line containing a pattern from a long text file in Go

Question

I'm trying to implement a function in Go that ignores lines containing a given pattern in a long text file (ASCII guaranteed). The two functions below, withoutIgnore and withIgnore, both take a filename argument and return a *bytes.Buffer, which can subsequently be written to an io.Writer.

The withIgnore function takes an additional argument, pattern, and excludes lines containing that pattern from the output. The function works, but benchmarking shows it to be about 5x slower than withoutIgnore. Is there a way to improve it?

package main

import (
	"bufio"
	"bytes"
	"io"
	"log"
	"os"
)

func withoutIgnore(f string) (*bytes.Buffer, error) {
	rfd, err := os.Open(f)
	if err != nil {
		log.Fatal(err)
	}

	defer func() {
		if err := rfd.Close(); err != nil {
			log.Fatal(err)
		}
	}()

	inputBuffer := make([]byte, 1048576)
	var bytesRead int

	var bs []byte
	opBuffer := bytes.NewBuffer(bs)

	for {
		bytesRead, err = rfd.Read(inputBuffer)

		if err == io.EOF {
			return opBuffer, nil
		}

		if err != nil {
			return nil, err
		}

		_, err = opBuffer.Write(inputBuffer[:bytesRead])
		if err != nil {
			return nil, err
		}
	}
	return opBuffer, nil
}

func withIgnore(f, pattern string) (*bytes.Buffer, error) {
	rfd, err := os.Open(f)
	if err != nil {
		log.Fatal(err)
	}

	defer func() {
		if err := rfd.Close(); err != nil {
			log.Fatal(err)
		}
	}()

	scanner := bufio.NewScanner(rfd)
	var bs []byte
	buffer := bytes.NewBuffer(bs)
	for scanner.Scan() {
		if !bytes.Contains(scanner.Bytes(), []byte(pattern)) {
			_, err := buffer.WriteString(scanner.Text() + "\n")
			if err != nil {
				return nil, err
			}
		}
	}

	return buffer, nil
}

func main() {
	// buff, err := withoutIgnore("base64dump.log")
	buff, err := withIgnore("base64dump.log", "AUDIT")
	if err != nil {
		log.Fatal(err)
	}

	_, err = buff.WriteTo(os.Stdout)
	if err != nil {
		log.Fatal(err)
	}
}

Benchmark test:

package main

import "testing"

func BenchmarkTestWithoutIgnore(b *testing.B) {
	for i := 0; i < b.N; i++ {
		_, err := withoutIgnore("base64dump.log")
		if err != nil {
			b.Fatal(err)
		}
	}
}

func BenchmarkTestWithIgnore(b *testing.B) {
	for i := 0; i < b.N; i++ {
		_, err := withIgnore("base64dump.log", "AUDIT")
		if err != nil {
			b.Fatal(err)
		}
	}
}

and the base64dump.log file can be generated on the command line using:

base64 /dev/urandom | head -c 10000000 > base64dump.log
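
The benchmarks can then be run with Go's standard tooling, for example (a usage note, not part of the original post):

go test -bench=. -benchmem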

Answer 1

Score: 1

Since ASCII is guaranteed, one can work directly at the byte level.

However, if one checks each byte for line breaks while reading the input and then searches for the pattern again within each line, operations are applied to every single byte.

If, on the other hand, one reads chunks of the input and performs an optimized search for the pattern in the text, not even examining each input byte, the operations per input byte are minimized.

For example, there is the Boyer-Moore string search algorithm. Go's built-in bytes.Index function is also optimized. The achieved speed depends, of course, on the input data and the actual pattern; for the input specified in the question, bytes.Index turned out to be significantly more performant when measured.
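
As a minimal, self-contained illustration of the kind of call the filter below is built on (a sketch, not from the original answer; the sample text and offset are illustrative only):

package main

import (
	"bytes"
	"fmt"
)

func main() {
	chunk := []byte("line one\nAUDIT line two\nline three\n")
	pattern := []byte("AUDIT")

	// bytes.Index returns the byte offset of the first occurrence of
	// pattern in chunk, or -1 if the pattern is absent.
	if hit := bytes.Index(chunk, pattern); hit >= 0 {
		fmt.Printf("pattern found at offset %d\n", hit) // prints offset 9
	}
}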

Procedure

  • Read in a chunk; the chunk size should be significantly larger than the maximum line length. A value >= 64KB is probably a good choice; 1MB was used in the test, as in the question.
  • A chunk usually does not end at a linefeed, so search backwards from the end of the chunk for the nearest linefeed, limit the pattern search to that slice, and remember the remaining data for the next pass.
  • The last chunk does not necessarily end in a linefeed.
  • With the help of the efficient Go function bytes.Index, find the places where the pattern occurs in the chunk.
  • From each found location, search for the preceding and the following linefeed.
  • Then output the chunk up to the beginning of the matching line.
  • Continue the search from the end of the line in which the pattern occurred.
  • If no further occurrence is found, output the rest.
  • Read the next chunk and apply the steps above again, until the end of the file is reached.

Noteworthy

A read operation may return less data than the chunk size, so it makes sense to repeat the read until a full chunk's worth of data has been read.
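
The hand-rolled readChunk helper in the code below does exactly this. As an aside, the standard library's io.ReadFull implements the same repeat-until-full behavior; a minimal alternative sketch (readChunkFull is a hypothetical name, not part of the answer's code) could look like this:

// readChunkFull restores the remainder carried over from the previous chunk
// and lets io.ReadFull repeat reads until the buffer is full. io.ReadFull
// returns io.ErrUnexpectedEOF on a short final read, which is mapped to
// io.EOF here to match the behavior of the hand-rolled loop below.
func readChunkFull(reader io.Reader, chunk, remaining []byte) (int, error) {
	copy(chunk, remaining)
	n, err := io.ReadFull(reader, chunk[len(remaining):])
	if err == io.ErrUnexpectedEOF {
		err = io.EOF
	}
	return len(remaining) + n, err
}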

Benchmark

Optimized code is often significantly more complicated, but the performance is also significantly better, as we will see in a moment.

BenchmarkTestWithoutIgnore-8   	     270	   4137267 ns/op
BenchmarkTestWithIgnore-8      	      54	  22403931 ns/op
BenchmarkTestFilter-8          	     150	   7947454 ns/op

Here, the optimized code (BenchmarkTestFilter-8) is only about 1.9x slower than the operation without filtering, while the BenchmarkTestWithIgnore-8 approach is 5.4x slower than the unfiltered baseline.

Put another way: the optimized code is about 2.8x faster than the unoptimized one.

Code

Of course, here is the code for your own tests:

func filterFile(f, pattern string) (*bytes.Buffer, error) {
	rfd, err := os.Open(f)
	if err != nil {
		log.Fatal(err)
	}
	defer func() {
		if err := rfd.Close(); err != nil {
			log.Fatal(err)
		}
	}()

	reader := bufio.NewReader(rfd)
	return filter(reader, []byte(pattern), 1024*1024)
}

// chunkSize must be larger than the longest line
// a reasonable size is probably >= 64K
func filter(reader io.Reader, pattern []byte, chunkSize int) (*bytes.Buffer, error) {
	var bs []byte
	buffer := bytes.NewBuffer(bs)

	chunk := make([]byte, chunkSize)

	var remaining []byte
	// Read and process chunks until EOF; a short final read still
	// delivers data that is processed as the last chunk.
	for lastChunk := false; !lastChunk; {
		n, err := readChunk(reader, chunk, remaining, chunkSize)
		if err != nil {
			if err == io.EOF {
				lastChunk = true
			} else {
				return nil, err
			}
		}

		// Carve the trailing partial line off the chunk and keep it for
		// the next pass, so the search below only sees complete lines.
		remaining = remaining[:0]
		if !lastChunk {
			for i := n - 1; i > 0; i-- {
				if chunk[i] == '\n' {
					remaining = append(remaining, chunk[i+1:n]...)
					n = i + 1
					break
				}
			}
		}

		// Repeatedly locate the pattern and copy everything up to the
		// start of each matching line, skipping the matching line itself.
		s := 0
		for s < n {
			hit := bytes.Index(chunk[s:n], pattern)
			if hit < 0 {
				break
			}
			hit += s
			startOfLine := hit
			for ; startOfLine > 0; startOfLine-- {
				if chunk[startOfLine] == '\n' {
					startOfLine++
					break
				}
			}
			endOfLine := hit + len(pattern)
			for ; endOfLine < n; endOfLine++ {
				if chunk[endOfLine] == '\n' {
					break
				}
			}
			endOfLine++

			_, err = buffer.Write(chunk[s:startOfLine])
			if err != nil {
				return nil, err
			}
			s = endOfLine
		}

		if s < n {
			_, err = buffer.Write(chunk[s:n])
			if err != nil {
				return nil, err
			}
		}
	}

	return buffer, nil
}

// readChunk first restores the remainder of the previous chunk and then
// repeats Read until chunkSize bytes are available or an error
// (including io.EOF) occurs.
func readChunk(reader io.Reader, chunk, remaining []byte, chunkSize int) (int, error) {
	copy(chunk, remaining)
	r := len(remaining)
	for r < chunkSize {
		n, err := reader.Read(chunk[r:])
		r += n
		if err != nil {
			return r, err
		}
	}
	return r, nil
}
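
As a side note (an observation, not part of the original answer): the manual backward and forward scans for the linefeed could also be expressed with the standard library's bytes.LastIndexByte and bytes.IndexByte. A sketch of such a helper, intended to match the manual scans above (lineBounds is a hypothetical name):

// lineBounds returns the start of the line containing a pattern hit at
// offset hit in chunk[:n], and the index just past the line's terminating
// '\n' (n+1 if the line is unterminated), mirroring the loops in filter.
func lineBounds(chunk []byte, n, hit, patternLen int) (startOfLine, endOfLine int) {
	startOfLine = bytes.LastIndexByte(chunk[:hit], '\n') + 1
	endOfLine = hit + patternLen
	if i := bytes.IndexByte(chunk[endOfLine:n], '\n'); i >= 0 {
		endOfLine += i + 1
	} else {
		endOfLine = n + 1
	}
	return startOfLine, endOfLine
}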

And the benchmark part might look something like this:

func BenchmarkTestFilter(b *testing.B) {
	for i := 0; i < b.N; i++ {
		_, err := filterFile("base64dump.log", "AUDIT")
		if err != nil {
			b.Fatal(err)
		}
	}
}

The filter function was split, and the actual work is done in func filter(reader io.Reader, pattern []byte, chunkSize int) (*bytes.Buffer, error).

By injecting a reader and a chunkSize, the code is already prepared for unit tests. Such tests are missing here, but they are definitely recommended when dealing with index arithmetic like this.
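
For instance, such a test might look like this (a hypothetical sketch; the deliberately small chunkSize forces the chunk-boundary handling to run):

package main

import (
	"strings"
	"testing"
)

func TestFilter(t *testing.T) {
	input := "keep one\ndrop AUDIT here\nkeep two\n"
	// chunkSize must exceed the longest line; 32 does, while still
	// splitting this input across two chunks.
	buf, err := filter(strings.NewReader(input), []byte("AUDIT"), 32)
	if err != nil {
		t.Fatal(err)
	}
	want := "keep one\nkeep two\n"
	if got := buf.String(); got != want {
		t.Errorf("got %q, want %q", got, want)
	}
}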

However, the main point here was to find a way to improve performance significantly.
