What is the fastest way to rewrite a file with Go?

Question

I have a large file (too large to fit entirely in memory) containing strings of various sizes. I want to write these strings to another file, with each string converted to uppercase. What is the fastest way to achieve this in Go?

Here is the most efficient approach I could come up with. Any ideas on how to make it faster?

package main

import (
	"bufio"
	"log"
	"os"
	"strings"
)

func main() {
	inputFile, err := os.Open("input.txt")
	if err != nil {
		log.Fatal(err)
	}
	defer inputFile.Close()

	outputFile, err := os.Create("output.txt")
	if err != nil {
		log.Fatal(err)
	}
	defer outputFile.Close()

	scanner := bufio.NewScanner(inputFile)
	writer := bufio.NewWriter(outputFile)

	for scanner.Scan() {
		line := scanner.Text()
		capitalized := strings.ToUpper(line)
		_, err := writer.WriteString(capitalized + "\n")
		if err != nil {
			log.Fatal(err)
		}
	}

	err = writer.Flush()
	if err != nil {
		log.Fatal(err)
	}
}

Answer 1

Score: 1

A good place to start is to benchmark the code with Go's testing package.

For benchmark data, I use a Linux dictionary file, /usr/share/dict/brazilian: 275,502 words, largely lowercase, 3,077,701 bytes. It's the best I could do given your vague description of your file. To keep disk I/O out of the benchmark, I use a bytes.Reader as the io.Reader and ioutil.Discard as the io.Writer.

The results for your code:

$ go test upper_so_test.go -run=! -benchmem -bench=.
BenchmarkSO-12   48  22765120 ns/op  8143216 B/op  550993 allocs/op

The results for Blunderific's code:

BenchmarkB-12    94  13061407 ns/op  3782866 B/op  275505 allocs/op

As a Proof of Concept (PoC), using the dictionary file, I wrote code which uses minimal CPU and memory. The results, so far, for my PoC code:

BenchmarkTU-12  182   6457334 ns/op     8240 B/op       3 allocs/op

Running my PoC code as a program, using SSD file storage for reading and writing the dictionary file, takes a few milliseconds:

$ time ./upper
real	0m0.031s
user	0m0.014s
sys	0m0.009s

Without even a small sample of your file, I can't make concrete recommendations for improving performance. However, with the dictionary file, my PoC benchmark results compared with yours (6,457,334 vs. 22,765,120 ns/op; 8,240 vs. 8,143,216 B/op; 3 vs. 550,993 allocs/op) make it likely that your profligate use of CPU and memory is hurting performance.


upper_so_test.go:

package main

import (
	"bufio"
	"bytes"
	"io"
	"io/ioutil"
	"os"
	"strings"
	"testing"
)

// SOToUpper is the question's code, refactored to take an io.Reader and
// an io.Writer so that it can be benchmarked without disk I/O.
func SOToUpper(r io.Reader, w io.Writer) error {
	scanner := bufio.NewScanner(r)
	writer := bufio.NewWriter(w)
	for scanner.Scan() {
		line := scanner.Text()
		capitalized := strings.ToUpper(line)
		_, err := writer.WriteString(capitalized + "\n")
		if err != nil {
			return err
		}
	}
	return writer.Flush()
}

// benchData holds the dictionary file, loaded once at package initialization.
var benchData = func() []byte {
	data, err := os.ReadFile(`/usr/share/dict/brazilian`)
	if err != nil {
		panic(err)
	}
	return data
}()

func BenchmarkSO(b *testing.B) {
	for i := 0; i < b.N; i++ {
		// Read from memory and discard the output.
		r := bytes.NewReader(benchData)
		w := ioutil.Discard
		err := SOToUpper(r, w)
		if err != nil {
			b.Error(err)
		}
	}
}

Answer 2

Score: 0

Use []byte instead of string in the inner loop to avoid converting from []byte to string.

The Scanner.Text() method copies the data into a newly allocated string.
The Scanner.Bytes() method returns a slice into the scanner's buffer without allocating. The slice is only valid until the next call to Scan.

for scanner.Scan() {
	line := scanner.Bytes()
	capitalized := bytes.ToUpper(line)
	_, err := writer.Write(capitalized)
	if err != nil {
		log.Fatal(err)
	}
	err = writer.WriteByte('\n')
	if err != nil {
		log.Fatal(err)
	}
}

huangapple
  • Posted on April 14, 2023 at 08:05:02
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76010680.html