How to chunk a file into 4 equal files

Question

I have a very large file, for example 100 MB, and I need to split it into four 25 MB files using Go.

The problem is that if I read the file using goroutines, the order of the data inside the files is not preserved. The code I used is:

package main

import (
	"bufio"
	"fmt"
	"log"
	"os"
	"sync"

	"github.com/google/uuid"
)

func main() {
	file, err := os.Open("sampletest.txt")
	if err != nil {
		log.Fatal(err)
	}
	defer file.Close()

	lines := make(chan string)
	// start four workers to do the heavy lifting
	wc1 := startWorker(lines)
	wc2 := startWorker(lines)
	wc3 := startWorker(lines)
	wc4 := startWorker(lines)
	scanner := bufio.NewScanner(file)

	go func() {
		defer close(lines)
		for scanner.Scan() {
			lines <- scanner.Text()
		}

		if err := scanner.Err(); err != nil {
			log.Fatal(err)
		}
	}()

	writefiles(wc1, wc2, wc3, wc4)
}

func writefile(data string) {
	file, err := os.Create("chunks/" + uuid.New().String() + ".txt")
	if err != nil {
		fmt.Println(err)
	}
	defer file.Close()
	file.WriteString(data)
}

func startWorker(lines <-chan string) <-chan string {
	finished := make(chan string)
	go func() {
		defer close(finished)
		for line := range lines {
			finished <- line
		}
	}()
	return finished
}

func writefiles(cs ...<-chan string) {
	var wg sync.WaitGroup

	output := func(c <-chan string) {
		var d string
		for n := range c {
			d += n
			d += "\n"
		}
		writefile(d)
		wg.Done()
	}
	wg.Add(len(cs))
	for _, c := range cs {
		go output(c)
	}

	go func() {
		wg.Wait()
	}()
}

With this code my file is split into 4 equal files, but the order of the data in them is not preserved.
I am very new to Go; any suggestions are highly appreciated.

I took this code from some site and tweaked it here and there to meet my requirements.

Answer 1

Score: 1

> I took this code from some site and tweaked here and there to meet my requirements.

Based on your statement, you should be able to change the code from running concurrently to running sequentially. That is far easier than adding concurrency to existing code.

The work is basically just this: remove the concurrent part.

Anyway, below is a simple example of how to achieve what you want. I used your code as the base and removed everything related to concurrent processing.

package main

import (
	"bufio"
	"fmt"
	"log"
	"os"
	"strings"

	"github.com/google/uuid"
)

func main() {
	split := 4 // number of output files

	file, err := os.Open("file.txt")
	if err != nil {
		log.Fatal(err)
	}
	defer file.Close()

	// read every line into memory, in file order
	scanner := bufio.NewScanner(file)
	texts := make([]string, 0)
	for scanner.Scan() {
		text := scanner.Text()
		texts = append(texts, text)
	}
	if err := scanner.Err(); err != nil {
		log.Fatal(err)
	}

	// write consecutive groups of lines; the last file also takes the remainder
	lengthPerSplit := len(texts) / split
	for i := 0; i < split; i++ {
		if i+1 == split {
			chunkTexts := texts[i*lengthPerSplit:]
			writefile(strings.Join(chunkTexts, "\n"))
		} else {
			chunkTexts := texts[i*lengthPerSplit : (i+1)*lengthPerSplit]
			writefile(strings.Join(chunkTexts, "\n"))
		}
	}
}

// writefile stores data in a new file with a random UUID in its name
func writefile(data string) {
	file, err := os.Create("chunks-" + uuid.New().String() + ".txt")
	if err != nil {
		fmt.Println(err)
		return // nothing to write to if the file could not be created
	}
	defer file.Close()
	file.WriteString(data)
}
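
The slicing arithmetic in this answer can also be isolated into a small function, which makes it easier to check on its own. The following is only a sketch under a few assumptions: groupLines is a hypothetical helper name, a strings.Reader stands in for the opened file, and the chunk-N.txt names are illustrative. An index in the name (instead of a random UUID) keeps the chunk order recoverable afterwards.

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// groupLines is a hypothetical helper, not part of the answer above.
// It returns parts groups of consecutive lines; the last group also
// takes any remainder, mirroring the loop in the answer.
func groupLines(lines []string, parts int) [][]string {
	if parts <= 0 {
		return nil
	}
	per := len(lines) / parts
	groups := make([][]string, 0, parts)
	for i := 0; i < parts; i++ {
		start, end := i*per, (i+1)*per
		if i == parts-1 {
			end = len(lines) // last group takes the remainder
		}
		groups = append(groups, lines[start:end])
	}
	return groups
}

func main() {
	// A strings.Reader stands in for the opened file in this sketch.
	input := strings.NewReader("l1\nl2\nl3\nl4\nl5\nl6\nl7\nl8\nl9\nl10\n")

	var lines []string
	scanner := bufio.NewScanner(input)
	for scanner.Scan() {
		lines = append(lines, scanner.Text())
	}
	if err := scanner.Err(); err != nil {
		panic(err)
	}

	for i, g := range groupLines(lines, 4) {
		// The index in the name records the chunk order.
		fmt.Printf("chunk-%d.txt: %s\n", i, strings.Join(g, ","))
	}
}
```

With the ten sample lines, the first three groups hold two lines each and the last group holds the remaining four. Because everything runs in a single goroutine, the lines stay in file order.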

Answer 2

Score: 1

Here is a simple file splitter. It writes any leftover bytes to a fifth file; you can handle the leftovers differently if you prefer.

package main

import (
	"bufio"
	"fmt"
	"io"
	"os"
)

func main() {
	file, err := os.Open("sample-text-file.txt")
	if err != nil {
		panic(err)
	}
	defer file.Close()

	// to divide the file into four chunks
	info, err := file.Stat()
	if err != nil {
		panic(err)
	}
	chunkSize := int(info.Size() / 4)

	// reader with a buffer of chunk size
	bufR := bufio.NewReaderSize(file, chunkSize)

	// Notice the range over an array of len 5: after the 4 chunks,
	// any leftover bytes are written to a 5th file
	for i := range [5]int{} {
		reader := make([]byte, chunkSize)
		// io.ReadFull keeps reading until the slice is full or the input ends
		rlen, err := io.ReadFull(bufR, reader)
		fmt.Println("Read: ", rlen)
		if err == io.EOF {
			break // no leftover bytes, so no 5th file
		}
		if err != nil && err != io.ErrUnexpectedEOF {
			panic(err)
		}
		writeFile(i, rlen, &reader)
	}
}

// bufW is passed as a pointer here; note that a slice value is only a small
// header, so passing it by value would not copy the underlying bytes either
func writeFile(i int, rlen int, bufW *[]byte) {
	fname := fmt.Sprintf("file_%v", i)
	f, err := os.Create(fname)
	if err != nil {
		panic(err)
	}
	defer f.Close()

	w := bufio.NewWriterSize(f, rlen)
	wbytes := *(bufW)
	wLen, err := w.Write(wbytes[:rlen])
	if err != nil {
		panic(err)
	}
	fmt.Println("Wrote ", wLen, "to", fname)
	if err := w.Flush(); err != nil {
		panic(err)
	}
}
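
If a fifth leftover file is not wanted, one option is to spread the remainder over the first few chunks so that exactly four files come out, differing by at most one byte. The following is a sketch of that idea, not the answer's method: chunkSizes, splitAt and the newChunk callback are hypothetical names, and a bytes.Reader stands in for the file. *os.File satisfies io.ReaderAt, so a file opened with os.Open could presumably be passed as src, with newChunk wrapping os.Create; real code would also need to close each created file.

```go
package main

import (
	"bytes"
	"fmt"
	"io"
)

// chunkSizes is a hypothetical helper. It splits total bytes into parts
// sizes that differ by at most one byte, so no leftover chunk is needed.
func chunkSizes(total int64, parts int) []int64 {
	if parts <= 0 {
		return nil
	}
	sizes := make([]int64, parts)
	base, extra := total/int64(parts), total%int64(parts)
	for i := range sizes {
		sizes[i] = base
		if int64(i) < extra {
			sizes[i]++ // the first `extra` chunks take one more byte
		}
	}
	return sizes
}

// splitAt copies each consecutive byte range of src into its own writer,
// in order. newChunk is a hypothetical callback that supplies the
// destination for chunk i.
func splitAt(src io.ReaderAt, total int64, parts int, newChunk func(i int) (io.Writer, error)) error {
	var offset int64
	for i, size := range chunkSizes(total, parts) {
		w, err := newChunk(i)
		if err != nil {
			return err
		}
		// A SectionReader exposes only bytes [offset, offset+size) of src.
		if _, err := io.Copy(w, io.NewSectionReader(src, offset, size)); err != nil {
			return err
		}
		offset += size
	}
	return nil
}

func main() {
	data := []byte("0123456789") // stands in for the file contents
	chunks := make([]bytes.Buffer, 4)
	err := splitAt(bytes.NewReader(data), int64(len(data)), 4, func(i int) (io.Writer, error) {
		return &chunks[i], nil
	})
	if err != nil {
		panic(err)
	}
	for i := range chunks {
		fmt.Printf("file_%d: %q\n", i, chunks[i].String())
	}
}
```

For the ten sample bytes the sizes come out as 3, 3, 2 and 2. For a file whose size is an exact multiple of four, such as the 100 MB example in the question, every chunk has the same size.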

Posted by huangapple on August 20, 2021, 20:15:25
Please keep this link when reposting: https://go.coder-hub.com/68862107.html