How to chunk a file into 4 equal files
Question
I have a very large file, for example 100MB, and I need to split it into four 25MB files using Go.
The problem is that if I use goroutines to read the file, the order of the data inside the output files is not preserved. The code I used is:
package main

import (
	"bufio"
	"fmt"
	"log"
	"os"
	"sync"

	"github.com/google/uuid"
)

func main() {
	file, err := os.Open("sampletest.txt")
	if err != nil {
		log.Fatal(err)
	}
	defer file.Close()
	lines := make(chan string)
	// start four workers to do the heavy lifting
	wc1 := startWorker(lines)
	wc2 := startWorker(lines)
	wc3 := startWorker(lines)
	wc4 := startWorker(lines)
	scanner := bufio.NewScanner(file)
	go func() {
		defer close(lines)
		for scanner.Scan() {
			lines <- scanner.Text()
		}
		if err := scanner.Err(); err != nil {
			log.Fatal(err)
		}
	}()
	writefiles(wc1, wc2, wc3, wc4)
}

func writefile(data string) {
	file, err := os.Create("chunks/" + uuid.New().String() + ".txt")
	if err != nil {
		fmt.Println(err)
	}
	defer file.Close()
	file.WriteString(data)
}

func startWorker(lines <-chan string) <-chan string {
	finished := make(chan string)
	go func() {
		defer close(finished)
		for line := range lines {
			finished <- line
		}
	}()
	return finished
}

func writefiles(cs ...<-chan string) {
	var wg sync.WaitGroup
	output := func(c <-chan string) {
		var d string
		for n := range c {
			d += n
			d += "\n"
		}
		writefile(d)
		wg.Done()
	}
	wg.Add(len(cs))
	for _, c := range cs {
		go output(c)
	}
	go func() {
		wg.Wait()
	}()
}
With this code my file is split into 4 files of equal size, but the order of the data in them is not preserved.
I am very new to Go, so any suggestions are highly appreciated.
I took this code from some site and tweaked it here and there to meet my requirements.
Answer 1
Score: 1
> I took this code from some site and tweaked here and there to meet my requirements.
Based on your statement, you should be able to change the code from running concurrently to running sequentially. That is far easier than adding concurrency to existing sequential code.
The work is basically just this: remove the concurrent part.
Anyway, below is a simple example of how to achieve what you want. I used your code as the base and removed everything related to concurrent processing.
package main

import (
	"bufio"
	"log"
	"os"
	"strings"

	"github.com/google/uuid"
)

func main() {
	split := 4
	file, err := os.Open("file.txt")
	if err != nil {
		log.Fatal(err)
	}
	defer file.Close()

	// read every line into memory, in file order
	scanner := bufio.NewScanner(file)
	texts := make([]string, 0)
	for scanner.Scan() {
		text := scanner.Text()
		texts = append(texts, text)
	}
	if err := scanner.Err(); err != nil {
		log.Fatal(err)
	}

	// number of lines per chunk; this splits by line count, so the chunk
	// sizes in bytes are only roughly equal
	lengthPerSplit := len(texts) / split
	for i := 0; i < split; i++ {
		if i+1 == split {
			// the last chunk also takes any remaining lines
			chunkTexts := texts[i*lengthPerSplit:]
			writefile(strings.Join(chunkTexts, "\n"))
		} else {
			chunkTexts := texts[i*lengthPerSplit : (i+1)*lengthPerSplit]
			writefile(strings.Join(chunkTexts, "\n"))
		}
	}
}

func writefile(data string) {
	file, err := os.Create("chunks-" + uuid.New().String() + ".txt")
	if err != nil {
		// stop here: writing to a file that failed to open cannot succeed
		log.Fatal(err)
	}
	defer file.Close()
	if _, err := file.WriteString(data); err != nil {
		log.Fatal(err)
	}
}
Answer 2
Score: 1
Here is a simple file splitter. You can handle the leftover bytes however you like; I write them to a 5th file.
package main

import (
	"bufio"
	"fmt"
	"io"
	"os"
)

func main() {
	file, err := os.Open("sample-text-file.txt")
	if err != nil {
		panic(err)
	}
	defer file.Close()

	// to divide the file into four chunks
	info, err := file.Stat()
	if err != nil {
		panic(err)
	}
	chunkSize := int(info.Size() / 4)

	// reader with a chunk-sized buffer
	bufR := bufio.NewReaderSize(file, chunkSize)

	// Notice the range over an array of len 5: after the 4 full chunks,
	// any leftover bytes are written to a 5th file
	for i := range [5]int{} {
		chunk := make([]byte, chunkSize)
		// io.ReadFull keeps reading until chunk is full; a single Read
		// call is allowed to return fewer bytes than requested
		rlen, err := io.ReadFull(bufR, chunk)
		fmt.Println("Read: ", rlen)
		if err == io.EOF {
			// nothing left over (size divisible by 4), so no 5th file
			break
		}
		if err != nil && err != io.ErrUnexpectedEOF {
			panic(err)
		}
		writeFile(i, chunk[:rlen])
	}
}

// Notice that a slice is only a small header pointing at the data, so
// passing it by value does not copy the underlying bytes
func writeFile(i int, data []byte) {
	fname := fmt.Sprintf("file_%v", i)
	f, err := os.Create(fname)
	if err != nil {
		panic(err)
	}
	defer f.Close()
	w := bufio.NewWriterSize(f, len(data))
	wLen, err := w.Write(data)
	if err != nil {
		panic(err)
	}
	if err := w.Flush(); err != nil {
		panic(err)
	}
	fmt.Println("Wrote ", wLen, "to", fname)
}