Download files by chunks in multiple threads in Go


Question

I need to download files, chunk by chunk, in multiple threads.
For example, I have 1k files, each ~100 MB-1 GB, and I can only download these files in 4 KB (4096-byte) chunks (each HTTP GET request returns only 4 KB).

Downloading in a single thread might be too slow, so I want to download the files in, say, 20 threads (one thread per file), and I also need to download several chunks in each of those threads simultaneously.

Is there an example that shows this kind of logic?


Answer 1

Score: 7

This is an example of how to set up a concurrent downloader. Things to be aware of are bandwidth, memory, and disk space: you can kill your bandwidth by trying to do too much at once, and the same goes for memory. You are downloading pretty big files, so memory can be an issue. Another thing to note is that by using goroutines you lose request order, so if the order of the returned bytes matters, this approach will not work as-is: you would need to know the byte order to assemble the file at the end. That means downloading one file at a time is best, unless you implement a way to keep track of the order (for example, a global map[int][]byte keyed by chunk index, guarded by a mutex to prevent race conditions; a sketch of that idea follows the code below). An alternative that doesn't involve Go (assuming you have a Unix machine handy) is to use curl; see http://osxdaily.com/2014/02/13/download-with-curl/

package main

import (
	"bytes"
	"fmt"
	"io"
	"log"
	"net/http"
	"os"
	"sync"
	"time"
)

// You have to be careful here, because downloading too many files at once
// can exhaust memory. This is just an example that can be modified.
func downloader(wg *sync.WaitGroup, sema chan struct{}, fileNum int, URL string) {
	sema <- struct{}{}
	defer func() {
		<-sema
		wg.Done()
	}()

	// Timeout is a time.Duration, so a bare 10 would mean 10 nanoseconds.
	client := &http.Client{Timeout: 10 * time.Second}
	res, err := client.Get(URL)
	if err != nil {
		// Don't log.Fatal inside a goroutine: it would kill the whole
		// program, including every other download in flight.
		log.Printf("file %d: %v", fileNum, err)
		return
	}
	defer res.Body.Close()
	var buf bytes.Buffer
	// I'm copying to a buffer before writing it to a file.
	// I could also just io.Copy straight into the file and
	// save memory by dumping to the disk directly.
	if _, err := io.Copy(&buf, res.Body); err != nil {
		log.Printf("file %d: %v", fileNum, err)
		return
	}
	// Write the bytes to a file.
	if err := os.WriteFile(fmt.Sprintf("file%d.txt", fileNum), buf.Bytes(), 0644); err != nil {
		log.Printf("file %d: %v", fileNum, err)
	}
}

func main() {
	links := []string{
		"url1",
		"url2", // etc...
	}
	var wg sync.WaitGroup
	// Limit to four downloads at a time; the buffered channel acts as a semaphore.
	limiter := make(chan struct{}, 4)
	for i, link := range links {
		wg.Add(1)
		go downloader(&wg, limiter, i, link)
	}
	wg.Wait()
}
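Below is a minimal sketch of the ordering idea mentioned above: goroutines store each chunk in a map keyed by chunk index, a mutex guards the map, and the file is reassembled in order once everything has arrived. The fetchChunk helper, the chunk count, and the output file name are hypothetical placeholders, not part of the answer's code.

package main

import (
	"log"
	"os"
	"sync"
)

// fetchChunk stands in for whatever performs one 4 KB GET; it is a
// hypothetical placeholder, not a real API from the code above.
func fetchChunk(url string, index int) []byte {
	return []byte{} // real code would do the HTTP request for chunk `index` here
}

func main() {
	const numChunks = 8 // assumed known in advance
	url := "url1"       // placeholder

	var (
		mu     sync.Mutex
		chunks = make(map[int][]byte) // chunk index -> bytes
		wg     sync.WaitGroup
	)
	for i := 0; i < numChunks; i++ {
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			data := fetchChunk(url, i)
			mu.Lock() // the mutex prevents a data race on the map
			chunks[i] = data
			mu.Unlock()
		}(i)
	}
	wg.Wait()

	// Reassemble in order once every chunk has arrived.
	f, err := os.Create("out.bin")
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()
	for i := 0; i < numChunks; i++ {
		if _, err := f.Write(chunks[i]); err != nil {
			log.Fatal(err)
		}
	}
}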

Answer 2

Score: 0

You can check the implementation in aws-go-sdk:
https://github.com/aws/aws-sdk-go-v2/blob/main/feature/s3/manager/download.go

  1. Create n concurrent goroutines.
  2. Download the chunks in those n goroutines.
  3. Use writer.WriteAt() to seek to and write at a given position (see the sketch after this list).
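Here is a minimal sketch of that pattern, assuming the server supports HTTP Range requests and that the total file size is known in advance (the URL, output file name, and chunk size are illustrative placeholders, not taken from the SDK):

package main

import (
	"fmt"
	"io"
	"net/http"
	"os"
	"sync"
)

const chunkSize = 4096 // 4 KB per request, as in the question

// downloadRange fetches bytes [off, off+chunkSize) with a Range header
// and writes them at the same offset in the destination file.
func downloadRange(url string, f *os.File, off int64) error {
	req, err := http.NewRequest("GET", url, nil)
	if err != nil {
		return err
	}
	req.Header.Set("Range", fmt.Sprintf("bytes=%d-%d", off, off+chunkSize-1))
	res, err := http.DefaultClient.Do(req)
	if err != nil {
		return err
	}
	defer res.Body.Close()
	buf, err := io.ReadAll(res.Body)
	if err != nil {
		return err
	}
	// WriteAt lets each goroutine write its own region independently,
	// so no mutex is needed around the file.
	_, err = f.WriteAt(buf, off)
	return err
}

func main() {
	url := "url1"          // placeholder, as in the answer above
	size := int64(1 << 20) // total size; assumed known (e.g. from a HEAD request)

	f, err := os.Create("out.bin")
	if err != nil {
		panic(err)
	}
	defer f.Close()

	var wg sync.WaitGroup
	sema := make(chan struct{}, 20) // at most 20 chunks in flight
	for off := int64(0); off < size; off += chunkSize {
		wg.Add(1)
		go func(off int64) {
			defer wg.Done()
			sema <- struct{}{}
			defer func() { <-sema }()
			if err := downloadRange(url, f, off); err != nil {
				fmt.Println("chunk at", off, "failed:", err)
			}
		}(off)
	}
	wg.Wait()
}

Because every goroutine writes to a disjoint region of the file, WriteAt needs no locking, which is what makes this approach fit concurrent chunked downloads.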
