How to work with large data arrays (over 10MiB) effectively in Go?

Question

I am using Go to download files from one server and, after manipulating them, send them to another server.

File sizes vary from 1 MB to 200 MB.

Currently my code is pretty simple: I am using http.Client and bytes.Buffer.
It takes a long time to handle the big files (100 MB to 200 MB), and there are a lot of them.

After a quick profiling run, I see that most of the time is spent in bytes.(*Buffer).grow.
How can I create big buffers, for example 16 MB?

What can I do to make my code more efficient? Are there general tips for handling large HTTP requests?

Edit

To explain exactly what I am trying to do:
I have CouchDB documents (with attachments) that I am trying to copy to another CouchDB instance.
The CouchDB documents range from 30 MB to 200 MB. Copying tiny (2 to 10 MB) CouchDB documents is really fast.

But sending the document over the wire is really slow.
I am currently profiling and trying @Evan's answer to see where my problem is.

Answer 1

Score: 5

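Take a look at the documentation for bytes.NewBuffer: http://golang.org/pkg/bytes/#NewBuffer

It sounds like you can create a 16 MB byte slice and use it to initialize the buffer. The slice should have the capacity you want but a length of zero, because NewBuffer treats the slice's existing contents as data already in the buffer.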
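
A minimal sketch of that suggestion, assuming the 16 MB figure from the question is a sensible upper bound for your data. The helper name newBuffer is made up for illustration and is not part of the bytes package.

```go
package main

import (
	"bytes"
	"fmt"
)

// newBuffer (hypothetical helper) returns an empty bytes.Buffer whose backing
// array already has room for size bytes, so writes up to that size do not
// need to reallocate.
func newBuffer(size int) *bytes.Buffer {
	// Length 0, capacity size: the buffer starts empty but pre-allocated.
	return bytes.NewBuffer(make([]byte, 0, size))
}

func main() {
	const size = 16 << 20 // 16 MiB, the figure mentioned in the question
	buf := newBuffer(size)
	buf.WriteString("document bytes go here")
	fmt.Println(buf.Len(), buf.Cap() >= size)
}
```

If a document turns out to be larger than the chosen size, the buffer still grows as usual, so the pre-allocation only removes the reallocations below that size.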

Answer 2

Score: 2

If all your program needs to do is copy the data, it has no need to keep the data in memory.

A strong feature of Go's standard library is its sensible use of interfaces: the Body member of http.Response implements the io.ReadCloser interface, and the body argument of http.Client's Post method is an io.Reader, which any io.ReadCloser satisfies.

So you could proceed like this:

  1. Perform a request for the document. You'll get back an instance of http.Response, whose Body member is of type io.ReadCloser.

    Note that at this point you haven't yet read the body from the "source" server: to do that you have to drain the io.ReadCloser in Body.

  2. Initiate another (presumably POST) request to send the data, and when making the request supply it the Body member obtained in the first step.

    Once this request is done piping your data, that Body member must be closed (http.Post does this for you, as noted below).

Something like this:

import (
    "io"
    "net/http"
)

// myPostType is a placeholder; use the content type your target expects.
const myPostType = "application/octet-stream"

func Pipe(from, to string) error {
    src, err := http.Get(from)
    if err != nil {
        return err
    }
    // http.Post streams src.Body to the target and closes it.
    dst, err := http.Post(to, myPostType, src.Body)
    if err != nil {
        return err
    }
    defer dst.Body.Close()
    // Read the reply to the end so the connection can be reused.
    _, err = io.Copy(io.Discard, dst.Body)
    return err
}

In this code, http.Post will read from src.Body and then Close() it itself.

You might add a bytes.Buffer into the mix in the hope of reducing the number of syscalls performed, but don't do that unless the plain method does not work.
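
One detail worth knowing about the streaming approach: when the request body is a plain io.Reader, net/http cannot tell its length in advance and sends the request with chunked transfer encoding. If the receiving server prefers a Content-Length, a variant along these lines may help. It is only a sketch: the function names pipe and runDemo, the content type, and the two in-process test servers are assumptions made for the demonstration, not part of the original answer.

```go
package main

import (
	"bytes"
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
)

// pipe streams the body fetched from `from` straight into a POST to `to`
// without holding the whole document in memory.
func pipe(client *http.Client, from, to, contentType string) error {
	src, err := client.Get(from)
	if err != nil {
		return err
	}
	defer src.Body.Close() // harmless if the transport already closed it
	if src.StatusCode != http.StatusOK {
		return fmt.Errorf("GET %s: %s", from, src.Status)
	}

	req, err := http.NewRequest(http.MethodPost, to, src.Body)
	if err != nil {
		return err
	}
	req.Header.Set("Content-Type", contentType)
	// Forward the length when the source reported one; otherwise the
	// request goes out with chunked transfer encoding.
	if src.ContentLength > 0 {
		req.ContentLength = src.ContentLength
	}

	dst, err := client.Do(req)
	if err != nil {
		return err
	}
	defer dst.Body.Close()
	if _, err := io.Copy(io.Discard, dst.Body); err != nil {
		return err
	}
	if dst.StatusCode >= 300 {
		return fmt.Errorf("POST %s: %s", to, dst.Status)
	}
	return nil
}

// runDemo pipes n bytes between two in-process test servers and reports the
// byte count and Content-Length that the destination saw.
func runDemo(n int) (received, declared int64, err error) {
	payload := bytes.Repeat([]byte{'x'}, n)
	source := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		w.Header().Set("Content-Length", fmt.Sprint(n))
		w.Write(payload)
	}))
	defer source.Close()
	sink := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		declared = r.ContentLength
		received, _ = io.Copy(io.Discard, r.Body)
	}))
	defer sink.Close()

	err = pipe(http.DefaultClient, source.URL, sink.URL, "application/octet-stream")
	return received, declared, err
}

func main() {
	received, declared, err := runDemo(1 << 20)
	fmt.Println(received, declared, err)
}
```

In real use you would call pipe with the URLs of the two servers and whatever content type the target expects. Whether the target actually needs a Content-Length is something to check against its own documentation.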

Answer 3

Score: 1

As @Evan already pointed out, you can choose an initial buffer size when creating a new buffer.

Growing a buffer is expensive: when the data no longer fits, grow allocates a larger backing array and copies the existing contents into it, which is why your grow calls take so long. Picking the right initial size is therefore key. The right allocation strategy depends on many factors, and you might implement your own method of growing buffers depending on your application's profile.
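
One possible sizing strategy, sketched under the assumption that the source server reports a Content-Length (available as resp.ContentLength on an http.Response): reserve that much space once, before reading. The helper names newSizedBuffer and readSized are invented for this example. The extra bytes.MinRead accounts for the spare room that Buffer.ReadFrom asks for before each read, including the final one that only reports io.EOF.

```go
package main

import (
	"bytes"
	"fmt"
	"io"
)

// newSizedBuffer (hypothetical helper) returns a buffer with room for
// contentLength bytes. A value <= 0 means the length is unknown, in which
// case the buffer simply grows on demand.
func newSizedBuffer(contentLength int64) *bytes.Buffer {
	buf := new(bytes.Buffer)
	if contentLength > 0 {
		// Reserve the body size plus the spare room ReadFrom wants before
		// its last read, so a body of the advertised size fits in one allocation.
		buf.Grow(int(contentLength) + bytes.MinRead)
	}
	return buf
}

// readSized drains r into a buffer sized from the advertised length.
func readSized(r io.Reader, contentLength int64) (*bytes.Buffer, error) {
	buf := newSizedBuffer(contentLength)
	_, err := buf.ReadFrom(r)
	return buf, err
}

func main() {
	body := bytes.Repeat([]byte{'x'}, 1<<20) // stand-in for resp.Body
	buf, err := readSized(bytes.NewReader(body), int64(len(body)))
	if err != nil {
		fmt.Println("read failed:", err)
		return
	}
	fmt.Println(buf.Len(), buf.Cap())
}
```

With a real response this would be called as readSized(resp.Body, resp.ContentLength). If the server sends no length, the code falls back to the default growth behavior.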

You should also consider recycling your buffers to prevent heap fragmentation: http://blog.cloudflare.com/recycling-memory-buffers-in-go
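
As an illustration of recycling, here is a small sketch that uses sync.Pool from the standard library. Note that this is a substitute technique chosen for brevity, not the approach described in the linked article. The names bufSize, bufPool, getBuffer and putBuffer are made up for the example, and the 16 MiB size is just the figure from the question.

```go
package main

import (
	"bytes"
	"fmt"
	"sync"
)

// bufSize is an assumed per-buffer capacity; tune it to your documents.
const bufSize = 16 << 20

// bufPool hands out pre-allocated buffers and takes them back for reuse.
var bufPool = sync.Pool{
	New: func() interface{} {
		return bytes.NewBuffer(make([]byte, 0, bufSize))
	},
}

// getBuffer returns an empty buffer, reusing a pooled one when available.
func getBuffer() *bytes.Buffer {
	buf := bufPool.Get().(*bytes.Buffer)
	buf.Reset() // drop old contents, keep the allocated capacity
	return buf
}

// putBuffer returns a buffer to the pool once the caller is done with it.
func putBuffer(buf *bytes.Buffer) {
	bufPool.Put(buf)
}

func main() {
	buf := getBuffer()
	buf.WriteString("document bytes go here")
	fmt.Println(buf.Len(), buf.Cap() >= bufSize)
	putBuffer(buf) // do not touch buf after this point
}
```

The pool may drop idle buffers during garbage collection, so it reduces allocations under steady load rather than guaranteeing a fixed set of buffers.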

huangapple
  • Published on April 1, 2014, 04:47:16
  • When reposting, please keep the link to this article: https://go.coder-hub.com/22771854.html