2014-04-01 04:47:16 · go

How to work with large data arrays (over 10MiB) effectively in Go?

Question

I am using Go to download files from one server and, after manipulating them, send them to another server.

File sizes vary from 1MB to 200MB.

Currently my code is pretty simple: I am using http.Client and bytes.Buffer.
Handling the big files (100MB to 200MB) takes a long time, and there are a lot of them.

After some quick profiling, I see that most of the time is spent in bytes.(*Buffer).grow.
How can I create big buffers, for example 16MB?

What can I do to improve the efficiency of my code? Are there any general tips for handling large HTTP requests?

Edit

To explain exactly what I am trying to do:
I have CouchDB documents (with attachments) that I am trying to copy to another CouchDB instance.
The CouchDB documents can be from 30MB to 200MB in size; copying tiny (2 - 10MB) CouchDB documents is really fast.

But sending the document over the wire is really slow.
I am currently profiling, and trying @Evan's answer, to see where my problem is.

Answer 1

Score: 5

Take a look at the description for bytes.NewBuffer: http://golang.org/pkg/bytes/#NewBuffer

Sounds like you can create a 16MB byte slice and use it to initialize the buffer.

Answer 2

Score: 2

Consider that your program has no need to keep the data in memory if all it needs to do is copy it.

A strong feature of Go's standard library is its sensible use of interfaces: the Body member of http.Response implements the io.ReadCloser interface, and that satisfies the type of the body argument of http.Client's Post method.

So you could proceed like this:

Perform a request for the document. You'll get back an instance of http.Response, which has a Body member of type io.ReadCloser.

Note that at this point you haven't actually started receiving the body from the "source" server, because to do that you have to drain the io.ReadCloser in Body.
Initiate another (presumably POST) request to send the data, and when making the request pass it the Body member obtained in the first step.

Once this request is done piping your data, call Close() on that Body member.

Something like this:

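import (
    "io"
    "net/http"
)

// myPostType is a placeholder; set it to the content type the target expects.
const myPostType = "application/octet-stream"

// Pipe streams the body of a GET on from into a POST to to.
func Pipe(from, to string) error {
    src, err := http.Get(from)
    if err != nil {
        return err
    }
    // src.Body is streamed straight into the request; nothing is buffered here.
    dst, err := http.Post(to, myPostType, src.Body)
    if err != nil {
        return err
    }
    defer dst.Body.Close()
    // Drain the response body so the connection can be reused.
    _, err = io.Copy(io.Discard, dst.Body)
    return err
}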

In this code, http.Post reads from src.Body and then calls Close() on it itself.

You might add a bytes.Buffer into the mix in the hope of reducing the number of syscalls performed, but don't do that unless the plain method does not work.
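
If buffering does turn out to be needed, one possible variant is sketched below. It swaps in bufio.Reader instead of bytes.Buffer, since bufio adds a fixed-size read buffer without holding the whole payload. The names pipeBuffered and demo, the 64KiB buffer size, and the application/octet-stream content type are all assumptions for illustration, and the two httptest servers merely stand in for the real source and target. Whether this is any faster than the plain version is something to measure, not assume.

```go
package main

import (
	"bufio"
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
)

// pipeBuffered streams the body of a GET on from into a POST to to,
// reading the source through a bufio.Reader of bufSize bytes.
func pipeBuffered(client *http.Client, from, to, contentType string, bufSize int) error {
	src, err := client.Get(from)
	if err != nil {
		return err
	}
	// The bufio.Reader hides the Close method, so the source body
	// has to be closed here explicitly.
	defer src.Body.Close()
	if src.StatusCode != http.StatusOK {
		return fmt.Errorf("GET %s: unexpected status %s", from, src.Status)
	}

	dst, err := client.Post(to, contentType, bufio.NewReaderSize(src.Body, bufSize))
	if err != nil {
		return err
	}
	defer dst.Body.Close()
	// Drain the response so the connection can be reused.
	_, err = io.Copy(io.Discard, dst.Body)
	return err
}

// demo starts two in-process test servers, pipes payload from one to the
// other, and returns the number of bytes the target received.
func demo(payload string) (int64, error) {
	source := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		io.WriteString(w, payload)
	}))
	defer source.Close()

	got := make(chan int64, 1)
	target := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		n, _ := io.Copy(io.Discard, r.Body)
		got <- n
	}))
	defer target.Close()

	err := pipeBuffered(http.DefaultClient, source.URL, target.URL, "application/octet-stream", 64<<10)
	if err != nil {
		return 0, err
	}
	return <-got, nil
}

func main() {
	n, err := demo("hello, couchdb")
	if err != nil {
		fmt.Println("pipe failed:", err)
		return
	}
	fmt.Println("target received bytes:", n)
}
```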

Answer 3

Score: 1

As @Evan already pointed out, you can choose an initial buffer size when creating a new buffer.

Since allocating buffers is so expensive (this is why your grow calls take so long: they reallocate whenever the data no longer fits), picking the right buffer size is key. The right strategy for buffer allocation depends on a lot of factors. You might choose your own method of growing buffers depending on your application profile.

You should also consider recycling your buffers to prevent heap fragmentation: http://blog.cloudflare.com/recycling-memory-buffers-in-go
