How to work with large data arrays (over 10MiB) effectively in Go?
Question
I am using Go to download files from one server and, after manipulating them, send them to another server.
The file sizes vary from 1MB to 200MB.
Currently, my code is pretty simple: I am using http.Client and bytes.Buffer.
Handling the big files (100MB to 200MB) takes a lot of time, and there are a lot of them.
After a quick profiling run, I see that most of the time is spent in bytes.(*Buffer).grow.
How can I create big buffers, for example 16MB?
What can I do to improve the efficiency of my code? Are there any general tips for handling large HTTP requests?
Edit
I will explain exactly what I am trying to do.
I have CouchDB documents (with attachments) that I am trying to copy to another CouchDB instance.
The CouchDB documents can be from 30MB to 200MB in size; copying tiny (2-10MB) CouchDB documents is really fast.
But sending the document over the wire is really slow.
I am currently trying to profile the code and use @Evan's answer to see where my problem is.
Answer 1
Score: 5
Take a look at the description of bytes.NewBuffer: http://golang.org/pkg/bytes/#NewBuffer
It sounds like you can create a 16MB byte slice and use it to initialize the buffer.
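A minimal sketch of that suggestion follows. The helper name newBigBuffer is hypothetical, and the 16 MiB figure is simply the one from the question. One detail worth noting: the slice is created with length 0 and capacity 16 MiB, because bytes.NewBuffer treats the slice's existing length as buffer contents, so a slice made with length 16 MiB would start out holding 16 MiB of zero bytes.

```go
package main

import (
	"bytes"
	"fmt"
)

// newBigBuffer (hypothetical helper) returns an empty buffer whose backing
// array already has room for size bytes. The slice has length 0 and
// capacity size, so the buffer starts empty rather than full of zeros.
func newBigBuffer(size int) *bytes.Buffer {
	return bytes.NewBuffer(make([]byte, 0, size))
}

func main() {
	const size = 16 << 20 // 16 MiB, the figure from the question
	buf := newBigBuffer(size)
	buf.WriteString("hello")
	// Writes up to size bytes fit in the preallocated array, so the
	// buffer does not need to grow for them.
	fmt.Println(buf.Len(), buf.Cap() >= size)
}
```

If the size of a download is known in advance (for example from the response's Content-Length), using that value as the capacity should avoid the repeated reallocation seen in the profile.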
Answer 2
Score: 2
Consider that your program has no need to keep the data in memory if all it needs to do is copy it.
A strong feature of Go's standard library is its sensible use of interfaces: the Body member of http.Response implements the io.ReadCloser interface, and that satisfies the type of the body argument of http.Client's Post method.
So you could proceed like this:
- Perform a request for the document. You'll get back an instance of http.Response, which has a Body member of type io.ReadCloser.
  Note that at this point you haven't actually started receiving the body from the "source" server, because to do that you have to drain the io.ReadCloser of Body.
- Initiate another (presumably POST) request to send the data, and when making the request supply it the Body member obtained in the first step.
  Once this request is done piping your data, call Close() on that Body member.
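Something like this:
	import (
		"io"
		"net/http"
	)

	// myPostType is a placeholder; set it to the content type the target expects.
	const myPostType = "application/octet-stream"

	func Pipe(from, to string) error {
		src, err := http.Get(from)
		if err != nil {
			return err
		}
		// http.Post streams src.Body to the target and closes it when done.
		dst, err := http.Post(to, myPostType, src.Body)
		if err != nil {
			return err
		}
		defer dst.Body.Close()
		// Drain the response so the connection can be reused.
		_, err = io.Copy(io.Discard, dst.Body)
		return err
	}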
In this code, http.Post will read from src.Body and then Close() it itself.
You might add a bytes.Buffer into the mix in the hope of reducing the number of syscalls performed, but don't do that unless the plain method does not work.
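If the plain method does turn out to need buffering, one possible way to add it is sketched below. This is an assumption rather than part of the answer: it uses bufio.NewReaderSize instead of bytes.Buffer, because a bufio.Reader keeps the data streaming while reading from the underlying source in larger chunks. The helper names buffered and demo and the 1 MiB size are hypothetical, and a strings.Reader stands in for src.Body.

```go
package main

import (
	"bufio"
	"fmt"
	"io"
	"strings"
)

// buffered (hypothetical helper) wraps r so that reads from the underlying
// source happen in chunks of up to size bytes. The data still streams
// through; it is never held in memory as a whole.
func buffered(r io.Reader, size int) io.Reader {
	return bufio.NewReaderSize(r, size)
}

// demo pushes input through a buffered reader and returns what came out.
// The strings.Reader stands in for src.Body from the answer's code.
func demo(input string) (string, error) {
	src := strings.NewReader(input)
	data, err := io.ReadAll(buffered(src, 1<<20)) // 1 MiB is an arbitrary choice
	return string(data), err
}

func main() {
	out, err := demo("attachment bytes")
	fmt.Println(out, err)
}
```

One consequence of this design: the wrapped value is a plain io.Reader, not an io.ReadCloser, so http.Post can no longer close the original body. Code that passes buffered(src.Body, ...) to http.Post would have to call src.Body.Close() itself.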
Answer 3
Score: 1
As @Evan already pointed out, you can choose an initial buffer size when creating a new buffer.
Since allocating buffers is so expensive (this is why your grow calls take so long: they reallocate whenever the data no longer fits), picking the right buffer size is key. The right buffer allocation strategy depends on a lot of factors. You might choose your own method of growing buffers depending on your application's profile.
You should also consider recycling your buffers to prevent heap fragmentation: http://blog.cloudflare.com/recycling-memory-buffers-in-go
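A minimal sketch of buffer recycling follows. It uses the standard library's sync.Pool, which is a swapped-in technique and not necessarily the approach described in the linked article. The names bufSize, bufPool, withBuffer and process are hypothetical.

```go
package main

import (
	"bytes"
	"fmt"
	"sync"
)

// bufSize is a hypothetical starting capacity; tune it to your documents.
const bufSize = 16 << 20

// bufPool hands out preallocated buffers and takes them back for reuse.
var bufPool = sync.Pool{
	New: func() interface{} {
		return bytes.NewBuffer(make([]byte, 0, bufSize))
	},
}

// withBuffer lends fn an empty pooled buffer and returns the buffer to the
// pool afterwards. fn must not keep a reference to buf after it returns.
func withBuffer(fn func(buf *bytes.Buffer) error) error {
	buf := bufPool.Get().(*bytes.Buffer)
	buf.Reset() // discard whatever a previous user left behind
	defer bufPool.Put(buf)
	return fn(buf)
}

// process is a stand-in for real work: it copies payload into a pooled
// buffer and reports how many bytes the buffer then holds.
func process(payload string) (int, error) {
	n := 0
	err := withBuffer(func(buf *bytes.Buffer) error {
		buf.WriteString(payload)
		n = buf.Len()
		return nil
	})
	return n, err
}

func main() {
	for _, p := range []string{"first document", "second"} {
		n, err := process(p)
		fmt.Println(n, err)
	}
}
```

The pool may drop idle buffers during garbage collection, so this reduces allocations rather than guaranteeing a fixed set of buffers.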