Can I stream data from a writer to a reader in golang?

Question
I want to process a number of files whose contents don't fit in the memory of my worker. The solution I found so far involves saving the results of the processing to the /tmp directory before uploading them to S3.
import (
	"bufio"
	"bytes"
	"context"
	"fmt"
	"log"
	"os"
	"runtime"
	"strings"
	"sync"

	"github.com/aws/aws-sdk-go-v2/service/s3"
	"github.com/korovkin/limiter"
	"github.com/xitongsys/parquet-go/parquet"
	"github.com/xitongsys/parquet-go/writer"
)

func DownloadWarc(
	ctx context.Context,
	s3Client *s3.Client,
	warcs []*types.Warc,
	path string,
) error {
	key := fmt.Sprintf("parsed_warc/%s.parquet", path)
	filename := fmt.Sprintf("/tmp/%s", path)

	file, err := os.Create(filename)
	if err != nil {
		return fmt.Errorf("error creating file: %s", err)
	}
	defer file.Close()

	bytesWriter := bufio.NewWriter(file)

	pw, err := writer.NewParquetWriterFromWriter(bytesWriter, new(Page), 4)
	if err != nil {
		return fmt.Errorf("Can't create parquet writer: %s", err)
	}

	pw.RowGroupSize = 128 * 1024 * 1024 //128M
	pw.CompressionType = parquet.CompressionCodec_SNAPPY
	mutex := sync.Mutex{}

	numWorkers := runtime.NumCPU() * 2
	fmt.Printf("Using %d workers\n", numWorkers)
	limit := limiter.NewConcurrencyLimiter(numWorkers)

	for i, warc := range warcs {
		limit.Execute(func() {
			log.Printf("%d: %+v", i, warc)

			body, err := GetWarc(ctx, s3Client, warc)
			if err != nil {
				fmt.Printf("error getting warc: %s", err)
				return
			}

			page, err := Parse(body)
			if err != nil {
				key := fmt.Sprintf("unparsed_warc/%s.warc", path)
				s3Client.PutObject(
					ctx,
					&s3.PutObjectInput{
						Body:   bytes.NewReader(body),
						Bucket: &s3Record.Bucket.Name,
						Key:    &key,
					},
				)
				fmt.Printf("error getting page %s: %s", key, err)
				return
			}

			mutex.Lock()
			err = pw.Write(page)
			pw.Flush(true)
			mutex.Unlock()
			if err != nil {
				fmt.Printf("error writing page: %s", err)
				return
			}
		})
	}

	limit.WaitAndClose()

	err = pw.WriteStop()
	if err != nil {
		return fmt.Errorf("error writing stop: %s", err)
	}

	bytesWriter.Flush()

	file.Seek(0, 0)
	_, err = s3Client.PutObject(
		ctx,
		&s3.PutObjectInput{
			Body:   file,
			Bucket: &s3Record.Bucket.Name,
			Key:    &key,
		},
	)
	if err != nil {
		return fmt.Errorf("error uploading warc: %s", err)
	}
	return nil
}
Is there a way to avoid saving the contents to a temp file and instead use only a limited-size byte buffer between the writer and the upload function?

In other words, can I begin streaming data to a reader while still writing to the same buffer?
Answer 1

Score: 1
Yes, there is a way to write the same content to multiple writers. Using io.MultiWriter might allow you to avoid the temp file, although it might still be a good idea to use one.

I often use io.MultiWriter to write to a list of checksum (sha256, ...) calculators. In fact, the last time I read the S3 client code, I noticed it does this under the hood to calculate the checksum. MultiWriter is pretty useful for piping big files between cloud services.
Also, if you do end up using temp files, you may want to create them with os.CreateTemp. If you don't, you may run into file-name collisions when your code runs in two processes or when your files share the same name.
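As a rough sketch of that (the name pattern here is just an illustration), os.CreateTemp picks a unique path for you, so concurrent workers don't clobber each other's files:

package main

import (
	"fmt"
	"log"
	"os"
)

func main() {
	// An empty dir means os.TempDir(); the "*" in the pattern is replaced
	// with a random string, so concurrent runs get distinct paths.
	tmp, err := os.CreateTemp("", "parsed_warc_*.parquet")
	if err != nil {
		log.Fatal(err)
	}
	defer os.Remove(tmp.Name()) // clean up once the upload has finished
	defer tmp.Close()

	fmt.Println("writing to", tmp.Name())
	// ... write the parquet output here, then hand tmp to the S3 upload ...
}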
Feel free to clarify your question, and I can try to answer again.