Importing and parsing a large CSV file with Go and App Engine's Datastore

Question

Locally I am successfully able to (in a task):

  • Open the CSV
  • Scan through each line (using Scanner.Scan)
  • Map the parsed CSV line to my desired struct
  • Save the struct to datastore

I see that blobstore has a reader that would allow me to read the value directly using a streaming, file-like interface -- but that seems to have a limit of 32 MB. I also see there's a bulk upload tool -- bulk_uploader.py -- but it won't do all the data massaging I require, and I'd like to limit the writes (and, really, the cost) of this bulk insert.

How would one effectively read and parse a very large (500 MB+) CSV file without the benefit of reading from local storage?
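
Roughly, the local flow described above can be sketched like this; the Record struct, its fields, and the saveToDatastore helper are hypothetical stand-ins, not code from the original post:

```go
// Minimal sketch of the local pipeline: open the CSV, scan it line by line,
// map each record to a struct, and hand it off for persistence.
package main

import (
	"bufio"
	"fmt"
	"log"
	"os"
	"strings"
)

// Record is a hypothetical struct standing in for the asker's own type.
type Record struct {
	Name  string
	Email string
}

func main() {
	f, err := os.Open("data.csv") // assumed local file name
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	scanner := bufio.NewScanner(f)
	for scanner.Scan() {
		// Naive split; encoding/csv is the safer choice if fields can be quoted.
		fields := strings.Split(scanner.Text(), ",")
		if len(fields) < 2 {
			continue
		}
		rec := Record{Name: fields[0], Email: fields[1]}
		saveToDatastore(rec) // placeholder for the datastore write
	}
	if err := scanner.Err(); err != nil {
		log.Fatal(err)
	}
}

// saveToDatastore is a stub; the real code would use the datastore client.
func saveToDatastore(r Record) {
	fmt.Printf("would save: %+v\n", r)
}
```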

Answer 1

Score: 2

You will need to look at the following options and see if one works for you:

  1. Given the large file size, you should consider using Google Cloud Storage for the file. You can use the command-line utilities that GCS provides to upload the file to your bucket. Once it is uploaded, you can use the JSON API directly to work with the file and import it into your datastore layer (a sketch of this approach appears after this list). Take a look at the following: https://developers.google.com/storage/docs/json_api/v1/json-api-go-samples

  2. If this is a one-time import of a large file, another option is to spin up a Google Compute Engine VM, write an app there that reads from GCS and passes the data in smaller chunks to a service running on App Engine Go, which can then accept and persist the data.
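
As a rough illustration of the first option, the sketch below streams the object from a GCS bucket and batches the parsed rows into Datastore. It uses the current cloud.google.com/go client libraries rather than the App Engine packages available when this was asked, and the project ID, bucket, object, Row struct, and entity kind are all assumptions:

```go
// Sketch: stream the CSV straight from Cloud Storage and batch the parsed
// rows into Datastore, so nothing is staged on local disk.
package main

import (
	"bufio"
	"context"
	"log"
	"strings"

	"cloud.google.com/go/datastore"
	"cloud.google.com/go/storage"
)

// Row is a hypothetical entity type.
type Row struct {
	Name  string
	Email string
}

func main() {
	ctx := context.Background()

	gcs, err := storage.NewClient(ctx)
	if err != nil {
		log.Fatal(err)
	}
	ds, err := datastore.NewClient(ctx, "my-project") // assumed project ID
	if err != nil {
		log.Fatal(err)
	}

	// Open a streaming reader on the object in the bucket.
	r, err := gcs.Bucket("my-bucket").Object("big.csv").NewReader(ctx)
	if err != nil {
		log.Fatal(err)
	}
	defer r.Close()

	var keys []*datastore.Key
	var rows []*Row
	scanner := bufio.NewScanner(r)
	for scanner.Scan() {
		fields := strings.Split(scanner.Text(), ",") // naive CSV split
		if len(fields) < 2 {
			continue
		}
		keys = append(keys, datastore.IncompleteKey("Row", nil))
		rows = append(rows, &Row{Name: fields[0], Email: fields[1]})

		// Flush in batches to keep each PutMulti call small.
		if len(rows) == 500 {
			if _, err := ds.PutMulti(ctx, keys, rows); err != nil {
				log.Fatal(err)
			}
			keys, rows = nil, nil
		}
	}
	if len(rows) > 0 {
		if _, err := ds.PutMulti(ctx, keys, rows); err != nil {
			log.Fatal(err)
		}
	}
	if err := scanner.Err(); err != nil {
		log.Fatal(err)
	}
}
```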

Answer 2

Score: 1

Not the solution I hoped for, but I ended up splitting the large file into 32 MB pieces, uploading each to blob storage, and then parsing each one in a task.

It ain't pretty, but it took less time than the other options.
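
A minimal sketch of the splitting step, assuming the pieces are cut at line boundaries so each one parses cleanly on its own; the file names and chunk-naming scheme are made up, and the upload-to-blobstore step is omitted:

```go
// Split a large CSV into pieces no larger than 32 MB, breaking only at
// line boundaries so each piece can be parsed independently.
package main

import (
	"bufio"
	"fmt"
	"log"
	"os"
)

const maxChunk = 32 << 20 // 32 MB

func main() {
	f, err := os.Open("big.csv") // assumed input file
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	var (
		out   *os.File
		size  int
		piece int
	)
	newPiece := func() {
		if out != nil {
			out.Close()
		}
		piece++
		out, err = os.Create(fmt.Sprintf("big_%03d.csv", piece))
		if err != nil {
			log.Fatal(err)
		}
		size = 0
	}
	newPiece()

	// Note: bufio.Scanner's default 64 KB line limit is assumed to be enough.
	scanner := bufio.NewScanner(f)
	for scanner.Scan() {
		line := scanner.Text() + "\n"
		// Start a new piece before this line would push us past the limit.
		if size+len(line) > maxChunk {
			newPiece()
		}
		if _, err := out.WriteString(line); err != nil {
			log.Fatal(err)
		}
		size += len(line)
	}
	out.Close()
	if err := scanner.Err(); err != nil {
		log.Fatal(err)
	}
}
```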
