Importing and parsing a large CSV file with Go and App Engine's datastore

Question


Locally I am successfully able to (in a task; see the sketch after this list):

  • Open the CSV
  • Scan through each line (using Scanner.Scan)
  • Map the parsed CSV line to my desired struct
  • Save the struct to datastore
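
For reference, a minimal sketch of that local flow; the file name, the Row struct, and the commented-out datastore call are illustrative placeholders, not the actual code from the task:

```go
package main

import (
	"bufio"
	"encoding/csv"
	"io"
	"log"
	"os"
	"strings"
)

// Row is illustrative; the real struct mirrors whatever columns the CSV has.
type Row struct {
	Name  string
	Value string
}

func main() {
	f, err := os.Open("data.csv") // hypothetical local file
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	scanner := bufio.NewScanner(f)
	for scanner.Scan() {
		// Parse the single line as CSV; encoding/csv handles quoting.
		fields, err := csv.NewReader(strings.NewReader(scanner.Text())).Read()
		if err == io.EOF {
			continue // blank line
		}
		if err != nil {
			log.Printf("skipping bad line: %v", err)
			continue
		}
		if len(fields) < 2 {
			continue
		}
		row := Row{Name: fields[0], Value: fields[1]}
		_ = row // in the App Engine task this would be a datastore.Put(ctx, key, &row)
	}
	if err := scanner.Err(); err != nil {
		log.Fatal(err)
	}
}
```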

I see that Blobstore has a reader that would allow me to read the value directly using a streaming, file-like interface, but that seems to have a limit of 32MB. I also see there's a bulk upload tool, bulk_uploader.py, but it won't do all the data massaging I require, and I'd like to limit the writes (and really the cost) of this bulk insert.

How would one effectively read and parse a very large (500MB+) CSV file without the benefit of reading from local storage?

Answer 1

Score: 2

你需要查看以下选项,并看看是否适合你:

  1. 鉴于文件大小较大,你应该考虑使用Google Cloud Storage来存储该文件。你可以使用GCS提供的命令行工具将文件上传到你的存储桶中。一旦上传完成,你可以直接使用JSON API来处理该文件,并将其导入到你的数据存储层中。请参考以下链接:https://developers.google.com/storage/docs/json_api/v1/json-api-go-samples

  2. 如果这只是一次性导入一个大文件,另一个选项是启动一个Google Compute VM,在那里编写一个应用程序从GCS中读取数据,并通过较小的块传递给在App Engine Go中运行的服务,然后该服务可以接受并持久化数据。

英文:

You will need to look at the following options and see if one of them works for you:

  1. Given the large file size, you should consider using Google Cloud Storage for the file. You can use the command-line utilities that GCS provides to upload your file to your bucket. Once uploaded, you can look at using the JSON API directly to work with the file and import it into your datastore layer (a streaming-read sketch follows this list). Take a look at the following: https://developers.google.com/storage/docs/json_api/v1/json-api-go-samples

  2. If this is a one-time import of a large file, another option could be spinning up a Google Compute Engine VM, writing an app there to read from GCS and pass the data on in smaller chunks to a service running in App Engine Go, which can then accept and persist the data (a sketch of such an ingest handler also follows this list).
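
To illustrate option 1, here is a minimal sketch of streaming a large CSV straight from GCS instead of loading 500MB into memory. It uses the current cloud.google.com/go/storage client rather than the 2014-era JSON API samples linked above, and the bucket and object names are placeholders:

```go
package main

import (
	"context"
	"encoding/csv"
	"io"
	"log"

	"cloud.google.com/go/storage"
)

func main() {
	ctx := context.Background()

	// Open a streaming reader on the object; nothing is buffered in full.
	client, err := storage.NewClient(ctx)
	if err != nil {
		log.Fatal(err)
	}
	defer client.Close()

	rc, err := client.Bucket("my-bucket").Object("big-import.csv").NewReader(ctx)
	if err != nil {
		log.Fatal(err)
	}
	defer rc.Close()

	r := csv.NewReader(rc)
	for {
		record, err := r.Read()
		if err == io.EOF {
			break
		}
		if err != nil {
			log.Printf("skipping bad record: %v", err)
			continue
		}
		_ = record // map to a struct and batch the datastore writes here
	}
}
```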

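For option 2, the App Engine side might look roughly like the handler below: it accepts one small JSON chunk of rows per request and persists them with a single batched PutMulti call to keep write overhead down. The handler path, entity kind, and Row struct are all placeholders.

```go
package rowsvc

import (
	"encoding/json"
	"net/http"

	"google.golang.org/appengine"
	"google.golang.org/appengine/datastore"
)

// Row is a placeholder entity; real fields depend on the CSV schema.
type Row struct {
	Name  string
	Value string
}

func init() {
	http.HandleFunc("/ingest", ingest)
}

// ingest persists one chunk of rows per request with a single batched write.
func ingest(w http.ResponseWriter, r *http.Request) {
	ctx := appengine.NewContext(r)

	var rows []Row
	if err := json.NewDecoder(r.Body).Decode(&rows); err != nil {
		http.Error(w, err.Error(), http.StatusBadRequest)
		return
	}

	keys := make([]*datastore.Key, len(rows))
	for i := range rows {
		keys[i] = datastore.NewIncompleteKey(ctx, "Row", nil)
	}
	if _, err := datastore.PutMulti(ctx, keys, rows); err != nil {
		http.Error(w, err.Error(), http.StatusInternalServerError)
		return
	}
	w.WriteHeader(http.StatusOK)
}
```
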
Answer 2

Score: 1

Not the solution I hoped for, but I ended up splitting the large files into 32MB pieces, uploading each to blob storage, then parsing each in a task.

It ain't pretty, but it took less time than the other options.
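
A rough sketch of that splitting step, assuming the input can be split on line boundaries (quoted fields containing newlines would need smarter handling); the input file name is a placeholder and the 32MB constant mirrors the blobstore limit mentioned above:

```go
package main

import (
	"bufio"
	"fmt"
	"log"
	"os"
)

const maxChunkBytes = 32 << 20 // stay under the 32MB blobstore limit

// splitCSV writes pieces of the input that each stay under maxChunkBytes,
// cutting only on line boundaries so no CSV row is split in half.
func splitCSV(path string) error {
	in, err := os.Open(path)
	if err != nil {
		return err
	}
	defer in.Close()

	scanner := bufio.NewScanner(in)
	scanner.Buffer(make([]byte, 1024*1024), 1024*1024) // allow long lines

	var out *os.File
	var written, part int
	for scanner.Scan() {
		line := scanner.Text() + "\n"
		// Start a new part file when the current one would overflow.
		if out == nil || written+len(line) > maxChunkBytes {
			if out != nil {
				out.Close()
			}
			part++
			out, err = os.Create(fmt.Sprintf("%s.part%03d", path, part))
			if err != nil {
				return err
			}
			written = 0
		}
		n, err := out.WriteString(line)
		if err != nil {
			return err
		}
		written += n
	}
	if out != nil {
		out.Close()
	}
	return scanner.Err()
}

func main() {
	if err := splitCSV("big-import.csv"); err != nil { // placeholder file name
		log.Fatal(err)
	}
}
```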
