Write a CSV from a pandas DataFrame to Google Cloud Storage in chunks

Question

I tried the following:
- gs_path and temp_gs_path are both GCS paths of the form "gs://bucket/file.csv".
- The code runs from within Cloud Run and has access to Cloud Storage.

```python
header = True
to_csv_mode = 'w'
with pd.read_csv(gs_path, chunksize=100000) as reader:
    for r in reader:
        r.to_csv(temp_gs_path, index=False, header=header, mode=to_csv_mode)
        header = False
        to_csv_mode = 'a'
```

But the file created in the GCS bucket is always overwritten instead of being appended to after the first chunk (to_csv_mode = 'a' is ignored), so I end up with only the last chunk in the file.


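Answer 1

Score: 1

Google Cloud Storage is the object storage service in Google Cloud. An object is an immutable piece of data consisting of a file of any format.

According to the official documentation:

> Objects are immutable, which means that an uploaded object cannot change throughout its storage lifetime. An object's storage lifetime is the time between successful object creation, such as uploading, and successful object deletion. In practice, this means that you cannot make incremental changes to objects, such as append operations or truncate operations. However, it is possible to replace objects that are stored in Cloud Storage, and doing so happens atomically: until the new upload completes, the old version of the object is served to readers, and after the upload completes the new version of the object is served to readers. So a single replacement operation simply marks the end of one immutable object's lifetime and the beginning of a new immutable object's lifetime.

This means that Google Cloud Storage does not support appending. If you write to an existing object name, the existing object is always replaced.

As a workaround, you can use the compose operation: write each chunk to its own temporary object, compose those objects into a single new object, and then delete the temporary objects.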
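The compose workaround could be sketched roughly as follows. This is a minimal illustration, not a definitive implementation: `write_chunks_via_compose` and `copy_csv_in_chunks` are hypothetical helper names, the temporary object naming scheme is made up, and the code assumes the `google-cloud-storage` client library (`Bucket.blob`, `Blob.upload_from_string`, `Blob.compose`, `Blob.delete`). A single compose request accepts at most 32 source objects, so larger inputs are folded in batches with the destination reused as the first source.

```python
import uuid


def write_chunks_via_compose(chunks, bucket, dest_name, max_sources=32):
    """Upload each chunk as a temporary object, then compose them into dest_name.

    Hypothetical helper: `chunks` is any iterable of DataFrames and `bucket`
    is assumed to be a google.cloud.storage.Bucket.
    """
    prefix = f"{dest_name}.tmp-{uuid.uuid4().hex}"  # made-up naming scheme
    parts = []
    for i, chunk in enumerate(chunks):
        part = bucket.blob(f"{prefix}/part-{i:06d}.csv")
        # Only the first part carries the header row.
        part.upload_from_string(
            chunk.to_csv(index=False, header=(i == 0)), content_type="text/csv"
        )
        parts.append(part)

    dest = bucket.blob(dest_name)
    if not parts:
        return dest  # nothing was read, so nothing is written

    dest.content_type = "text/csv"
    # One compose call takes at most 32 sources: compose the first batch,
    # then keep folding further batches onto the destination itself.
    dest.compose(parts[:max_sources])
    for start in range(max_sources, len(parts), max_sources - 1):
        dest.compose([dest] + parts[start:start + max_sources - 1])

    # The temporary objects are no longer needed once composed.
    for part in parts:
        part.delete()
    return dest


def copy_csv_in_chunks(gs_path, bucket_name, dest_name, chunksize=100_000):
    """Example driver mirroring the question's loop (names are assumptions)."""
    # Imported lazily so the helper above does not depend on these packages.
    import pandas as pd
    from google.cloud import storage

    bucket = storage.Client().bucket(bucket_name)
    with pd.read_csv(gs_path, chunksize=chunksize) as reader:
        return write_chunks_via_compose(reader, bucket, dest_name)
```

Note that `dest_name` here is an object name inside the bucket (for example `file.csv`), not a full `gs://` URL. With more than 32 chunks the destination object briefly exists with partial content between compose calls, so readers that need an all-or-nothing view may prefer composing into a temporary name first.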

huangapple
  • Posted on March 7, 2023 at 10:34:34
  • When reposting, please keep the link to this article: https://go.coder-hub.com/75657569.html