Add partition columns of Parquet files from Google Cloud Storage to BigQuery

Question

I have Parquet files stored in a Google Cloud Storage Bucket with paths such as:
gs://some_storage/files.parquet/category=abc/type=xyz/partition.parquet

Each parquet file has the fields:
{'date':'2023-03-01','value': 2.718}

I am loading these fields into BigQuery and need to include the partition columns, i.e. category and type, in the final table, so that each event has the fields:
{'date':'2023-03-01','value': 2.718, 'category': 'abc', 'type': 'xyz'}

Currently I'm iterating over the directory gs://some_storage/files.parquet, extracting the category and type partitions from the paths with a regexp, appending those values to the Parquet data as it is read, and inserting into BigQuery.

There must be a better way, since this form of partitioning is standard with Parquet files. Is there any method, either via pyarrow or Google Cloud services, that reads the partitions in directly without having to iterate over the paths and use a regexp? Or, better still, is there any way I can end up with the data in a BigQuery table that includes the category and type columns?
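
To illustrate what I mean by reading the partitions directly: I was hoping for something along the lines of the sketch below, where pyarrow's dataset API would discover the hive-style category and type partitions on its own (untested, just to show the kind of API I'm after):

import gcsfs
import pyarrow.dataset as ds

fs = gcsfs.GCSFileSystem(project='gcs-project')

# hoped-for approach: let pyarrow discover the hive-style partitions itself
dataset = ds.dataset('some_storage/files.parquet', filesystem=fs,
                     format='parquet', partitioning='hive')
df = dataset.to_table().to_pandas()  # would contain date, value, category and type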

Thank you in advance for any help.

My current method looks like this:

import re
import gcsfs
import pyarrow.parquet as pq

fs = gcsfs.GCSFileSystem(project='gcs-project')

# extract the file paths by walking the category=*/type=* directories
paths = []
root_dir = 'gs://some_storage/files.parquet'
category_paths = fs.ls(root_dir)
for category_path in category_paths:
    type_paths = fs.ls(category_path)
    for type_path in type_paths:
        [file_path] = fs.ls(type_path)
        paths.append(file_path)


# read each file and append the partition columns parsed from its path
for path in paths:
    category = re.search(r'category=(.*?)/', path).group(1)
    type_ = re.search(r'type=(.*?)/', path).group(1)

    df = pq.ParquetDataset(path, filesystem=fs).read().to_pandas()

    # append the category and type parsed from the path
    df['category'] = category
    df['type'] = type_

# finally insert to BigQuery
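
The final insert could be done, for example, with the google-cloud-bigquery client; a minimal sketch of that last step (the destination table name below is just a placeholder):

from google.cloud import bigquery

# illustrative last step; 'my-project.my_dataset.events' is a placeholder table
bq_client = bigquery.Client(project='gcs-project')
load_job = bq_client.load_table_from_dataframe(df, 'my-project.my_dataset.events')
load_job.result()  # wait for the load job to complete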

Answer 1

Score: 2

> There must be a better way, since this form of partitioning is standard with Parquet files.

One possible option is to use an external table. Assuming that you have Parquet files stored in the paths below:

[screenshot: the Parquet files listed in the GCS bucket under gs://some_storage/files.parquet/]

You can create an external table for Hive partitioned data stored in GCS above.

CREATE OR REPLACE EXTERNAL TABLE `your-project.your-dataset.stg_table` 
  WITH PARTITION COLUMNS (
    category STRING,
    type STRING
  )
OPTIONS (
  uris=['gs://some_storage/files.parquet/*'],
  format='PARQUET',
  hive_partition_uri_prefix = 'gs://some_storage/files.parquet',
  require_hive_partition_filter = false
);

If you query the table,

SELECT * FROM `your-project.your-dataset.stg_table`;

you will get the result below:

[screenshot: the query result, showing the date and value fields from the Parquet files along with the category and type partition columns]

You can submit the above SQL scripts to BigQuery via the Python BigQuery SDK.
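
For example, a minimal sketch using the google-cloud-bigquery client (the project, dataset and table names are the same placeholders as in the DDL above):

from google.cloud import bigquery

client = bigquery.Client(project='your-project')

# create (or replace) the external table, then query it
ddl = """
CREATE OR REPLACE EXTERNAL TABLE `your-project.your-dataset.stg_table`
  WITH PARTITION COLUMNS (
    category STRING,
    type STRING
  )
OPTIONS (
  uris=['gs://some_storage/files.parquet/*'],
  format='PARQUET',
  hive_partition_uri_prefix = 'gs://some_storage/files.parquet',
  require_hive_partition_filter = false
);
"""
client.query(ddl).result()

rows = client.query(
    "SELECT * FROM `your-project.your-dataset.stg_table`"
).result()
for row in rows:
    print(dict(row))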

