Add partition columns of Parquet files from Google Cloud Storage to BigQuery


Question


I have Parquet files stored in a Google Cloud Storage Bucket with paths such as:
gs://some_storage/files.parquet/category=abc/type=xyz/partition.parquet

Each Parquet file has the fields:
{'date':'2023-03-01','value': 2.718}

I am loading these fields into BigQuery and need to include the partition columns, i.e. category and type, in the final table, so that each event would have the fields:
{'date':'2023-03-01','value': 2.718, 'category': 'abc', 'type': 'xyz'}

Currently I'm iterating over the directory gs://some_storage/files.parquet, extracting the category and type partitions from the paths with a regexp, appending the values to the Parquet data at read time, and inserting into BigQuery.

There must be a better way, since this form of partitioning is standard with Parquet files. Is there any method, either via pyarrow or Google Cloud services, that reads the partitions directly without having to iterate over paths and use a regexp? Or, better, is there any way to end up with the data in a BigQuery table that includes the category and type columns?
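
For reference on the pyarrow side: pyarrow's dataset API can discover Hive-style partitions such as category=/type= by itself, so the partition values come back as ordinary columns without any path parsing. A minimal sketch of that, assuming gcsfs supplies the filesystem as in the code below:

import gcsfs
import pyarrow.dataset as ds

fs = gcsfs.GCSFileSystem(project='gcs-project')

# 'hive' partitioning tells pyarrow to parse category=.../type=... directory
# names into columns; gcsfs paths can be given without the gs:// scheme
dataset = ds.dataset(
    'some_storage/files.parquet',
    format='parquet',
    filesystem=fs,
    partitioning='hive',
)

table = dataset.to_table()  # columns: date, value, category, type
df = table.to_pandas()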

Thank you in advance for any help.

My current method looks like this:

import re
import gcsfs
import pyarrow.parquet as pq

fs = gcsfs.GCSFileSystem(project='gcs-project')

# extract the individual file paths by walking the category=*/type=* directories
paths = []
root_dir = 'gs://some_storage/files.parquet'
category_paths = fs.ls(root_dir)  # fs.ls() takes the path positionally
for category_path in category_paths:
    type_paths = fs.ls(category_path)
    for type_path in type_paths:
        [file_path] = fs.ls(type_path)  # one partition.parquet per leaf directory
        paths.append(file_path)


# read each file and append the partition columns parsed from its path
for path in paths:
    category = re.search(r'category=(.*?)/', path).group(1)
    type_ = re.search(r'type=(.*?)/', path).group(1)

    # read the parquet file into a pandas DataFrame
    df = pq.read_table(path, filesystem=fs).to_pandas()

    # append the category and type on the df
    df['category'] = category
    df['type'] = type_

# finally insert to bigquery
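
For completeness, the last step ("finally insert to bigquery") could look like this with the BigQuery Python client; a minimal sketch, assuming a destination table your-project.your_dataset.events (purely illustrative) and df being one of the DataFrames built above:

from google.cloud import bigquery

client = bigquery.Client(project='your-project')  # placeholder project id

# append one per-partition DataFrame to the destination table
job = client.load_table_from_dataframe(
    df,
    'your-project.your_dataset.events',  # hypothetical destination table
    job_config=bigquery.LoadJobConfig(write_disposition='WRITE_APPEND'),
)
job.result()  # wait for the load job to complete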

Answer 1

Score: 2


>There must be a better way since this form of partitioning is standard with parquet files.

One possible option is to use an external table. Assuming that you have Parquet files stored in the paths below:

[screenshot: Parquet files in GCS under gs://some_storage/files.parquet/category=.../type=.../]

You can create an external table for the Hive-partitioned data stored in GCS above:

CREATE OR REPLACE EXTERNAL TABLE `your-project.your-dataset.stg_table` 
  WITH PARTITION COLUMNS (
    category STRING,
    type STRING
  )
OPTIONS (
  uris=['gs://some_storage/files.parquet/*'],
  format='PARQUET',
  hive_partition_uri_prefix = 'gs://some_storage/files.parquet',
  require_hive_partition_filter = false
);

If you query the table,

SELECT * FROM `your-project.your-dataset.stg_table`;

you will get the result below:

[screenshot: query result showing the date and value fields together with the category and type partition columns]

You can submit the above SQL scripts to BigQuery via the Python BigQuery SDK.
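
A minimal sketch of that, reusing the placeholder project and dataset names from the SQL above:

from google.cloud import bigquery

client = bigquery.Client(project='your-project')  # placeholder project id

ddl = """
CREATE OR REPLACE EXTERNAL TABLE `your-project.your-dataset.stg_table`
  WITH PARTITION COLUMNS (
    category STRING,
    type STRING
  )
OPTIONS (
  uris=['gs://some_storage/files.parquet/*'],
  format='PARQUET',
  hive_partition_uri_prefix = 'gs://some_storage/files.parquet',
  require_hive_partition_filter = false
);
"""

# create (or replace) the external table, then query it
client.query(ddl).result()
rows = client.query(
    'SELECT * FROM `your-project.your-dataset.stg_table`'
).result()
for row in rows:
    print(dict(row))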

See also
