Add partition columns of Parquet files from Google Cloud Storage to BigQuery
Question
I have Parquet files stored in a Google Cloud Storage bucket with paths such as:
gs://some_storage/files.parquet/category=abc/type=xyz/partition.parquet
Each Parquet file has the fields:
{'date':'2023-03-01','value': 2.718}
I am loading these fields to BigQuery and need to include the partition columns, i.e. category and type, in the final table, so that each event would have the fields:
{'date':'2023-03-01','value': 2.718, 'category': 'abc', 'type': 'xyz'}
Currently I'm iterating over the directory gs://some_storage/files.parquet, extracting the category and type partitions from the paths with a regexp, appending the values to the Parquet data at read time, and inserting into BigQuery.
There must be a better way since this form of partitioning is standard with Parquet files. Is there any method, either via pyarrow or Google Cloud services, that will read in the partitions directly without having to iterate over paths and use a regexp? Or, better, is there any way I can end up with the data in a BigQuery table that includes the category and type columns?
Thank you in advance for any help.
My current method looks like this:
import re

import gcsfs
import pyarrow.parquet as pq

fs = gcsfs.GCSFileSystem(project='gcs-project')

# extract paths: one parquet file per category=/type= directory
paths = []
root_dir = 'gs://some_storage/files.parquet'
category_paths = fs.ls(root_dir)
for category_path in category_paths:
    type_paths = fs.ls(category_path)
    for type_path in type_paths:
        # each type= directory holds exactly one parquet file
        [file_path] = fs.ls(type_path)
        paths.append(file_path)

# read each file and append the partition columns parsed from its path
for path in paths:
    category = re.search(r'category=(.*?)/', path).group(1)
    type_ = re.search(r'type=(.*?)/', path).group(1)
    df = pq.ParquetDataset(path, filesystem=fs).read().to_pandas()
    # append the category and type on the df
    df['category'] = category
    df['type'] = type_
    # finally insert to bigquery (sketched below)
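For reference, the insert step elided by the last comment could use the google-cloud-bigquery client; the following is only a rough sketch, and the destination table name is a hypothetical placeholder:

from google.cloud import bigquery

client = bigquery.Client(project='gcs-project')
# hypothetical destination table; replace with your own project.dataset.table
table_id = 'gcs-project.your_dataset.events'

# load_table_from_dataframe appends the DataFrame, including the added
# category and type columns, to the destination table
load_job = client.load_table_from_dataframe(df, table_id)
load_job.result()  # wait for the load job to complete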
Answer 1
Score: 2
> There must be a better way since this form of partitioning is standard with Parquet files.

One possible option is to use an external table. Assuming that your Parquet files are stored under paths like the one in the question (gs://some_storage/files.parquet/category=abc/type=xyz/partition.parquet), you can create an external table for the Hive-partitioned data stored in GCS:
CREATE OR REPLACE EXTERNAL TABLE `your-project.your-dataset.stg_table`
WITH PARTITION COLUMNS (
  category STRING,
  type STRING
)
OPTIONS (
  uris = ['gs://some_storage/files.parquet/*'],
  format = 'PARQUET',
  hive_partition_uri_prefix = 'gs://some_storage/files.parquet',
  require_hive_partition_filter = false
);
If you query the table,
SELECT * FROM `your-project.your-dataset.stg_table`;
you will get each file's date and value fields together with the category and type partition columns in the result.
You can submit the above SQL scripts to BigQuery via the Python BigQuery SDK.
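For example, a minimal sketch with the google-cloud-bigquery client, re-using the placeholder project and dataset names from the DDL above:

from google.cloud import bigquery

client = bigquery.Client(project='your-project')

ddl = """
CREATE OR REPLACE EXTERNAL TABLE `your-project.your-dataset.stg_table`
WITH PARTITION COLUMNS (category STRING, type STRING)
OPTIONS (
  uris = ['gs://some_storage/files.parquet/*'],
  format = 'PARQUET',
  hive_partition_uri_prefix = 'gs://some_storage/files.parquet',
  require_hive_partition_filter = false
);
"""

# DDL statements run as ordinary query jobs; result() waits for completion
client.query(ddl).result()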
See also
Comments