How to export BQ tables to GCS in JSON format without a change in encoding

Question

I exported a BQ table to GCS in JSON format using Python. The export succeeded, but when I downloaded the JSON files from GCS, I noticed that special characters had changed. For example,

Shirt & Trouser Presses

in BQ has changed to

Shirt \u0026 Trouser Presses

in GCS.

Is there a way to ensure that the encoding does not change while exporting from BQ to GCS in JSON format?

Here is the code snippet I use:

from google.cloud import bigquery

# BQ_PROJECT, dataset_id, and BUCKET are assumed to be defined elsewhere.
dataset_ref = bigquery.DatasetReference(BQ_PROJECT, dataset_id)
client = bigquery.Client(project=BQ_PROJECT)
tables = client.list_tables(dataset_id)
job_config = bigquery.job.ExtractJobConfig()
job_config.destination_format = bigquery.DestinationFormat.NEWLINE_DELIMITED_JSON
for table in tables:
    # Export regular tables only (skip views and other table types).
    if table.table_type == "TABLE":
        table_id = table.table_id
        destination_blob = table_id
        table_ref = dataset_ref.table(table_id)
        destination_uri = "gs://{}/{}".format(BUCKET, destination_blob)

        extract_job = client.extract_table(
            table_ref,
            destination_uri,
            job_config=job_config,
            # Location must match that of the source table.
            location="EU",
        )  # API request
        extract_job.result()  # Waits for job to complete.

Answer 1

Score: 2

With the help of @johnHanley, I figured out that when I read the data from GCS using pandas, I get the original characters back: "Shirt \u0026 Trouser Presses" is read as "Shirt & Trouser Presses". This works because \u0026 is the standard JSON escape sequence for &, which JSON parsers decode on read. Hence the problem is solved.
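
The behaviour described in the answer can be checked without pandas. The following is a minimal sketch, assuming the exported file contains one JSON object per line; the field name "category" and the helper unescape_ndjson_line are hypothetical and only serve as illustration. It shows that a standard JSON parser decodes \u0026 back to &, and how a line could be re-serialized if the literal character is needed in the file itself.

```python
import json


def unescape_ndjson_line(line: str) -> str:
    """Parse one JSON line and re-serialize it with characters such as & written literally."""
    record = json.loads(line)
    # ensure_ascii=False also keeps non-ASCII characters as-is instead of escaping them.
    return json.dumps(record, ensure_ascii=False)


# One line as it might appear in the exported file.
# The field name "category" is a hypothetical example, not taken from the real table.
exported_line = r'{"category": "Shirt \u0026 Trouser Presses"}'

# Any conforming JSON parser decodes the escape sequence on read.
record = json.loads(exported_line)
decoded = record["category"]
print(decoded)

# Optional: rewrite the line so the file itself contains the literal character.
rewritten = unescape_ndjson_line(exported_line)
print(rewritten)
```

With pandas, reading such a file with pd.read_json(path, lines=True) should decode the escapes in the same way, which matches what the answer reports. Re-serializing is only needed if a downstream tool compares the raw file text instead of parsing it as JSON.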

huangapple
  • Posted on June 25, 2023 at 22:55:42
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76551022.html