How to export BQ tables to GCS in JSON format without change in encoding
Question
I exported a BQ table to GCS in JSON format using Python. The export succeeded, but when I downloaded the JSON files from GCS, I noticed that special characters had changed. For example,
Shirt & Trouser Presses
in BQ became
Shirt \u0026 Trouser Presses
in GCS.
Is there a way to ensure that the encoding does not change while exporting from BQ to GCS in JSON format?
Here is the code snippet I use:
from google.cloud import bigquery

# BQ_PROJECT, dataset_id and BUCKET are defined elsewhere.
dataset_ref = bigquery.DatasetReference(BQ_PROJECT, dataset_id)
client = bigquery.Client(project=BQ_PROJECT)
tables = client.list_tables(dataset_id)

job_config = bigquery.job.ExtractJobConfig()
job_config.destination_format = bigquery.DestinationFormat.NEWLINE_DELIMITED_JSON

for table in tables:
    # Only export regular tables, not views or external tables.
    if table.table_type == "TABLE":
        table_id = table.table_id
        destination_blob = table_id
        table_ref = dataset_ref.table(table_id)
        destination_uri = "gs://{}/{}".format(BUCKET, destination_blob)
        extract_job = client.extract_table(
            table_ref,
            destination_uri,
            job_config=job_config,
            # Location must match that of the source table.
            location="EU",
        )  # API request
        extract_job.result()  # Waits for job to complete.
Answer 1
Score: 2
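With the help of @johnHanley, I figured out that when I read the data from GCS using pandas, I get the original characters back: "Shirt \u0026 Trouser Presses" is read as "Shirt & Trouser Presses". The export does not change the encoding; \u0026 is simply the JSON escape sequence for the & character, and a JSON parser decodes it back when the file is read. Hence the problem is solved.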
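To see why the escaped form is harmless, here is a minimal sketch using only Python's standard json module. The line below is a made-up stand-in for one row of the exported newline-delimited JSON file, and the field name "category" is an assumption for illustration; it is not taken from the actual table.

```python
import json

# One line as it might appear in the exported newline-delimited JSON file.
# The field name "category" is hypothetical; the raw string keeps the
# backslash sequence exactly as it would be stored in the file.
exported_line = r'{"category": "Shirt \u0026 Trouser Presses"}'

# Parsing the line decodes the \u0026 escape back into a literal "&".
record = json.loads(exported_line)
print(record["category"])

# Re-serializing with the standard library writes "&" unescaped,
# because "&" is a plain ASCII character.
print(json.dumps(record))
```

Reading the file with pandas.read_json(..., lines=True) should give the same decoded value, since each line is parsed as JSON in the same way.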