在BigQuery中创建无模式架构的备份 / 通过查询从JSON类型创建表

huangapple go评论106阅读模式
英文:

Creating Schema Agnostic Backups in BigQuery / Creating Table from JSON type through query

问题

感谢@NickW的提醒,通过BigQuery的本机快照功能,整个问题变得无关紧要...最好避免重新发明轮子,而是使用框架支持的内容!我对这个解决方案感到满意。

我有以下问题:
我想创建无需模式的备份,也就是说,如果原始数据的模式发生更改,不会破坏我的备份。

我的想法是将其存储为JSON文件,然后查询它,就像这样(在BigQuery SQL中):

WITH aggregated_table as (
  -- 将所有数据聚合成一个JSON文件
  SELECT id ,TO_JSON(row_struct) AS data
  -- 将每一行提取为STRUCT
  FROM mytable row_struct)

-- 备份使用时间作为主键
SELECT CURRENT_TIMESTAMP() as backup_time, id, data from aggregated_table

唯一的问题是备份的恢复非常繁琐,我没有找到让它自动识别模式或至少自动识别列名的方法。

在我的情况下,表格有300多列,我不想像这样指定每一列:

SELECT
  backup_time,
  data.id as id,
  TIMESTAMP(data.some_time_value) as some_time_value,
  TIMESTAMP(data.some_string_value, ") as some_string_value,
  --...另外300多列...
FROM my_backup WHERE backup_time = xxx

我已经长时间研究了BigQuery上的JSON文档,但我找不到适合我的情况的功能...

是否有人有解决方案?也许像这样使用JSON文件工作是一种反模式,还有其他方法可以做到吗?

非常感谢!

我期望有一个类似于展开JSON的功能,基本上可以逆转我为备份所做的打包。我知道Python中的Pandas库可以从JSON中填充表格,但我想找到在BigQuery SQL中执行此操作的方法。

英文:

Edit: Thanks to @NickW I was shown that with the Native Snapshot function of BigQuery the whole question becomes irrelevant ... it is better avoid reinventing the weel but use what is supported by the framework insted! I'm happy with the solution.

I have the following problem:
I want to create backups that are schema-agnostic, meaning, if the schema (of my raw data) is changed it doesn't fry up my backups.

My idea was to store it into a JSON file, and then query it, like this (in BigQuery SQL):

WITH aggregated_table as (
  -- Aggregate all data into one JSON file
  SELECT id ,TO_JSON(row_struct) AS data
  -- Extract each row as a STRUCT
  FROM mytable row_struct)

-- Backups use time as primary key
SELECT CURRENT_TIMESTAMP() as backup_time, id, data from aggregated_table

The only issue is that the recovery of the backups is really tedious and I haven't found a way to make it recognize the schema, or at least the column names by itself.

In my case the table has 300+ columns and I don't want to specify every single one like this:

SELECT
  backup_time,
  data.id as id,
  TIMESTAMP(data.some_time_value) as some_time_value,
  TIM(data.some_string_value, ") as some_string_value,
  --...300 more columns ...
FROM my_backup WHERE backup_time = xxx

I have looked at the JSON documentations on BigQuery for a long time, but I couldn't find the functionality that fits my case ...

Does anyone have a solution? Maybe working with JSON files like this is an antipattern and there is another way to do it?

Many thanks!

I expected a function like flatten JSON that basically reverses my packing I did for the backup. I know the Pandas library in Python can fill table from JSONs but I want to find a way to do it in BigQuery SQL.

答案1

得分: 0

不要自行创建解决方案,我建议从现有的一些内置过程开始(如表格导出、快照等)。

英文:

Comment converted to answer…

Instead of rolling your own solution, I would start with one of the processes available out-of-the-box (table export, snapshots, etc)

huangapple
  • 本文由 发表于 2023年3月7日 22:13:12
  • 转载请务必保留本文链接:https://go.coder-hub.com/75663097.html
匿名

发表评论

匿名网友

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen:

确定