提取 Athena / Presto 中具有动态键的 JSON 对象

huangapple go评论85阅读模式
英文:

Extracting JSON objects in Athena / Presto for dynamic keys

问题

以下是翻译好的部分:

我在我的表中有两个格式化为字符串的列。其中一个列中包含JSON对象。JSON对象中的键是不固定的,所以问题在于我不能使用标准的JSON提取函数。下面是表的外观:

timestamp upstream_tables
2023-02-02T17:34:55.946Z {"ETL_table_1":true,"ETL_table_2":true}
2023-02-02T13:30:11.882Z {"ETL_table_3":true}
2023-02-02T20:12:22.116Z {"ETL_table_4":true,"ETL_table_5":true,"ETL_table_6":false}

我想要将这个表展平成以下的形式:

timestamp upstream_tablename was_completed
2023-02-02T17:34:55.946Z ETL_table_1 TRUE
2023-02-02T17:34:55.946Z ETL_table_2 TRUE
2023-02-02T13:30:11.882Z ETL_table_3 TRUE
2023-02-02T20:12:22.116Z ETL_table_4 TRUE
2023-02-02T20:12:22.116Z ETL_table_5 TRUE
2023-02-02T20:12:22.116Z ETL_table_6 FALSE

有人可以帮忙吗?已经花了很多时间使用map_values和map_keys,但无法弄对。

我唯一能想出的最接近的方法是这样的:

select
   timestamp,
    t.*
FROM mytable
   CROSS JOIN UNNEST(map_entries(CAST(json_extract(upstream_tables, '$') AS MAP(VARCHAR, VARCHAR)))) AS t

希望这对你有帮助。

英文:

I have two string formatted columns in my table. One of the columns has json objects in it. The keys in the JSON objects are not fixed - so the problem is I cant use standard json extract functions. Here is how the table looks

timestamp upstream_tables
2023-02-02T17:34:55.946Z {"ETL_table_1":true,"ETL_table_2":true}
2023-02-02T13:30:11.882Z {"ETL_table_3":true}
2023-02-02T20:12:22.116Z {"ETL_table_4":true,"ETL_table_5":true,"ETL_table_6":false}

I want to flatten this table to something like below

timestamp upstream_tablename was_completed
2023-02-02T17:34:55.946Z ETL_table_1 TRUE
2023-02-02T17:34:55.946Z ETL_table_2 TRUE
2023-02-02T13:30:11.882Z ETL_table_3 TRUE
2023-02-02T20:12:22.116Z ETL_table_4 TRUE
2023-02-02T20:12:22.116Z ETL_table_5 TRUE
2023-02-02T20:12:22.116Z ETL_table_6 FALSE

Can anyone please help? Have spent a lot of time using map_values and map_keys but couldnt get it right.

The only closest thing I could come up with was this

select
   timestamp,
    t.*
FROM mytable
   CROSS JOIN UNNEST(map_entries(CAST(json_extract(upstream_tables, '$') AS MAP(VARCHAR, VARCHAR)))) AS t

答案1

得分: 1

以下是您要翻译的内容:

@martin-traverso 的答案可用于 Athena 引擎 v. 3,该引擎基于 Trino。对于 v.2 和 v.3,您已经发现了将数据转换为 map 的主要技巧。我建议从使用 json_extract 切换到 json_parse(将字符串转换为 JSON),跳过 map_entries(Presto/Trino 可以将映射解析为键-值对,可选使用 MAP(VARCHAR, JSON) 作为目标类型),并为解析结果指定列名。Presto/Trino 可以将映射解析为键-值对:

WITH data(ts, value) AS (
    VALUES
        (from_iso8601_timestamp('2023-02-02T17:34:55.946Z'), VARCHAR '{"ETL_table_1":true,"ETL_table_2":true}'),
        (from_iso8601_timestamp('2023-02-02T13:30:11.882Z'), VARCHAR '{"ETL_table_3":true}'),
        (from_iso8601_timestamp('2023-02-02T20:12:22.116Z'), VARCHAR '{"ETL_table_4":true,"ETL_table_5":true,"ETL_table_6":false}')

)
select
   ts,
    t.*
FROM data
   CROSS JOIN UNNEST(CAST(json_parse(value) AS MAP(VARCHAR, JSON))) AS t(upstream_tablename, was_completed);
ts upstream_tablename was_completed
2023-02-02 17:34:55.946 UTC ETL_table_1 true
2023-02-02 17:34:55.946 UTC ETL_table_2 true
2023-02-02 13:30:11.882 UTC ETL_table_3 true
2023-02-02 20:12:22.116 UTC ETL_table_4 true
2023-02-02 20:12:22.116 UTC ETL_table_5 true
2023-02-02 20:12:22.116 UTC ETL_table_6 false
英文:

@martin-traverso's answer can be used with Athena engine v. 3 which is based on Trino, for both v.2 and v.3 the main trick of casting to map you have discovered, I would switch from using json_extract to json_parse (to transform from string to json), skip map_entries (Presto/Trino can unnest maps to key-value pairs, optionally use MAP(VARCHAR, JSON) as target type) and specify column names for unnest result, Presto/Trino can unnest maps to key-value pairs:

WITH data(ts, value) AS (
    VALUES
        (from_iso8601_timestamp('2023-02-02T17:34:55.946Z'), VARCHAR '{"ETL_table_1":true,"ETL_table_2":true}'),
        (from_iso8601_timestamp('2023-02-02T13:30:11.882Z'), VARCHAR '{"ETL_table_3":true}'),
        (from_iso8601_timestamp('2023-02-02T20:12:22.116Z'), VARCHAR '{"ETL_table_4":true,"ETL_table_5":true,"ETL_table_6":false}')

)
select
   ts,
    t.*
FROM data
   CROSS JOIN UNNEST(CAST(json_parse(value) AS MAP(VARCHAR, JSON))) AS t(upstream_tablename, was_completed);
ts upstream_tablename was_completed
2023-02-02 17:34:55.946 UTC ETL_table_1 true
2023-02-02 17:34:55.946 UTC ETL_table_2 true
2023-02-02 13:30:11.882 UTC ETL_table_3 true
2023-02-02 20:12:22.116 UTC ETL_table_4 true
2023-02-02 20:12:22.116 UTC ETL_table_5 true
2023-02-02 20:12:22.116 UTC ETL_table_6 false

答案2

得分: 0

你可以使用 UNNESTjson_query 的组合来完成这个任务。

首先,使用 json_query 将所有字段-值对转换为规范形式,形如 {"name": "xxx", value: yyy}。然后,将它们转换为一组行并展开它们成单独的行。

WITH data(ts, value) AS (
    VALUES
        (from_iso8601_timestamp('2023-02-02T17:34:55.946Z'), VARCHAR '{"ETL_table_1":true,"ETL_table_2":true}'),
        (from_iso8601_timestamp('2023-02-02T13:30:11.882Z'), VARCHAR '{"ETL_table_3":true}'),
        (from_iso8601_timestamp('2023-02-02T20:12:22.116Z'), VARCHAR '{"ETL_table_4":true,"ETL_table_5":true,"ETL_table_6":false}')
)
SELECT ts, t.name, t.value
FROM data, UNNEST(CAST(json_parse(json_query(value, 'strict $.keyvalue()' WITH ARRAY WRAPPER)) AS array(row(name varchar, value boolean))) t(name, value)
英文:

You can do this with a combination of UNNEST and json_query.

First, use json_query to convert all the field-value pairs to a normalized form
with the shape {"name": "xxx", value: yyy}. Then, convert these to an array of
rows and unnest them into individual rows.

WITH data(ts, value) AS (
    VALUES
        (from_iso8601_timestamp('2023-02-02T17:34:55.946Z'), VARCHAR '{"ETL_table_1":true,"ETL_table_2":true}'),
        (from_iso8601_timestamp('2023-02-02T13:30:11.882Z'), VARCHAR '{"ETL_table_3":true}'),
        (from_iso8601_timestamp('2023-02-02T20:12:22.116Z'), VARCHAR '{"ETL_table_4":true,"ETL_table_5":true,"ETL_table_6":false}')

)
SELECT ts, t.name, t.value
FROM data, UNNEST(CAST(json_parse(json_query(value, 'strict $.keyvalue()' WITH ARRAY WRAPPER)) AS array(row(name varchar, value boolean)))) t(name, value)

huangapple
  • 本文由 发表于 2023年3月31日 23:45:00
  • 转载请务必保留本文链接:https://go.coder-hub.com/75900420.html
匿名

发表评论

匿名网友

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen:

确定