遍历结构体列以获取条件结果作为新列。

huangapple go评论120阅读模式
英文:

Looping through struct column to get conditional outcome as new columns

问题

以下是您要翻译的代码部分:

Say I have a table in BQ called `rating` with a struct column called `rating_record`. Schema as:

[
  {
    "name": "id",
    "mode": "NULLABLE",
    "type": "STRING",
    "description": null,
    "fields": []
  },
  {
    "name": "rating_record",
    "mode": "NULLABLE",
    "type": "RECORD",
    "description": null,
    "fields": [
      {
        "name": "high_drop",
        "type": "BOOLEAN",
        "fields": []
      },
      {
        "name": "medium_bump",
        "type": "BOOLEAN",
        "fields": []
      }
    ]
  }
]

rating_record 包含字段 high_dropmedium_bump,可能会有许多带有后缀 _drop_bump 的字段,其值为 true 或 false。我想要使用一个 dbt 宏来迭代此记录类型(struct)字段,以创建两个新列,分别对应 id,称为 drop_reasonsbump_reasons - 在这种情况下,如果值为 true,则 drop_reasons 将等于 'high_drop'

我尝试使用 sql_statement 来迭代记录,并使用 dbt_utils.get_query_results_as_dict 来获取结果,但无法相应地创建列。

{% set sql_statement %}
    select id, rating_record from {{ ref('source_table' }}
{% endset %}

{%- set ids_and_ratings = dbt_utils.get_query_results_as_dict(sql_statement) -%}

select

    {% for id in ids_and_ratings['id'] | unique -%}
        {% set bump_reasons = [] %}
        {% set drop_reasons = [] %}
        {% for rating_record in ids_and_ratings['rating_record'] | unique -%}
            {% for key, value in fromjson(rating_record).items() -%}
                {% if key.endswith('bump') and value is sameas true %}
                    {{ bump_reasons.append(key) }}
                {% elif key.endswith('drop') and value is sameas true %}
                    {{ drop_reasons.append(key) }}
                {% endif %}
            {% endfor %}
        {% endfor %}
        {{ print(drop_reasons) }}
    {% endfor %}

from {{ ref('source_table' }}

请注意,这是您的代码的翻译部分,我已经省略了您的要求以及问题。如果您需要更多信息或有其他问题,请随时告诉我。

英文:

Say I have a table in BQ called rating with a struct column called rating_record. Schema as:

[
  {
    "name": "id",
    "mode": "NULLABLE",
    "type": "STRING",
    "description": null,
    "fields": []
  },
  {
    "name": "rating_record",
    "mode": "NULLABLE",
    "type": "RECORD",
    "description": null,
    "fields": [
      {
        "name": "high_drop",
        "type": "BOOLEAN",
        "fields": []
      },
      {
        "name": "medium_bump",
        "type": "BOOLEAN",
        "fields": []
      }
]

rating_record contains fields high_drop and medium_bump and there could be many fields with suffix _drop and _bump with true or false values. I want to iterate this record type (struct) field using a dbt macro to create two new columns against an id called drop_reasons & bump_reasons - drop_reasons in this case would be = 'high_drop' if the value is true.

I tried to iterate the record with a sql_statement and using dbt_utils.get_query_results_as_dict to get the outcome but unable to create columns accordingly.

{% set sql_statement %}
    select id, rating_record from {{ ref('source_table' }}
{% endset %}

{%- set ids_and_ratings = dbt_utils.get_query_results_as_dict(sql_statement) -%}

select

    {% for id in ids_and_ratings['id'] | unique -%}
        {% set bump_reasons = [] %}
        {% set drop_reasons = [] %}
        {% for rating_record in ids_and_ratings['rating_record'] | unique -%}
            {% for key, value in fromjson(rating_record).items() -%}
                {% if key.endswith('bump') and value is sameas true %}
                    {{ bump_reasons.append(key) }}
                {% elif key.endswith('drop') and value is sameas true %}
                    {{ drop_reasons.append(key) }}
                {% endif %}
            {% endfor %}
        {% endfor %}
        {{ print(drop_reasons) }}
    {% endfor %}

from {{ ref('source_table' }}

答案1

得分: 0

I suppose values in the fields of the table schema is a single value rather than an array.
表模式字段中的值应该是单个值而不是数组。

get_query_results_as_dict returns a key of the column and the value to be a list of the column value.
get_query_results_as_dict返回列的键以及值,值是列值的列表。

In the jinja code the first loop iterate all the ids, for each id the second loop iterate all rating_record struct for all ids but not of the id in the first loop.
在jinja代码中,第一个循环迭代所有的id,对于每个id,第二个循环迭代所有id的评分记录结构,但不包括第一个循环中的id。

From the logic of your jinja code, it can be done using BigQuery.
根据您的jinja代码逻辑,可以使用BigQuery来完成。

The dummy data is
虚拟数据如下:

Added a macro to get keys in the rating_record struct
添加了一个宏来获取评分记录结构中的键

The model is
模型如下:

where
在这里

will result in a table with columns id, tf which contains true/false, and bump_drop with values of something_bump/something_drop.
将生成一个包含列id、tf(包含true/false)和bump_drop(其值为something_bump/something_drop)的表。

英文:

I suppose values in the fields of the table schema is a single value rather than an array.
get_query_results_as_dict returns a key of the column and the value to be a list of the column value. In the jinja code the first loop iterate all the ids, for each id the second loop iterate all rating_record struct for all ids but not of the id in the first loop.
From the logic of your jinja code, it can be done using BigQuery.
The dummy data is

insert into database.table_name values
    ("01", STRUCT(True, False, True, True)),
    ("02", STRUCT(False, False, True, False)),
    ("03", STRUCT(False, True, True, False))

Added a macro to get keys in the rating_record struct

{% macro get_struct_fields() %}
{% set query %}
SELECT split(field_path, ".")[OFFSET(1)] as fields
FROM `project_id`.`region-us`.INFORMATION_SCHEMA.COLUMN_FIELD_PATHS 
WHERE table_name = "struct_tbl"
and column_name = "rating_record"
and data_type = "BOOL" 
{% endset %}
{% set results = run_query(query) %}
{% if execute %}
    {% set fields = results.columns[0].values() %}
{% else %}
    {% set fields = [] %}
{% endif %}
{%do log(fields, info=true) %}
{{ return(fields)}}
{% endmacro %} 

The model is

{% set fields = get_struct_fields() %}
{% set fields = fields | join(", ") %}
select *
    from
    (select id, split(bump_drop, "_")[offset(1)] as bump_drop, bump_drop as 
    reason
    from
    (select id, rating_record.* from `project_id.database.table_name`)
    unpivot (tf for bump_drop in ({{ fields }}))
    where tf is true
    )
    pivot (array_agg(reason ignore nulls) as reasons for bump_drop in ("bump", 
    "drop"))

where

select * from (
select id, rating_record.* from `project_id.database.table_name`)
unpivot (tf for bump_drop in (high_bump, high_drop, medium_bump, medium_drop))

will result in a table with columns id, tf which contains true/false, and bump_drop with values of something_bump/something_drop.

Then select only true part and separate out the 'bump' and 'drop'.
The unpivot will put reasons for bump and drop separately into an array for each id.

huangapple
  • 本文由 发表于 2023年5月10日 21:16:42
  • 转载请务必保留本文链接:https://go.coder-hub.com/76218917.html
匿名

发表评论

匿名网友

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen:

确定