Airflow - Make http request for each row of Big Query Result Set
Question
I would like to use an Airflow DAG to execute a GCP BigQuery query and, for each row of the result set, make a request to an endpoint. I would then like to store the responses of these HTTP requests in GCS as an external table.
I haven't found a way to pass the result set of a BigQuery query to another operator and iterate over it. The only way I found is to use a PythonOperator, but I suppose there is a better way to do this.
Answer 1
Score: 1
Solution 1
I think for this kind of need and use case, the easiest solution is to do the different operations in a PythonOperator, as you mentioned.
In the PythonOperator, you can use the Python BigQuery client to execute the BigQuery job. You can retrieve the result as a List of Dicts, then launch the API requests against your endpoint, one per element in the list (or maybe in a more optimized way), and again use the Python clients to store the result in GCS as an external table.
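As a rough sketch of Solution 1 (the query, the https://example.com/api endpoint, the bucket name, the object path, and the dataset/table names below are placeholders, not something from the original answer), the whole chain can live in one callable that you hand to a PythonOperator:

    import json

    import requests
    from google.cloud import bigquery, storage


    def query_call_and_store(**kwargs):
        bq_client = bigquery.Client()

        # 1. Execute the BigQuery job and retrieve the rows as a list of dicts.
        sql = "SELECT id, payload FROM `project.dataset.table`"  # placeholder query
        rows = [dict(row) for row in bq_client.query(sql).result()]

        # 2. Call the endpoint once per row (could also be batched or parallelized).
        responses = []
        for row in rows:
            resp = requests.post("https://example.com/api", json=row, timeout=30)
            resp.raise_for_status()
            responses.append(resp.json())

        # 3. Store the responses in GCS as newline-delimited JSON.
        bucket = storage.Client().bucket("your-bucket")  # placeholder bucket
        blob = bucket.blob("api_results/results.json")
        blob.upload_from_string("\n".join(json.dumps(r) for r in responses))

        # 4. Declare the GCS file as a BigQuery external table.
        external_config = bigquery.ExternalConfig("NEWLINE_DELIMITED_JSON")
        external_config.source_uris = ["gs://your-bucket/api_results/results.json"]
        external_config.autodetect = True
        table = bigquery.Table("project.dataset.api_results_external")
        table.external_data_configuration = external_config
        bq_client.create_table(table, exists_ok=True)

In the DAG, this callable would simply be the python_callable of a PythonOperator, wired in the same way as in the Solution 2 example below.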
Solution 2
You can also mix usual operators with a PythonOperator and use xcom to retrieve the result from the previous operator, for example:
import airflow
from airflow.operators.dummy import DummyOperator
from airflow.operators.python import PythonOperator
from airflow.providers.google.cloud.operators.bigquery import BigQueryInsertJobOperator


def call_api(**kwargs):
    ti = kwargs['ti']
    query_results = ti.xcom_pull(task_ids='your_query')
    # Make the API calls based on the query result


with airflow.DAG(
        "dag_id",
        default_args={},
        schedule_interval=None) as dag:

    your_query_task = BigQueryInsertJobOperator(
        task_id='your_query',
        configuration={
            "query": {
                "query": 'your_query',
                "useLegacySql": False,
            }
        },
        location='EU'
    )

    start_dag = DummyOperator(task_id='OK', dag=dag)

    api_call_task = PythonOperator(
        task_id="save_file_bq",
        op_kwargs={
            'dataset': 'dataset',
            'table': 'table'
        },
        python_callable=call_api
    )

    # The rest of the DAG to upload the result to GCS as an external table...
    # The query task must run before the API call task so its xcom is available.
    start_dag >> your_query_task >> api_call_task
For the rest of the logic in the DAG, you can decide which approach is best for you:
- Upload the result to GCS in the PythonOperator
- Or use a usual operator with xcom again, but in this case you should extend the built-in operator, as in the sketch below
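If you go the second route, one way to do it is a small custom operator that pulls the previous task's xcom result and writes it to GCS. This is only a minimal sketch, assuming a recent google provider version where GCSHook.upload accepts a data argument; the bucket and object names are placeholders:

    import json

    from airflow.models import BaseOperator
    from airflow.providers.google.cloud.hooks.gcs import GCSHook


    class ApiResultToGCSOperator(BaseOperator):
        """Pull the result pushed by an upstream task and write it to GCS as JSON."""

        def __init__(self, source_task_id, bucket_name, object_name, **kwargs):
            super().__init__(**kwargs)
            self.source_task_id = source_task_id
            self.bucket_name = bucket_name
            self.object_name = object_name

        def execute(self, context):
            # Whatever the upstream task pushed to xcom (e.g. the API responses).
            payload = context["ti"].xcom_pull(task_ids=self.source_task_id)
            # Placeholder bucket/object; upload with data= writes the string as the object content.
            GCSHook().upload(
                bucket_name=self.bucket_name,
                object_name=self.object_name,
                data=json.dumps(payload, default=str),
            )

You would then append something like ApiResultToGCSOperator(task_id='upload_to_gcs', source_task_id='save_file_bq', bucket_name='your-bucket', object_name='api_results/results.json') to the end of the chain (provided call_api returns the responses so they land in xcom), and declare the resulting file as an external table, for example with the client calls shown in the Solution 1 sketch.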