How to trigger an event-based pipeline in AzureML?

Question

I have a published pipeline in AzureML that preprocesses the data and trains a new model. I am trying to use an event-based schedule so that whenever a new dataset is registered in the workspace, the whole training pipeline is triggered. I am using the Python AzureML SDK v1.

Using the information from the docs, I tried setting up the schedule as follows:

from azureml.core import Datastore
from azureml.pipeline.core import Schedule

datastore = Datastore(workspace=ws, name="workspaceblobstore")

reactive_schedule = Schedule.create(ws, name="MyReactiveSchedule", description="Based on input file change.", pipeline_id=pipeline_id, experiment_name=experiment_name, datastore=datastore, polling_interval=2)

When I check the status of the schedule, it shows as active. However, when I register a new dataset in the blob storage associated with the workspace, nothing happens even if I wait for more than 5 minutes.

Can someone help me understand how this works in terms of triggering the pipeline when a new dataset is registered?

Answer 1

Score: 0

AzureML reacts to data changes in the datastore, not to dataset registrations. If you register a new version of a dataset using the same data path and the data itself hasn't changed, the pipeline may not be triggered. When creating a reactive schedule, you can specify the path_on_datastore parameter to define which folder or file to monitor.

If you don't specify this parameter, it defaults to the root of the datastore. Make sure the data you are changing or adding lands in the monitored location.

As a simple test, try manually adding a file to the monitored path in the workspaceblobstore through the Azure portal (or another method) and see whether that triggers the pipeline. This helps distinguish issues with dataset registration from issues with datastore monitoring.
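
For instance, a minimal sketch of such a test using the SDK v1 API (assuming workspaceblobstore is an Azure Blob datastore and that the schedule monitors r-pipeline-data/mp3/ as in the example further below; the local file ./test.mp3 is just a placeholder):

from azureml.core import Workspace, Datastore

ws = Workspace.from_config()
datastore = Datastore.get(ws, datastore_name="workspaceblobstore")

# Upload a local file into the monitored folder on the datastore.
# A new blob appearing under this path is what the reactive schedule polls for.
datastore.upload_files(files=["./test.mp3"],
                       target_path="r-pipeline-data/mp3/",
                       overwrite=True,
                       show_progress=True)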

Below is an example of a change-based schedule. In this example, the pipeline is triggered when a new mp3 file is added to a specific blob container.

from azureml.core import Datastore
from azureml.data.datapath import DataPath
from azureml.pipeline.core import Schedule

datastore = Datastore.get(ws, datastore_name='<your-datastore>')

# Poll the datastore every minute; when new files appear under path_on_datastore,
# submit the published pipeline with the changed path bound to "input_mp3_data".
reactive_schedule = Schedule.create(ws, 
                                    name="R-Schedule", 
                                    description="Based on input file change.",
                                    pipeline_id=published_pipeline.id, 
                                    experiment_name=experiment_name, 
                                    datastore=datastore,
                                    polling_interval=1,
                                    data_path_parameter_name="input_mp3_data",
                                    path_on_datastore='r-pipeline-data/mp3/' 
                                   )
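
Note that data_path_parameter_name must correspond to a DataPath PipelineParameter defined in the published pipeline; the schedule sets it to the path of the changed blob. A rough sketch of how such a parameter is typically wired into a pipeline step (the script name, compute target and source directory are placeholders, not from the original post):

from azureml.data.datapath import DataPath, DataPathComputeBinding
from azureml.pipeline.core import PipelineParameter
from azureml.pipeline.steps import PythonScriptStep

# Default DataPath used when the pipeline is run without the schedule.
default_path = DataPath(datastore=datastore, path_on_datastore='r-pipeline-data/mp3/')
datapath_param = PipelineParameter(name="input_mp3_data", default_value=default_path)
datapath_input = (datapath_param, DataPathComputeBinding(mode='mount'))

step = PythonScriptStep(name="preprocess-and-train",
                        script_name="train.py",
                        arguments=["--input", datapath_input],
                        inputs=[datapath_input],
                        compute_target=compute_target,
                        source_directory=".")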

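To confirm the schedule is picking up changes, you can list the schedules attached to the published pipeline and look at the runs submitted to the target experiment. A small sketch, reusing ws, published_pipeline and experiment_name from above:

from azureml.core import Experiment
from azureml.pipeline.core import Schedule

# Schedules registered against this published pipeline, with their status.
for s in Schedule.get_schedules_for_pipeline_id(ws, published_pipeline.id):
    print(s.id, s.name, s.status)

# Runs triggered by the schedule appear in the experiment it was created with.
for run in Experiment(ws, experiment_name).get_runs():
    print(run.id, run.get_status())

# When finished testing, the schedule can be turned off with reactive_schedule.disable().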