How to trigger an event-based pipeline in AzureML?

Question

I have a published pipeline in AzureML that preprocesses the data and trains a new model. I am trying to set up an event-based schedule so that whenever a new dataset is registered in the workspace, the whole training pipeline is triggered. I am using the Python AzureML SDK v1.

Using the information from the docs, I tried setting up the schedule as follows:

from azureml.core import Datastore
from azureml.pipeline.core import Schedule

datastore = Datastore(workspace=ws, name="workspaceblobstore")

reactive_schedule = Schedule.create(ws, name="MyReactiveSchedule", description="Based on input file change.", pipeline_id=pipeline_id, experiment_name=experiment_name, datastore=datastore, polling_interval=2)

When I check the status of the schedule, it says it's active. However, when I register a new dataset in the blob storage associated with the workspace, nothing happens, even if I wait for more than 5 minutes.

Can someone help me understand how this works in terms of triggering the pipeline when a new dataset is registered?

Answer 1

Score: 0

AzureML reacts to data changes in the datastore, not to dataset registrations. If you register a new version of a dataset using the same data path and the data itself hasn't changed, the pipeline may not be triggered. When creating a reactive schedule, you can specify the path_on_datastore parameter to define which folder or file to monitor.

If you don't specify this parameter, it will default to the root of the datastore. Ensure the data you're changing/adding is in the correct location.

As a simple test, try manually adding a file to the monitored path in your workspaceblobstore through the Azure portal (or another method) and see if that triggers the pipeline. This can help differentiate between issues with dataset registration and issues with the datastore monitoring.
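
If you prefer to run that test from the SDK instead of the portal, an upload along these lines should land a file in the monitored folder (a sketch; the local file name and target_path are placeholders for whatever path your schedule watches):

from azureml.core import Datastore

datastore = Datastore.get(ws, datastore_name="workspaceblobstore")

# Upload a local test file into the monitored path; "test.txt" and
# target_path are placeholders - use the folder your schedule is watching.
datastore.upload_files(files=["./test.txt"],
                       target_path="path/being/monitored/",
                       overwrite=True,
                       show_progress=True)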

Below is an example of a change-based schedule. In this example, the pipeline is triggered when a new mp3 file is added to a specific blob container.

from azureml.core import Datastore
from azureml.data.datapath import DataPath
from azureml.pipeline.core import Schedule

datastore = Datastore.get(ws, datastore_name='<your-datastore>')

reactive_schedule = Schedule.create(ws,
                                    name="R-Schedule",
                                    description="Based on input file change.",
                                    pipeline_id=published_pipeline.id,
                                    experiment_name=experiment_name,
                                    datastore=datastore,
                                    polling_interval=1,                         # minutes between checks of the datastore
                                    data_path_parameter_name="input_mp3_data",  # DataPath pipeline parameter on the published pipeline
                                    path_on_datastore='r-pipeline-data/mp3/'    # folder to watch for new/changed files
                                   )
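
Note that data_path_parameter_name must match a DataPath pipeline parameter defined on the published pipeline; the schedule fills that parameter with the path of the data that changed. Below is a rough sketch of how such a parameter could be wired into a step when the pipeline is built; the step name, script, and compute target are placeholders, not taken from the question:

from azureml.core import Datastore
from azureml.data.datapath import DataPath, DataPathComputeBinding
from azureml.pipeline.core import PipelineParameter
from azureml.pipeline.steps import PythonScriptStep

datastore = Datastore.get(ws, datastore_name='<your-datastore>')

# Default value for the DataPath parameter; the reactive schedule overrides it
# at run time with the path of the changed file/folder.
default_data = DataPath(datastore=datastore, path_on_datastore='r-pipeline-data/mp3/')
data_path_param = PipelineParameter(name="input_mp3_data", default_value=default_data)
input_mp3 = (data_path_param, DataPathComputeBinding(mode='mount'))

# Placeholder training step: script name, source directory, and compute target
# are illustrative only.
train_step = PythonScriptStep(name="train",
                              script_name="train.py",
                              arguments=["--input", input_mp3],
                              inputs=[input_mp3],
                              compute_target=compute_target,
                              source_directory="./src")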
