Iterate on files generated in a single Data Flow to send to an API - Azure Data Factory
Question
My question is quite complicated and I think I don't have a good approach, but you will tell me, I'm sure. ;)
I'm in Azure Data Factory.
I have a Data Flow that generates several files. I partitioned with "Name file as column data".
These files are CSV and are stored in Azure Storage.
After that, in the pipeline, I get all the files in the dataset by using Get Metadata with the "Child Items" argument. "Resource" is the dataset that I also used in the Data Flow.
Then, in a ForEach, I take all the files in the dataset, meaning all the files in a specific folder in Azure Storage, and send each one to an API.
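For context, here is a minimal sketch of that wiring as pipeline JSON (the activity names `Get Metadata1`, `ForEach1`, and `Call API`, and the API URL, are illustrative assumptions, not taken from my actual pipeline):

```json
[
  {
    "name": "Get Metadata1",
    "type": "GetMetadata",
    "typeProperties": {
      "dataset": { "referenceName": "Resource", "type": "DatasetReference" },
      "fieldList": [ "childItems" ]
    }
  },
  {
    "name": "ForEach1",
    "type": "ForEach",
    "dependsOn": [ { "activity": "Get Metadata1", "dependencyConditions": [ "Succeeded" ] } ],
    "typeProperties": {
      "items": { "value": "@activity('Get Metadata1').output.childItems", "type": "Expression" },
      "activities": [
        {
          "name": "Call API",
          "type": "WebActivity",
          "typeProperties": {
            "url": "https://example.com/process",
            "method": "POST",
            "body": { "value": "{ \"fileName\": \"@{item().name}\" }", "type": "Expression" }
          }
        }
      ]
    }
  }
]
```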
The API processes each file: the file name is passed to it, it does its things, and it moves the file to an Archive folder.
The problem is that the API may not move a file to the Archive folder because an error occurred during processing. If that happens, the next pipeline run will call the API for that file again, because the file is still in the dataset folder.
I would like this to be more robust: I would like the ForEach to only process the files just created by the Data Flow.
If you have an idea of how to do that, I would be grateful. :)
To be honest, I am not sure the approach as it stands is a good one.
I tried to pass a file name to Get Metadata, hoping to iterate over a list of file names that way, but I don't think it's possible to send a list.
Answer 1
Score: 0
I am able to achieve your requirement by using **Filter by last modified** in the Get Metadata activity, and credit to **@[Joel Cochran](https://stackoverflow.com/users/75838/joel-cochran "7,069 reputation")** for the suggestion.
These are the files before the pipeline run.
Create a string variable with `@utcNow()` before the Data Flow activity. After the Data Flow activity, use that variable as the start date of **Filter by last modified** in the Get Metadata activity, and give `@utcNow()` as the end date, like below.
This will filter down to only the files that were created by the Data Flow.
These are the files in my target location after the pipeline run.
This is the Child Items array of the Get Metadata activity.
You can pass this list to your ForEach activity.
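For example, the ForEach items field can be set to the expression below (again assuming the Get Metadata activity is named `Get Metadata1`); inside the loop, `@item().name` then gives each file name to pass to the API:

```json
{ "value": "@activity('Get Metadata1').output.childItems", "type": "Expression" }
```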