How to use iterations and conditions activities in Azure Data Factory
Question
I have 10000 files in an ADLS folder.
I get all the file names using the Get Metadata activity and pass them to a ForEach loop. In the first execution it should pick 1000 files and merge them into one single file; in the next run it should pick another set of files, up to 2000, after getting the total file count. Is there any workaround in Data Factory to run the set of files in batches?
Answer 1
Score: 0
To iterate over and merge batches of 1000 files, you first need to group the file names/paths that should be processed in each batch into separate list files. Then use the List of files option in the ADF Copy activity to merge each group of files.
Follow the process below (I tried it with 5 files in total, grouped 2 per batch):
- Add a Data flow activity and create a new data flow.
- In the source, add the dataset where all 10000 files are located; in the source options set Wildcard paths to `*`, and use Column to store file name to add a file name column to the data.
- Add an Aggregate transformation; in Group by select the filename column, and in Aggregates perform any aggregation on any column.
- Then add a Derived column transformation to remove the leading slash from the file name/path with the expression `dropLeft(filename, 1)`.
- In the sink, add the dataset where these list files should be stored, with First row as header unchecked (a sketch of such a dataset follows the data flow steps).
- Set Skip line count to 1.
- In the sink settings, set the File name option to Pattern and the pattern to `file[1]000.csv`.
- In Mapping, select only the filename column.
- In Optimize, set the partitioning to dynamic range based on the filename column and set the number of partitions; in your case, 10.
The output of the data flow is a set of list files, one per partition, each containing the file paths for one batch.
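For reference, here is a minimal sketch of what the sink dataset for those list files could look like in ADF JSON. The linked service, file system and folder names (AdlsGen2LinkedService, data, batch-lists) are placeholders I'm assuming for illustration, not values from the original answer:

```json
{
  "name": "BatchListDataset",
  "properties": {
    "type": "DelimitedText",
    "linkedServiceName": {
      "referenceName": "AdlsGen2LinkedService",
      "type": "LinkedServiceReference"
    },
    "typeProperties": {
      "location": {
        "type": "AzureBlobFSLocation",
        "fileSystem": "data",
        "folderPath": "batch-lists"
      },
      "columnDelimiter": ",",
      "firstRowAsHeader": false
    },
    "schema": []
  }
}
```

With First row as header unchecked, the list files contain only file paths, which is the format the Copy activity's List of files option expects later.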
- Then use a Get Metadata activity on that folder to get the list files, where each list file contains the set of files that need to be merged.
- Then pass these list files to a ForEach activity to loop over `@activity('Get Metadata1').output.childItems`.
- Inside the ForEach activity, add a Copy activity (a pipeline sketch follows these steps).
- Set the File path type to List of files and, in Path to file list, build the path to the current list file dynamically.
- In the Copy activity source, add the dataset that contains the files to be merged.
- Add the sink and set the Copy behavior to Merge files.
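As a rough orientation, this is how the Get Metadata, ForEach and Copy activities might look in pipeline JSON. The activity, dataset and folder names (BatchListFolder, SourceFilesDataset, MergedOutputDataset, data/batch-lists) and the dynamic file-list path are assumptions for illustration; adjust them to your environment:

```json
[
  {
    "name": "Get Metadata1",
    "type": "GetMetadata",
    "typeProperties": {
      "dataset": { "referenceName": "BatchListFolder", "type": "DatasetReference" },
      "fieldList": [ "childItems" ]
    }
  },
  {
    "name": "ForEachBatchList",
    "type": "ForEach",
    "dependsOn": [
      { "activity": "Get Metadata1", "dependencyConditions": [ "Succeeded" ] }
    ],
    "typeProperties": {
      "isSequential": true,
      "items": {
        "value": "@activity('Get Metadata1').output.childItems",
        "type": "Expression"
      },
      "activities": [
        {
          "name": "MergeOneBatch",
          "type": "Copy",
          "inputs": [ { "referenceName": "SourceFilesDataset", "type": "DatasetReference" } ],
          "outputs": [ { "referenceName": "MergedOutputDataset", "type": "DatasetReference" } ],
          "typeProperties": {
            "source": {
              "type": "DelimitedTextSource",
              "storeSettings": {
                "type": "AzureBlobFSReadSettings",
                "fileListPath": {
                  "value": "@concat('data/batch-lists/', item().name)",
                  "type": "Expression"
                }
              }
            },
            "sink": {
              "type": "DelimitedTextSink",
              "storeSettings": {
                "type": "AzureBlobFSWriteSettings",
                "copyBehavior": "MergeFiles"
              }
            }
          }
        }
      ]
    }
  }
]
```

Each iteration reads one list file via the file list path, copies only the source files named in it, and merges them into a single output file because the copy behavior is set to MergeFiles.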
Output: the Copy activity merges all the files named in each list file that we got from the data flow.