Azure Data Factory copy activity using a binary dataset fails to copy folder contents when the dataset is parameterized
Question
In my Azure Data Factory I need to copy data from an SFTP source that organizes its data into date-based directories with the following hierarchy:
year -> month -> date -> file
I have created a linked service and a binary dataset. In the dataset, "filesystem" points to the host and "Directory" points to the folder that contains the year directories, for example host/exampledir/yeardir/, where yeardir contains the year directories.
When I type the folder "2015" directly into the dataset, the copy activity copies the entire 2015 folder. However, if I add a parameter for the directory and pass the same folder path in from the copy activity, it instead creates a file called "2015" in my blob storage that contains no data.
My current workaround is a nested sequence of Get Metadata and ForEach activities that drill into each folder and subfolder and copy the individual files at the end. The desired result is for a single binary dataset to copy each folder without needing Get Metadata.
Is this possible within the scope of the data factory?
Edit:
[Image: properties used in copy activity]
To add further context, I have tried manually writing the file path into the copy activity, as shown in the photo. I have also tried using variables, using dynamic content for the parameter (base file path plus concat), and putting the base file path into the dataset alongside @dataset().filePath. None of these has worked so far: they either copy nothing or create the empty file I mentioned earlier.
The sink is a binary dataset linked to Azure Data Lake Storage Gen2.
Update:
The accepted answer is the solution. My problem was that the source dataset path, when passed as a parameter, ended up with a newline at the end. I used concat to clean this up, and it has worked since then.
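To illustrate why a stray newline matters, here is a minimal Python sketch. It is not Data Factory expression syntax, and the helper name clean_folder_path and the sample parameter value are hypothetical. It only shows that a path with a trailing newline is a different string from the intended folder path, and that trimming it restores the expected value.

```python
def clean_folder_path(raw: str) -> str:
    """Strip surrounding whitespace (including a trailing newline) and trailing slashes."""
    return raw.strip().rstrip("/")


base = "exampledir/yeardir"
raw_param = "2015\n"  # hypothetical parameter value carrying a stray newline

dirty = f"{base}/{raw_param}"                      # path as it would be built unmodified
clean = f"{base}/{clean_folder_path(raw_param)}"   # path after trimming the parameter

# repr() makes the otherwise invisible newline visible
print(repr(dirty))
print(repr(clean))
```

Printing the value with repr (or an equivalent debug output of the parameter) is a quick way to spot this kind of invisible character before it reaches the dataset.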
Answer 1
Score: 0
Since giving exampledir/yeardir/2015 worked for you and you want to copy all the folders present in exampledir/yeardir, you can follow the procedure below:
- Use a Get Metadata activity to get the child items of the folder exampledir/yeardir/ (in my demonstration, the path is maindir/yeardir).
- This returns all the year folders present. I have used only 2020 and 2021 as an example.
- Add a single ForEach activity whose items value is the child items output of the Get Metadata activity:
@activity('Get Metadata1').output.childItems
- Inside the ForEach, place the Copy data activity. For both source and sink, I created a dataset parameter for the path. I gave the following dynamic content for the source path:
maindir/yeardir/@{item().name}
- For the sink, I gave the output directory as follows:
outputDir/@{item().name}
- Since giving the path manually as exampledir/yeardir/2015 worked, we get the list of year folders with the Get Metadata activity, loop through each of them, and copy each folder with the source path exampledir/yeardir/<current_iteration_year_folder>.
- Based on how I have given the sink path, the data is copied along with its contents.
[Image: reference image]
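The path construction performed by the loop above can be sketched in Python. This is only a simulation under stated assumptions, not pipeline code: build_copy_paths is a hypothetical helper, the child_items list mimics the name/type entries that Get Metadata returns for childItems, and the filter on "Folder" and the strip() call are my additions rather than part of the answer.

```python
from typing import Dict, List, Tuple


def build_copy_paths(
    child_items: List[Dict[str, str]],
    source_base: str = "maindir/yeardir",
    sink_base: str = "outputDir",
) -> List[Tuple[str, str]]:
    """Return one (source_path, sink_path) pair per year folder."""
    paths = []
    for item in child_items:
        # Assumption: skip anything that is not a folder (not in the original answer)
        if item.get("type") != "Folder":
            continue
        # Assumption: trim stray whitespace so the path matches the real folder name
        name = item["name"].strip()
        # Mirrors maindir/yeardir/@{item().name} and outputDir/@{item().name}
        paths.append((f"{source_base}/{name}", f"{sink_base}/{name}"))
    return paths


# Sample data shaped like the childItems output, using the two example years
child_items = [
    {"name": "2020", "type": "Folder"},
    {"name": "2021", "type": "Folder"},
]

for source_path, sink_path in build_copy_paths(child_items):
    print(source_path, "->", sink_path)
```

Each pair corresponds to one iteration of the ForEach, where the copy activity receives the source and sink paths through the dataset parameters.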