Azure Data Factory copy activity using a binary dataset fails to copy folder contents if parameterized

Question

In my Azure Data Factory I need to copy data from an SFTP source that organizes its data into date-based directories with the following hierarchy:
year -> month -> date -> file

I have created a linked service and a binary dataset in which the dataset's "filesystem" points to the host and "Directory" points to the folder that contains the year directories, e.g. host/exampledir/yeardir/, where yeardir contains the year directories.

When I manually enter the folder "2015" in the dataset, the copy activity copies the entire 2015 folder. However, if I add a parameter for the directory and pass the same folder path in from the copy activity, it creates a file called "2015" in my blob storage that contains no data.

My current workaround is a nested sequence of Get Metadata activities and ForEach loops that drill into each folder and subfolder and copy the individual files at the end. The desired result, however, is for the single binary dataset to copy each folder without the need for Get Metadata.

Is this possible within the scope of the data factory?

Edit:

[Screenshot: manual file path that works]

[Screenshot: parameterized file path]

[Screenshot: properties used in the copy activity]

For further context: I have tried manually writing the file path into the copy activity, as shown in the screenshot. I have also tried using variables, dynamic content for the parameter (a base file path combined with concat), and putting the base file path into the dataset alongside @dataset().filePath. None of these has worked so far; each either copies nothing or creates the empty file I mentioned earlier.
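For reference, a parameterized binary dataset of the kind described above might look like the following. This is a minimal sketch, not the asker's actual definition: the dataset name `SftpBinarySource` and the linked service name `SftpLinkedService` are hypothetical, and only the `filePath` parameter name comes from the question. The sketch binds the parameter to `folderPath` and leaves `fileName` unset, so the value is interpreted as a directory.

```json
{
  "name": "SftpBinarySource",
  "properties": {
    "description": "Sketch only. Dataset and linked service names are hypothetical.",
    "linkedServiceName": {
      "referenceName": "SftpLinkedService",
      "type": "LinkedServiceReference"
    },
    "parameters": {
      "filePath": { "type": "String" }
    },
    "type": "Binary",
    "typeProperties": {
      "location": {
        "type": "SftpLocation",
        "folderPath": {
          "value": "@dataset().filePath",
          "type": "Expression"
        }
      }
    }
  }
}
```

With a definition like this, the copy activity would pass a value such as exampledir/yeardir/2015 for `filePath`.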

The sink is a binary dataset linked to Azure Data Lake Storage Gen2.

[Screenshot: sink file path]

Update:

The accepted answer is the solution. My problem was that the source dataset path, once retrieved and passed as a parameter, had a newline at the end. I used concat to clean this up, and it has worked ever since.
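One way to guard against a stray trailing newline of this kind is to trim the value where it is handed to the dataset. The fragment below is a sketch under assumptions: it uses the expression function `trim`, which is a different technique from the concat-based cleanup mentioned above, and both the pipeline parameter `sourceFolder` and the dataset name `SftpBinarySource` are hypothetical. Only the `filePath` parameter name comes from the question. It shows just the `inputs` portion of a copy activity.

```json
{
  "inputs": [
    {
      "referenceName": "SftpBinarySource",
      "type": "DatasetReference",
      "parameters": {
        "filePath": {
          "value": "@trim(pipeline().parameters.sourceFolder)",
          "type": "Expression"
        }
      }
    }
  ]
}
```

`trim` removes leading and trailing whitespace, which should include a trailing newline; it is worth checking the resolved input of the copy activity in the run's monitoring view to confirm the path no longer ends in a line break.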

Answer 1

Score: 0

Since specifying exampledir/yeardir/2015 worked for you and you want to copy all the folders in exampledir/yeardir, you can use the following procedure:

  • I used a Get Metadata activity to get the child items of the folder exampledir/yeardir/ (in my demonstration, the path is 'maindir/yeardir').

  • This returns all the year folders present. I used only 2020 and 2021 as an example.

  • Next, I used a single ForEach activity whose items value is the child items output of the Get Metadata activity, and placed the copy activity directly inside it.
@activity('Get Metadata1').output.childItems

  • Inside the ForEach is my Copy data activity. For both source and sink, I created a dataset parameter for the path. I gave the following dynamic content for the source path:
maindir/yeardir/@{item().name}

  • For the sink, I set the output directory as follows:
outputDir/@{item().name}

  • Since giving the path manually as exampledir/yeardir/2015 worked, we get the list of year folders with the Get Metadata activity, loop through each of them, and copy each folder with the source path exampledir/yeardir/<current_iteration_year_folder>.

  • With the sink path set this way, each folder is copied together with its contents.
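The steps above can be sketched as a single pipeline definition. This is an illustrative outline rather than the exact pipeline from the demonstration: the pipeline name, the dataset names `SourceBinary` and `SinkBinary`, and the dataset parameter name `filePath` are assumptions, and the store settings assume an SFTP source and an Azure Data Lake Storage Gen2 sink as described in the question. The activity name `Get Metadata1` and the three expressions are taken from the steps above.

```json
{
  "name": "CopyYearFolders",
  "properties": {
    "description": "Sketch only. Pipeline, dataset and parameter names are hypothetical.",
    "activities": [
      {
        "name": "Get Metadata1",
        "type": "GetMetadata",
        "typeProperties": {
          "dataset": {
            "referenceName": "SourceBinary",
            "type": "DatasetReference",
            "parameters": { "filePath": "maindir/yeardir" }
          },
          "fieldList": [ "childItems" ]
        }
      },
      {
        "name": "ForEach1",
        "type": "ForEach",
        "dependsOn": [
          { "activity": "Get Metadata1", "dependencyConditions": [ "Succeeded" ] }
        ],
        "typeProperties": {
          "items": {
            "value": "@activity('Get Metadata1').output.childItems",
            "type": "Expression"
          },
          "activities": [
            {
              "name": "Copy data1",
              "type": "Copy",
              "inputs": [
                {
                  "referenceName": "SourceBinary",
                  "type": "DatasetReference",
                  "parameters": {
                    "filePath": {
                      "value": "maindir/yeardir/@{item().name}",
                      "type": "Expression"
                    }
                  }
                }
              ],
              "outputs": [
                {
                  "referenceName": "SinkBinary",
                  "type": "DatasetReference",
                  "parameters": {
                    "filePath": {
                      "value": "outputDir/@{item().name}",
                      "type": "Expression"
                    }
                  }
                }
              ],
              "typeProperties": {
                "source": {
                  "type": "BinarySource",
                  "storeSettings": { "type": "SftpReadSettings", "recursive": true }
                },
                "sink": {
                  "type": "BinarySink",
                  "storeSettings": { "type": "AzureBlobFSWriteSettings" }
                }
              }
            }
          ]
        }
      }
    ]
  }
}
```

Setting `recursive` to true on the source is what lets each iteration pick up the month and date subfolders beneath the year folder, so no further Get Metadata nesting is needed.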

  • Posted by huangapple on January 9, 2023, 18:41:19