June 22, 2023 15:23:43

Get all folder names in subfolders in Azure Data Factory

Question

I have the following folder structure in Data Lake:

datasetname/fullload/year/month/day/hour/min/sec/data

I can't create an Azure Function or use Databricks; I can only use simple ADF activities.

I want to get the latest folder names from all subfolders of my parent folder (datasetname/fullload). I tried Get Metadata -> Set Variable and then a loop, but it still doesn't work.

I need to get the path of the latest folder in Blob Storage.

Thanks

Answer 1

Score: 1

Since you need the latest data and the folder names are mostly numeric, you can pick the greatest number at each folder level to find the latest data.
My files are laid out as shown in the image below:

[Image: Get all folder names in subfolders in Azure Data Factory]
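As a rough illustration of why the folder names are compared as numbers rather than as strings, here is a small Python sketch. It is not part of the ADF solution; `latest_folder` is a hypothetical helper and the month names are made-up sample data.

```python
def latest_folder(names):
    """Return the folder name with the greatest numeric value.

    Hypothetical helper: non-numeric names are ignored, and an empty
    result raises ValueError (from max on an empty list).
    """
    numeric = [n for n in names if n.isdigit()]
    return max(numeric, key=int)


months = ["2", "9", "10"]      # made-up sample folder names
print(max(months))             # string comparison, prints 9
print(latest_folder(months))   # numeric comparison, prints 10
```

With zero-padded names such as 02, 09, 10 both comparisons agree, so the distinction only matters when the padding is inconsistent.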

To find the greatest folder at each level, I used two pipelines. pipeline1 iterates, fetching child items until none exist. pipeline2 finds the maximum number in the list of subfolder names within a given folder.
The following is the pipeline JSON for pipeline1:

{
    "name": "pipeline1",
    "properties": {
        "activities": [
            {
                "name": "get path",
                "type": "Until",
                "dependsOn": [
                    {
                        "activity": "Set flag",
                        "dependencyConditions": [
                            "Succeeded"
                        ]
                    }
                ],
                "userProperties": [],
                "typeProperties": {
                    "expression": {
                        "value": "@equals(variables('flag'),'true')",
                        "type": "Expression"
                    },
                    "activities": [
                        {
                            "name": "Get sub folders",
                            "type": "GetMetadata",
                            "dependsOn": [],
                            "policy": {
                                "timeout": "0.12:00:00",
                                "retry": 0,
                                "retryIntervalInSeconds": 30,
                                "secureOutput": false,
                                "secureInput": false
                            },
                            "userProperties": [],
                            "typeProperties": {
                                "dataset": {
                                    "referenceName": "root",
                                    "type": "DatasetReference",
                                    "parameters": {
                                        "path": {
                                            "value": "@variables('path')",
                                            "type": "Expression"
                                        }
                                    }
                                },
                                "fieldList": [
                                    "childItems"
                                ],
                                "storeSettings": {
                                    "type": "AzureBlobFSReadSettings",
                                    "enablePartitionDiscovery": false
                                },
                                "formatSettings": {
                                    "type": "DelimitedTextReadSettings"
                                }
                            }
                        },
                        {
                            "name": "If Condition1",
                            "type": "IfCondition",
                            "dependsOn": [
                                {
                                    "activity": "Get sub folders",
                                    "dependencyConditions": [
                                        "Succeeded"
                                    ]
                                }
                            ],
                            "userProperties": [],
                            "typeProperties": {
                                "expression": {
                                    "value": "@greater(length(activity('Get sub folders').output.childItems),0)",
                                    "type": "Expression"
                                },
                                "ifFalseActivities": [
                                    {
                                        "name": "Set variable1",
                                        "type": "SetVariable",
                                        "dependsOn": [],
                                        "policy": {
                                            "timeout": "0.12:00:00",
                                            "retry": 0,
                                            "retryIntervalInSeconds": 30,
                                            "secureOutput": false,
                                            "secureInput": false
                                        },
                                        "userProperties": [],
                                        "typeProperties": {
                                            "variableName": "flag",
                                            "value": {
                                                "value": "true",
                                                "type": "Expression"
                                            }
                                        }
                                    }
                                ],
                                "ifTrueActivities": [
                                    {
                                        "name": "get latest",
                                        "type": "ExecutePipeline",
                                        "dependsOn": [],
                                        "userProperties": [],
                                        "typeProperties": {
                                            "pipeline": {
                                                "referenceName": "pipeline2",
                                                "type": "PipelineReference"
                                            },
                                            "waitOnCompletion": true,
                                            "parameters": {
                                                "array_to_find_max": {
                                                    "value": "@activity('Get sub folders').output.childItems",
                                                    "type": "Expression"
                                                }
                                            }
                                        }
                                    },
                                    {
                                        "name": "append max to path",
                                        "type": "SetVariable",
                                        "dependsOn": [
                                            {
                                                "activity": "get latest",
                                                "dependencyConditions": [
                                                    "Succeeded"
                                                ]
                                            }
                                        ],
                                        "policy": {
                                            "timeout": "0.12:00:00",
                                            "retry": 0,
                                            "retryIntervalInSeconds": 30,
                                            "secureOutput": false,
                                            "secureInput": false
                                        },
                                        "userProperties": [],
                                        "typeProperties": {
                                            "variableName": "tp",
                                            "value": {
                                                "value": "@{variables('path')}/@{activity('get latest').output.pipelineReturnValue.max_val}",
                                                "type": "Expression"
                                            }
                                        }
                                    },
                                    {
                                        "name": "update path",
                                        "type": "SetVariable",
                                        "dependsOn": [
                                            {
                                                "activity": "append max to path",
                                                "dependencyConditions": [
                                                    "Succeeded"
                                                ]
                                            }
                                        ],
                                        "policy": {
                                            "timeout": "0.12:00:00",
                                            "retry": 0,
                                            "retryIntervalInSeconds": 30,
                                            "secureOutput": false,
                                            "secureInput": false
                                        },
                                        "userProperties": [],
                                        "typeProperties": {
                                            "variableName": "path",
                                            "value": {
                                                "value": "@variables('tp')",
                                                "type": "Expression"
                                            }
                                        }
                                    }
                                ]
                            }
                        }
                    ],
                    "timeout": "0.12:00:00"
                }
            },
            {
                "name": "Set flag",
                "type": "SetVariable",
                "dependsOn": [
                    {
                        "activity": "Set path",
                        "dependencyConditions": [
                            "Succeeded"
                        ]
                    }
                ],
                "policy": {
                    "timeout": "0.12:00:00",
                    "retry": 0,
                    "retryIntervalInSeconds": 30,
                    "secureOutput": false,
                    "secureInput": false
                },
                "userProperties": [],
                "typeProperties": {
                    "variableName": "flag",
                    "value": {
                        "value": "false",
                        "type": "Expression"
                    }
                }
            },
            {
                "name": "Set path",
                "type": "SetVariable",
                "dependsOn": [],
                "policy": {
                    "timeout": "0.12:00:00",
                    "retry": 0,
                    "retryIntervalInSeconds": 30,
                    "secureOutput": false,
                    "secureInput": false
                },
                "userProperties": [],
                "typeProperties": {
                    "variableName": "path",
                    "value": {
                        "value": "data/f1/ff1",
                        "type": "Expression"
                    }
                }
            }
        ],
        "variables": {
            "path": {
                "type": "String"
            },
            "flag": {
                "type": "String"
            },
            "values": {
                "type": "Array"
            },
            "max_val": {
                "type": "String"
            },
            "tp": {
                "type": "String"
            }
        },
        "annotations": []
    }
}
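To make the control flow of pipeline1 easier to follow, here is a minimal Python sketch of the same loop run against an in-memory folder tree. It is only a local simulation, not ADF code: `TREE`, `get_child_items`, and `latest_path` are hypothetical names, the sample folders are made up, and the child pipeline is stood in for by a plain `max` over the names.

```python
# Made-up folder tree: each key is a folder name, an empty dict has no children.
TREE = {
    "data": {"f1": {"ff1": {
        "2022": {"12": {"31": {}}},
        "2023": {"05": {"09": {}, "17": {}},
                 "06": {"01": {}, "22": {}}},
    }}}
}


def get_child_items(tree, path):
    """Hypothetical stand-in for Get Metadata with fieldList ['childItems']."""
    node = tree
    for part in path.split("/"):
        node = node[part]
    return list(node)


def latest_path(tree, start):
    """Descend from start, appending the greatest numeric child each time."""
    path = start
    done = False                      # plays the role of the 'flag' variable
    while not done:                   # Until: repeat until flag equals 'true'
        children = get_child_items(tree, path)
        if children:                  # If Condition1: length(childItems) > 0
            newest = max(children, key=int)   # stands in for pipeline2
            path = f"{path}/{newest}"         # 'append max to path' + 'update path'
        else:
            done = True               # 'Set variable1' sets the flag
    return path


print(latest_path(TREE, "data/f1/ff1"))  # prints data/f1/ff1/2023/06/22
```

In the pipeline the new path goes through the temporary variable tp before being copied into path; the sketch does not need that step because Python allows a variable to be reassigned from its own value.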

The following is the pipeline JSON for pipeline2:

{
    "name": "pipeline2",
    "properties": {
        "activities": [
            {
                "name": "make array of values",
                "type": "ForEach",
                "dependsOn": [],
                "userProperties": [],
                "typeProperties": {
                    "items": {
                        "value": "@pipeline().parameters.array_to_find_max",
                        "type": "Expression"
                    },
                    "isSequential": true,
                    "activities": [
                        {
                            "name": "append each value",
                            "type": "AppendVariable",
                            "dependsOn": [],
                            "userProperties": [],
                            "typeProperties": {
                                "variableName": "values",
                                "value": {
                                    "value": "@int(item().name)",
                                    "type": "Expression"
                                }
                            }
                        }
                    ]
                }
            },
            {
                "name": "return max",
                "type": "SetVariable",
                "dependsOn": [
                    {
                        "activity": "make array of values",
                        "dependencyConditions": [
                            "Succeeded"
                        ]
                    }
                ],
                "policy": {
                    "timeout": "0.12:00:00",
                    "retry": 0,
                    "retryIntervalInSeconds": 30,
                    "secureOutput": false,
                    "secureInput": false
                },
                "userProperties": [],
                "typeProperties": {
                    "variableName": "pipelineReturnValue",
                    "value": [
                        {
                            "key": "max_val",
                            "value": {
                                "type": "Expression",
                                "content": "@if(equals(length(string(max(variables('values')))),1),concat('0',string(max(variables('values')))),string(max(variables('values'))))"
                            }
                        }
                    ],
                    "setSystemVariable": true
                }
            }
        ],
        "parameters": {
            "array_to_find_max": {
                "type": "array",
                "defaultValue": [
                    {
                        "name": "2022",
                        "type": "Folder"
                    },
                    {
                        "name": "2023",
                        "type": "Folder"
                    }
                ]
            }
        },
        "variables": {
            "values": {
                "type": "Array"
            },
            "max_val": {
                "type": "String"
            },
            "tp": {
                "type": "String"
            }
        },
        "annotations": []
    }
}
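The return expression in pipeline2 is dense, so here is a hedged Python sketch of what it appears to compute: convert each child name to an integer, take the maximum, and put a leading zero back on single-digit results. `max_val` is a hypothetical function named after the return key, and the second input list is made-up sample data.

```python
def max_val(child_items):
    """Mirror of pipeline2's logic, as read from the JSON above.

    child_items is a list of dicts shaped like Get Metadata childItems,
    e.g. {"name": "2023", "type": "Folder"}.
    """
    # ForEach + Append Variable: @int(item().name)
    values = [int(item["name"]) for item in child_items]
    largest = str(max(values))
    # @if(equals(length(...),1), concat('0', ...), ...)
    return "0" + largest if len(largest) == 1 else largest


years = [{"name": "2022", "type": "Folder"}, {"name": "2023", "type": "Folder"}]
months = [{"name": "05", "type": "Folder"}, {"name": "06", "type": "Folder"}]
print(max_val(years))   # prints 2023
print(max_val(months))  # prints 06
```

Two limits follow from this logic: the padding only restores names that are two characters wide, and a non-numeric name (such as the data folder in the question's structure) cannot be converted to an integer, so it would presumably need separate handling.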

The following is the dataset configuration that I used for the Get Metadata activity. The initial value of path in my case is data/f1/ff1, and it is updated on each iteration (the greatest folder name is appended):

[Image: Get all folder names in subfolders in Azure Data Factory]

When I run this pipeline, I get the desired result. After the Until loop stops, the variable path holds the required path, i.e., the path to the latest data:

[Image: Get all folder names in subfolders in Azure Data Factory]
