Get all folder names in subfolders in Azure Data Factory
Question
I have the following folder structure in Data Lake:
datasetname/fullload/year/month/day/hour/min/sec/data
I can't create an Azure Function or use Databricks; I can only use simple ADF activities.
I want to get the latest folder names from all subfolders of my parent directory (datasetname/fullload). I tried Get Metadata -> Set Variable followed by a loop, but it still doesn't work.
I need to get the path of the latest folder in Blob Storage.
Thanks.
Answer 1
Score: 1
- Since you need the latest data and the folder names are mostly numbers, you can pick the greatest number at each subfolder level to find the latest data.
- I have file data as shown in the image below:
- To find the greatest folder, I used 2 pipelines:
  - pipeline1 iterates and gets child items until no child items exist.
  - pipeline2 finds the maximum number in the list of sub-folder names in a particular folder.
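The overall idea can be sketched outside ADF as a simple descent: at each level, take the numerically greatest child name, append it to the path, and stop when a folder has no children. The snippet below is only an illustration under assumptions: the nested dictionary `lake` and the function `latest_path` are hypothetical stand-ins for the storage account and the two pipelines, and the sketch reuses the original folder name instead of re-padding it the way pipeline2 does.

```python
from typing import Dict

# Hypothetical in-memory stand-in for the lake:
# each folder maps a child folder name to that child's own sub-tree.
Tree = Dict[str, "Tree"]


def latest_path(root: str, tree: Tree) -> str:
    """Walk down from root, always following the numerically greatest child."""
    path, node = root, tree
    while node:  # plays the role of the Until loop: stop when there are no child items
        newest = max(node, key=int)  # plays the role of pipeline2: compare names as integers
        path = f"{path}/{newest}"    # plays the role of "append max to path"
        node = node[newest]
    return path


# Made-up sample layout (year/month/day), not the data from the answer.
lake: Tree = {
    "2022": {"12": {"31": {}}},
    "2023": {"01": {"05": {}, "17": {}}, "02": {"03": {}}},
}

print(latest_path("data/f1/ff1", lake))  # prints data/f1/ff1/2023/02/03
```

Comparing names as integers rather than as strings matters when names at the same level are not padded to the same width.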
- The following is the pipeline JSON for pipeline1:
{
"name": "pipeline1",
"properties": {
"activities": [
{
"name": "get path",
"type": "Until",
"dependsOn": [
{
"activity": "Set flag",
"dependencyConditions": [
"Succeeded"
]
}
],
"userProperties": [],
"typeProperties": {
"expression": {
"value": "@equals(variables('flag'),'true')",
"type": "Expression"
},
"activities": [
{
"name": "Get sub folders",
"type": "GetMetadata",
"dependsOn": [],
"policy": {
"timeout": "0.12:00:00",
"retry": 0,
"retryIntervalInSeconds": 30,
"secureOutput": false,
"secureInput": false
},
"userProperties": [],
"typeProperties": {
"dataset": {
"referenceName": "root",
"type": "DatasetReference",
"parameters": {
"path": {
"value": "@variables('path')",
"type": "Expression"
}
}
},
"fieldList": [
"childItems"
],
"storeSettings": {
"type": "AzureBlobFSReadSettings",
"enablePartitionDiscovery": false
},
"formatSettings": {
"type": "DelimitedTextReadSettings"
}
}
},
{
"name": "If Condition1",
"type": "IfCondition",
"dependsOn": [
{
"activity": "Get sub folders",
"dependencyConditions": [
"Succeeded"
]
}
],
"userProperties": [],
"typeProperties": {
"expression": {
"value": "@greater(length(activity('Get sub folders').output.childItems),0)",
"type": "Expression"
},
"ifFalseActivities": [
{
"name": "Set variable1",
"type": "SetVariable",
"dependsOn": [],
"policy": {
"timeout": "0.12:00:00",
"retry": 0,
"retryIntervalInSeconds": 30,
"secureOutput": false,
"secureInput": false
},
"userProperties": [],
"typeProperties": {
"variableName": "flag",
"value": {
"value": "true",
"type": "Expression"
}
}
}
],
"ifTrueActivities": [
{
"name": "get latest",
"type": "ExecutePipeline",
"dependsOn": [],
"userProperties": [],
"typeProperties": {
"pipeline": {
"referenceName": "pipeline2",
"type": "PipelineReference"
},
"waitOnCompletion": true,
"parameters": {
"array_to_find_max": {
"value": "@activity('Get sub folders').output.childItems",
"type": "Expression"
}
}
}
},
{
"name": "append max to path",
"type": "SetVariable",
"dependsOn": [
{
"activity": "get latest",
"dependencyConditions": [
"Succeeded"
]
}
],
"policy": {
"timeout": "0.12:00:00",
"retry": 0,
"retryIntervalInSeconds": 30,
"secureOutput": false,
"secureInput": false
},
"userProperties": [],
"typeProperties": {
"variableName": "tp",
"value": {
"value": "@{variables('path')}/@{activity('get latest').output.pipelineReturnValue.max_val}",
"type": "Expression"
}
}
},
{
"name": "update path",
"type": "SetVariable",
"dependsOn": [
{
"activity": "append max to path",
"dependencyConditions": [
"Succeeded"
]
}
],
"policy": {
"timeout": "0.12:00:00",
"retry": 0,
"retryIntervalInSeconds": 30,
"secureOutput": false,
"secureInput": false
},
"userProperties": [],
"typeProperties": {
"variableName": "path",
"value": {
"value": "@variables('tp')",
"type": "Expression"
}
}
}
]
}
}
],
"timeout": "0.12:00:00"
}
},
{
"name": "Set flag",
"type": "SetVariable",
"dependsOn": [
{
"activity": "Set path",
"dependencyConditions": [
"Succeeded"
]
}
],
"policy": {
"timeout": "0.12:00:00",
"retry": 0,
"retryIntervalInSeconds": 30,
"secureOutput": false,
"secureInput": false
},
"userProperties": [],
"typeProperties": {
"variableName": "flag",
"value": {
"value": "false",
"type": "Expression"
}
}
},
{
"name": "Set path",
"type": "SetVariable",
"dependsOn": [],
"policy": {
"timeout": "0.12:00:00",
"retry": 0,
"retryIntervalInSeconds": 30,
"secureOutput": false,
"secureInput": false
},
"userProperties": [],
"typeProperties": {
"variableName": "path",
"value": {
"value": "data/f1/ff1",
"type": "Expression"
}
}
}
],
"variables": {
"path": {
"type": "String"
},
"flag": {
"type": "String"
},
"values": {
"type": "Array"
},
"max_val": {
"type": "String"
},
"tp": {
"type": "String"
}
},
"annotations": []
}
}
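One thing to watch: the childItems output of Get Metadata lists files as well as folders, each with a name and a type, and pipeline2 applies int(item().name) to every entry. With the layout from the question, the last level is a folder named data, so a non-numeric name would presumably make that conversion fail. The sketch below shows the kind of filtering that could be applied first; the function `numeric_folders` is hypothetical and is not part of the pipelines above. In ADF, a Filter activity placed before the Execute Pipeline call might serve the same purpose, but that is an assumption and not something the answer tested.

```python
from typing import Dict, List


def numeric_folders(child_items: List[Dict[str, str]]) -> List[Dict[str, str]]:
    """Keep only entries that are folders and whose names are plain ASCII digits."""
    return [
        item
        for item in child_items
        if item.get("type") == "Folder"
        and item.get("name", "").isascii()
        and item.get("name", "").isdigit()
    ]


# Made-up sample in the same shape as the childItems output.
sample = [
    {"name": "2023", "type": "Folder"},
    {"name": "data", "type": "Folder"},
    {"name": "07", "type": "Folder"},
    {"name": "1.csv", "type": "File"},
]

print(numeric_folders(sample))
```

If the filtered list is empty, the loop can treat the current folder as the deepest level and stop there.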
- The following is the pipeline JSON for pipeline2:
{
"name": "pipeline2",
"properties": {
"activities": [
{
"name": "make array of values",
"type": "ForEach",
"dependsOn": [],
"userProperties": [],
"typeProperties": {
"items": {
"value": "@pipeline().parameters.array_to_find_max",
"type": "Expression"
},
"isSequential": true,
"activities": [
{
"name": "append each value",
"type": "AppendVariable",
"dependsOn": [],
"userProperties": [],
"typeProperties": {
"variableName": "values",
"value": {
"value": "@int(item().name)",
"type": "Expression"
}
}
}
]
}
},
{
"name": "return max",
"type": "SetVariable",
"dependsOn": [
{
"activity": "make array of values",
"dependencyConditions": [
"Succeeded"
]
}
],
"policy": {
"timeout": "0.12:00:00",
"retry": 0,
"retryIntervalInSeconds": 30,
"secureOutput": false,
"secureInput": false
},
"userProperties": [],
"typeProperties": {
"variableName": "pipelineReturnValue",
"value": [
{
"key": "max_val",
"value": {
"type": "Expression",
"content": "@if(equals(length(string(max(variables('values')))),1),concat('0',string(max(variables('values')))),string(max(variables('values'))))"
}
}
],
"setSystemVariable": true
}
}
],
"parameters": {
"array_to_find_max": {
"type": "array",
"defaultValue": [
{
"name": "2022",
"type": "Folder"
},
{
"name": "2023",
"type": "Folder"
}
]
}
},
"variables": {
"values": {
"type": "Array"
},
"max_val": {
"type": "String"
},
"tp": {
"type": "String"
}
},
"annotations": []
}
}
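The expression in the return max activity converts the maximum back to a string and adds a leading zero when the result is a single digit, so that a folder such as 07 is addressed as 07 and not as 7. A rough Python equivalent is given below for reference; the function name `max_val` simply mirrors the return key and is otherwise an illustration.

```python
from typing import List


def max_val(names: List[str]) -> str:
    """Mirror the ADF expression: take the integer maximum, pad one digit to two."""
    largest = str(max(int(name) for name in names))
    # Same rule as the pipeline: only a single-digit result gets a leading zero.
    return "0" + largest if len(largest) == 1 else largest


print(max_val(["2022", "2023"]))  # prints 2023
print(max_val(["01", "07"]))      # prints 07
```

This rule assumes folder names are either unpadded or padded to exactly two digits. A name such as 007 would come back as 07 and no longer match the real folder, so wider padding would need a different rule.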
- The following is the dataset configuration that I used for the Get Metadata activity. The initial value of path in my case is data/f1/ff1, and its value is updated on each iteration (the greatest folder name is appended):
- When I run this pipeline, I get the desired result. After the Until loop stops, the variable path holds the required path, i.e., the path to the latest data: