Create new Databricks cluster from ADF linked service with InitScripts from ABFSS (Azure Blob)


Question

Recently, Databricks deprecated the DBFS init script, and I attempted to configure a Databricks linked service with InitScripts from ABFSS in ADF. However, I encountered a "file not found" error.

The new cluster configuration is as follows:

[screenshot: new cluster configuration in the ADF linked service]

I was able to achieve the desired result in Databricks because there is an option to configure the init script location type:

[screenshot: init script location type option in the Databricks cluster UI]

However, in ADF, I couldn't find a similar option:

[screenshot: ADF linked service cluster settings, with no init script location type option]

Please assist me in resolving this issue. I need to create a new Databricks cluster with an init script read from Azure Blob Storage (ABFSS) for every pipeline execution.
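For context, init scripts for a new job cluster are declared on the linked service itself. Below is a minimal sketch of the relevant part of an Azure Databricks linked service definition, assuming the newCluster* property names from the AzureDatabricks linked service schema; the domain, access token, and script path are placeholders:

{
    "name": "AzureDatabricksLinkedService",
    "properties": {
        "type": "AzureDatabricks",
        "typeProperties": {
            "domain": "https://<region>.azuredatabricks.net",
            "accessToken": {
                "type": "SecureString",
                "value": "<access-token>"
            },
            "newClusterVersion": "12.2.x-scala2.12",
            "newClusterNodeType": "Standard_D3_v2",
            "newClusterNumOfWorker": "2",
            "newClusterInitScripts": [
                "abfss://<container>@<storage-account>.dfs.core.windows.net/scripts/init.sh"
            ]
        }
    }
}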


Answer 1

Score: 1

  • You can use a REST call to create the cluster, with the init script on ABFSS as required, and then use this cluster in the Databricks notebook activity directly.
  • You can use the Web activity to call the Clusters 2.0 REST API to create a cluster, as specified in this document, with an authentication header in which you specify a bearer token (access token). The following is the body that you can use; you might have to add further cluster configuration as well (the abfss form of init_scripts is sketched after these steps):
{
    "num_workers": null,
    "autoscale": {
        "min_workers": 2,
        "max_workers": 8
    },
    "cluster_name": "cluster1",
    "spark_version": "7.3.x-scala2.12",
    "spark_conf": {},
    "node_type_id": "Standard_D3_v2",
    "custom_tags": {},
    "spark_env_vars": {
        "PYSPARK_PYTHON": "/databricks/python3/bin/python3"
    },
    "autotermination_minutes": 120,
    "init_scripts": <script_path_here_as_specified_in_document>
}


  • Use a Wait activity until the cluster is created. I have set the wait time to 300 seconds (a polling alternative is sketched after these steps).

  • Finally, use the cluster_id returned by the Web activity for the newly created cluster to run the notebook, as @activity('Web1').output.cluster_id.
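For reference, the init_scripts placeholder in the body above is an array of script-location objects in the Clusters API; a minimal sketch of the abfss form, with the container, storage account, and script path as placeholders:

"init_scripts": [
    {
        "abfss": {
            "destination": "abfss://<container>@<storage-account>.dfs.core.windows.net/scripts/init.sh"
        }
    }
]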

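Instead of a fixed wait, you could poll the cluster state until it is RUNNING by wrapping a second Web activity in an Until activity that calls the Clusters 2.0 get endpoint. A rough sketch of the pipeline fragment, assuming the create activity is named Web1 and a hypothetical inner activity named GetClusterState:

{
    "name": "UntilClusterRunning",
    "type": "Until",
    "typeProperties": {
        "expression": {
            "value": "@equals(activity('GetClusterState').output.state, 'RUNNING')",
            "type": "Expression"
        },
        "timeout": "0.00:30:00",
        "activities": [
            {
                "name": "GetClusterState",
                "type": "WebActivity",
                "typeProperties": {
                    "url": {
                        "value": "https://<region>.azuredatabricks.net/api/2.0/clusters/get?cluster_id=@{activity('Web1').output.cluster_id}",
                        "type": "Expression"
                    },
                    "method": "GET",
                    "headers": {
                        "Authorization": "Bearer <access-token>"
                    }
                }
            }
        ]
    }
}

In practice you would also put a short Wait activity inside the loop so the get endpoint is not called back-to-back between polls.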

Answer 2

Score: 0


If you specified the init script on abfss://..., then you need to specify the corresponding spark.hadoop.fs.* configurations in the "Cluster Spark conf" section (fill in the <...> placeholders with the corresponding values):

  • spark.hadoop.fs.azure.account.auth.type set to OAuth
  • spark.hadoop.fs.azure.account.oauth.provider.type set to org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider
  • spark.hadoop.fs.azure.account.oauth2.client.endpoint set to https://login.microsoftonline.com/<tenant_id>/oauth2/token
  • spark.hadoop.fs.azure.account.oauth2.client.id set to <client_id>
  • spark.hadoop.fs.azure.account.oauth2.client.secret set to {{secrets/<secret-scope>/<secret-key-name>}} (it will fetch the key from the secret scope that contains the client secret)
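If the cluster is created by the ADF linked service (as in the original question), these same entries would go under the linked service's Spark configuration property; a minimal sketch, assuming the newClusterSparkConf property name from the AzureDatabricks linked service schema, with tenant and client values as placeholders:

"newClusterSparkConf": {
    "spark.hadoop.fs.azure.account.auth.type": "OAuth",
    "spark.hadoop.fs.azure.account.oauth.provider.type": "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
    "spark.hadoop.fs.azure.account.oauth2.client.endpoint": "https://login.microsoftonline.com/<tenant_id>/oauth2/token",
    "spark.hadoop.fs.azure.account.oauth2.client.id": "<client_id>",
    "spark.hadoop.fs.azure.account.oauth2.client.secret": "{{secrets/<secret-scope>/<secret-key-name>}}"
}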
