Create new Databricks cluster from ADF linked service with InitScripts from ABFSS (Azure Blob)


Question

Recently, Databricks deprecated init scripts stored on DBFS, so I attempted to configure a Databricks linked service in ADF with init scripts from ABFSS. However, I encountered a "file not found" error.

The new cluster configuration is as follows:

[Screenshot: the new job cluster configuration in the ADF linked service]
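Roughly, the relevant part of the linked service JSON looks like the sketch below (this assumes the newClusterInitScripts property of the AzureDatabricks linked service schema; the workspace URL, storage account, container, and script path in <...> are placeholders):

{
    "domain": "https://<databricks-instance>.azuredatabricks.net",
    "newClusterVersion": "7.3.x-scala2.12",
    "newClusterNodeType": "Standard_D3_v2",
    "newClusterNumOfWorker": "2:8",
    "newClusterInitScripts": [
        "abfss://<container>@<storage-account>.dfs.core.windows.net/scripts/init.sh"
    ]
}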

I was able to achieve the desired result in Databricks because there is an option to configure the init script location type:

[Screenshot: the init script location type selector in the Databricks cluster UI]

However, in ADF, I couldn't find a similar option:

[Screenshot: the ADF linked service settings, which have no init script location type option]

Please assist me in resolving this issue. I need to create a new Databricks cluster with an init script read from Azure Blob Storage (ABFSS) for every pipeline execution.


Answer 1

Score: 1

  • You can use a REST call to create the cluster with the init script from ABFSS as required, and then use this cluster directly in the Databricks notebook activity.
  • You can use the Web activity to call the Clusters 2.0 REST API to create a cluster as specified in this document, with an authentication header in which you specify a bearer token (access token). The following is the body that you can use (you might have to add further cluster configuration as well):
{
    "num_workers": null,
    "autoscale": {
        "min_workers": 2,
        "max_workers": 8
    },
    "cluster_name": "cluster1",
    "spark_version": "7.3.x-scala2.12",
    "spark_conf": {},
    "node_type_id": "Standard_D3_v2",
    "custom_tags": {},
    "spark_env_vars": {
        "PYSPARK_PYTHON": "/databricks/python3/bin/python3"
    },
    "autotermination_minutes": 120,
    "init_scripts": <script_path_here_as_specified_in_document>
}
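Per the Clusters API reference, the init_scripts placeholder above is an array of script objects; for a script on ABFSS it would typically take a shape like this sketch (storage account, container, and file name are placeholders):

"init_scripts": [
    {
        "abfss": {
            "destination": "abfss://<container>@<storage-account>.dfs.core.windows.net/scripts/init.sh"
        }
    }
]

Note that the cluster also needs the storage credentials described in Answer 2 below (the spark.hadoop.fs.* Spark conf) to be able to read the script from ABFSS.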

[Screenshot: the Web activity configured to call the Clusters 2.0 create API]

  • Use a Wait activity until the cluster is created. I have used a wait time of 300 seconds.

  • Finally, use the cluster_id returned by the Web activity for the newly created cluster to run the notebook, as @activity('Web1').output.cluster_id.

[Screenshot: the pipeline with the Web, Wait, and Notebook activities]
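For reference, the Web activity's JSON might look roughly like the sketch below (the workspace URL and access token are placeholders, and the body is the cluster JSON shown above):

{
    "name": "Web1",
    "type": "WebActivity",
    "typeProperties": {
        "url": "https://<databricks-instance>.azuredatabricks.net/api/2.0/clusters/create",
        "method": "POST",
        "headers": {
            "Authorization": "Bearer <databricks-access-token>"
        },
        "body": "<the cluster JSON from above>"
    }
}

The create call returns the cluster_id immediately while the cluster is still starting, which is why the Wait activity is needed before the notebook runs.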

Answer 2

Score: 0

If you specified the init script on abfss://..., then you need to specify the corresponding spark.hadoop.fs.* configurations in the "Cluster Spark conf" section (fill in the data in <...> with the corresponding values):

  • spark.hadoop.fs.azure.account.auth.type set to OAuth
  • spark.hadoop.fs.azure.account.oauth.provider.type set to org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider
  • spark.hadoop.fs.azure.account.oauth2.client.endpoint set to https://login.microsoftonline.com/<tenant_id>/oauth2/token
  • spark.hadoop.fs.azure.account.oauth2.client.id set to <client_id>
  • spark.hadoop.fs.azure.account.oauth2.client.secret set to {{secrets/<secret-scope>/<secret-key-name>}} (it will fetch the key from the secret scope that contains the client secret)
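Assembled as the equivalent spark_conf fragment of the cluster definition (values in <...> are placeholders), this would look like:

"spark_conf": {
    "spark.hadoop.fs.azure.account.auth.type": "OAuth",
    "spark.hadoop.fs.azure.account.oauth.provider.type": "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
    "spark.hadoop.fs.azure.account.oauth2.client.endpoint": "https://login.microsoftonline.com/<tenant_id>/oauth2/token",
    "spark.hadoop.fs.azure.account.oauth2.client.id": "<client_id>",
    "spark.hadoop.fs.azure.account.oauth2.client.secret": "{{secrets/<secret-scope>/<secret-key-name>}}"
}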
