How to set up authorization of Delta Live Tables to access Azure Data Lake files?
Question
I am writing Delta Live Tables notebooks in SQL to access files from the data lake, something like this:
CREATE OR REFRESH STREAMING LIVE TABLE MyTable
AS SELECT * FROM cloud_files("DataLakeSource/MyTableFiles", "parquet", map("cloudFiles.inferColumnTypes", "true"))
Whenever I need to access Azure Data Lake, I usually do something like this to set up the access:
service_credential = dbutils.secrets.get(scope="myscope", key="mykey")
spark.conf.set("fs.azure.account.auth.type.mylake.dfs.core.windows.net", "OAuth")
spark.conf.set("fs.azure.account.oauth.provider.type.mylake.dfs.core.windows.net", "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider")
spark.conf.set("fs.azure.account.oauth2.client.id.MyLake.dfs.core.windows.net", "99999999-9999-9999-9999-999999999")
spark.conf.set("fs.azure.account.oauth2.client.secret.mylake.dfs.core.windows.net", service_credential)
spark.conf.set("fs.azure.account.oauth2.client.endpoint.mylake.dfs.core.windows.net", "https://login.microsoftonline.com/99999999-9999-9999-9999-9999999999/oauth2/token")
Since I can't add a Python cell like the one above to set up the access inside a SQL Delta Live Tables notebook, how/where do I add the configuration for access to the data lake files?
I've thought about adding config info to the pipeline under Configuration, but that of course won't work with the call to dbutils.secrets.get.
Answer 1
Score: 1
You can create a separate notebook with the connection information in it and call it first, then call the SQL Delta Live Tables notebook.
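For illustration, here is a minimal sketch of what such a separate Python configuration notebook might contain, reusing the placeholder scope, key, and storage account names from the question; how the two notebooks are ordered or attached to the pipeline is left to your setup:

# Python notebook whose only job is to set the ADLS OAuth configuration.
# Scope, key, and storage account names are placeholders from the question.
service_credential = dbutils.secrets.get(scope="myscope", key="mykey")

storage_account = "mylake"
configs = {
    f"fs.azure.account.auth.type.{storage_account}.dfs.core.windows.net": "OAuth",
    f"fs.azure.account.oauth.provider.type.{storage_account}.dfs.core.windows.net":
        "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
    f"fs.azure.account.oauth2.client.id.{storage_account}.dfs.core.windows.net":
        "99999999-9999-9999-9999-999999999",
    f"fs.azure.account.oauth2.client.secret.{storage_account}.dfs.core.windows.net": service_credential,
    f"fs.azure.account.oauth2.client.endpoint.{storage_account}.dfs.core.windows.net":
        "https://login.microsoftonline.com/99999999-9999-9999-9999-9999999999/oauth2/token",
}

# Apply every property on the pipeline's Spark session.
for key, value in configs.items():
    spark.conf.set(key, value)

The dict plus loop is only a convenience so all properties live in one place; individual spark.conf.set calls as in the question work just as well.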
Answer 2
Score: 0
When creating your Delta Live Tables pipeline, use two notebooks:
- The SQL notebook with the CREATE OR REFRESH STREAMING LIVE TABLE MyTable definition
- The Python notebook with the service_credential and fs.azure.account properties
The DLT runtime should be able to resolve the order of the notebooks and fire up authorization.
Alex Ott's comment seems correct:
> You need to provide this configuration as part of the pipeline definition.
There'd be no dependency between the two notebooks (one with SQL and the other with spark.conf.set calls or even SET statements), so the DLT runtime couldn't choose one over the other as the first to execute and hence set the properties.
What's even more interesting (and something I didn't really know about while answering this question) is the following, found in Configure pipeline settings for Delta Live Tables:
> You can configure most settings with either the UI or a JSON specification. Some advanced options are only available using the JSON configuration.
And then in Configure your compute settings:
> Compute settings in the Delta Live Tables UI primarily target the default cluster used for pipeline updates. If you choose to specify a storage location that requires credentials for data access, you must ensure that the maintenance cluster also has these permissions configured.
> Delta Live Tables provides similar options for cluster settings as other compute on Databricks. Like other pipeline settings, you can modify the JSON configuration for clusters to specify options not present in the UI.
In other words, you have to use the Delta Live Tables API or a similar tool (such as the Databricks Terraform provider) that gives you access to cluster-related settings.
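As a purely illustrative, unverified sketch of what that JSON could look like, the relevant fragment of the pipeline settings is written below as a Python dict; the spark.hadoop. prefix, the cluster labels, and the {{secrets/<scope>/<key>}} secret-reference syntax are assumptions to double-check against the Databricks documentation for your workspace:

# Hypothetical fragment of a DLT pipeline definition (the JSON you would edit via
# the API, Terraform, or the UI's JSON view), expressed here as a Python dict.
# Account, scope, and key names are the placeholders from the question.
pipeline_settings_fragment = {
    "clusters": [
        {
            "label": "default",
            "spark_conf": {
                "spark.hadoop.fs.azure.account.auth.type.mylake.dfs.core.windows.net": "OAuth",
                "spark.hadoop.fs.azure.account.oauth.provider.type.mylake.dfs.core.windows.net":
                    "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
                "spark.hadoop.fs.azure.account.oauth2.client.id.mylake.dfs.core.windows.net":
                    "99999999-9999-9999-9999-999999999",
                # Secret reference instead of a literal value, so no dbutils.secrets.get is needed
                "spark.hadoop.fs.azure.account.oauth2.client.secret.mylake.dfs.core.windows.net":
                    "{{secrets/myscope/mykey}}",
                "spark.hadoop.fs.azure.account.oauth2.client.endpoint.mylake.dfs.core.windows.net":
                    "https://login.microsoftonline.com/99999999-9999-9999-9999-9999999999/oauth2/token",
            },
        },
        {
            # Per the quoted docs, the maintenance cluster needs the same data-access permissions
            "label": "maintenance",
            "spark_conf": {
                "spark.hadoop.fs.azure.account.auth.type.mylake.dfs.core.windows.net": "OAuth",
                # ... same properties as above ...
            },
        },
    ],
}

Whether these keys belong under each cluster's spark_conf or in the pipeline-level configuration map is worth verifying against the current docs; the sketch only illustrates the shape of the JSON.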
Configure S3 access with instance profiles
Another option seems to be Configure S3 access with instance profiles, which requires that you "have sufficient privileges in the AWS account containing your Databricks workspace, and be a Databricks workspace administrator."
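For completeness, and again only as an unverified sketch in the same Python-dict form, attaching an instance profile to the pipeline's clusters might look roughly like this; the ARN is a made-up placeholder and the aws_attributes field is assumed to behave as it does for ordinary cluster settings:

# Hypothetical clusters fragment for the instance-profile route on AWS.
# The ARN below is a placeholder, not a real instance profile.
s3_access_fragment = {
    "clusters": [
        {
            "label": "default",
            "aws_attributes": {
                "instance_profile_arn": "arn:aws:iam::123456789012:instance-profile/dlt-s3-access"
            },
        },
        {
            "label": "maintenance",
            "aws_attributes": {
                "instance_profile_arn": "arn:aws:iam::123456789012:instance-profile/dlt-s3-access"
            },
        },
    ],
}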