How to install a Maven library on a Databricks job cluster


Question

Our Databricks application uses multiple Python and Maven packages. Using the UI we can install the Maven packages without any issue, but as we are moving to a job cluster we are finding it difficult to install them.

Our job is scheduled via ADF. One option is to add the libraries in the ADF pipeline, but that would be too many changes on the ADF side and we don't want to do this.

We would like to call a single notebook that installs all the required libraries. We are able to install the Python libraries but are having issues installing the Maven libraries.

Any help will be really helpful.

Answer 1

Score: 1

So I'm not aware of a way to install Maven libraries via a notebook, but there are other ways to achieve what you need. There are three options for doing this:

  1. A global init script at the workspace level. This affects all clusters in the workspace.
  2. An init script localized to a linked service, so all ADF pipelines that need the same libraries can call the same linked service if the notebooks are on the same ADB workspace.
  3. Libraries listed out at the individual pipeline level. This provides a lot of flexibility but takes additional effort to include the list of libraries in each and every pipeline. These are installed over and above the libraries mentioned in the global and linked-service scripts.

Since you want minimum changes to ADF, I would suggest going with option 1 or 2.
For option 1:
Download the Maven jar from the Maven repository.
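For example, assuming you need the SQL Server connector jar used later in this answer's init script, a fetch from Maven Central could look like the sketch below (the coordinates and version are just that example; substitute the artifact your application needs):

# Fetch the jar from Maven Central (artifact and version are examples);
# the output filename matches the one the init script below expects
curl -L -o spark_mssql_connector_2_12_1_2_0.jar \
  https://repo1.maven.org/maven2/com/microsoft/azure/spark-mssql-connector_2.12/1.2.0/spark-mssql-connector_2.12-1.2.0.jar
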
The admin of the workspace needs to enable DBFS browsing via the admin console. This is done under the Workspace settings within the Admin Settings: make sure the DBFS File browsing option is set to Enabled.
Next, create a new folder called jars within /dbfs/FileStore/tables/, so the path will be /dbfs/FileStore/tables/jars.
Click the 'Data' tab on the left panel, click the 'Browse DBFS' button, and then the 'Upload' button.
Make sure the DBFS target directory is set to /FileStore/tables/jars.
Then drag and drop the jar into the box provided and click Done.
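If you would rather script the upload than use the UI, a minimal sketch assuming the Databricks CLI is installed and configured against your workspace:

# Create the target folder in DBFS and copy the local jar into it
databricks fs mkdirs dbfs:/FileStore/tables/jars
databricks fs cp spark_mssql_connector_2_12_1_2_0.jar dbfs:/FileStore/tables/jars/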

Our init script should have a .sh suffix (example: init1.sh) and should have the below contents:

#!/bin/bash
# Install the required Python package on every node at cluster start
pip install msal
# Copy the Maven jar from DBFS onto the cluster classpath
cp /dbfs/FileStore/tables/jars/spark_mssql_connector_2_12_1_2_0.jar /databricks/jars/

The above script will install a Python library as well as copy the Maven jar to the /databricks/jars/ folder.
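
If you stage several Maven jars in that DBFS folder, a glob keeps the script short; this sketch assumes everything in the folder belongs on the cluster classpath:

# Copy every staged jar onto the classpath
cp /dbfs/FileStore/tables/jars/*.jar /databricks/jars/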

Load the script as a global init script. You can refer to the Databricks documentation on global init scripts for how to do that.
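
Creating the global init script can also be automated; a sketch using the Global Init Scripts API (2.0), where <workspace-url>, the script name, and the DATABRICKS_TOKEN environment variable are assumptions you would replace with your own values:

# Base64-encode the script (-w0 disables line wrapping on GNU base64)
# and register it as an enabled global init script
SCRIPT=$(base64 -w0 init1.sh)
curl -X POST "https://<workspace-url>/api/2.0/global-init-scripts" \
  -H "Authorization: Bearer $DATABRICKS_TOKEN" \
  -d "{\"name\": \"install-libs\", \"script\": \"$SCRIPT\", \"enabled\": true, \"position\": 0}"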

Now, every cluster, interactive or job, will use the same set of libraries.
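
To verify after the clusters restart, you can run a quick check from a %sh notebook cell; the grep pattern matches the example connector jar above:

# Confirm the jar landed on the classpath and the Python package installed
ls /databricks/jars/ | grep -i mssql
pip show msal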
