How do I connect to and write a CSV file to a remote instance of Databricks Apache Spark from Java?
Question
I'm trying to connect to a remote instance of Databricks and write a CSV file to a specific folder of the DBFS. I can find bits and pieces here and there, but I'm not seeing how to get this done. How do I add the file to DBFS on a remote Databricks instance from a Java program running on my local machine?
I'm currently using a community instance I created from here:
https://databricks.com/try-databricks
This is the URL for my instance (I'm guessing the "o=7823909094774610" part is what identifies my instance).
https://community.cloud.databricks.com/?o=7823909094774610
Here are some of the resources I'm looking at while trying to resolve this, but I'm still not able to get off the ground:
- The Databricks Connect documentation: This talks about connecting, but not specifically from Java. It gives an example of "connecting Eclipse" to Databricks, which seems to be the way to get the jar dependency for this (side question: is there a Maven version of this?). https://docs.databricks.com/dev-tools/databricks-connect.html#run-examples-from-your-ide
- Some Java sample code: Doesn't seem to have an example of connecting to a remote Databricks instance. https://www.programcreek.com/java-api-examples/index.php?api=org.apache.spark.sql.SparkSession
- Databricks File System (DBFS) documentation: Gives a good overview of file functions, but doesn't seem to talk specifically about how to connect from a remote Java application and write a file to the Databricks instance. https://docs.databricks.com/data/databricks-file-system.html
- FileStore documentation: Gives a good overview of the FileStore, but again doesn't seem to talk specifically about how to do this from a remote Java application. https://docs.databricks.com/data/filestore.html
Answer 1
Score: 2
You could take a look at the DBFS REST API, and consider using that in your Java application.
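For illustration, here is a minimal sketch of what calling the DBFS REST API from Java could look like, using Java 11's built-in HttpClient and the /api/2.0/dbfs/put endpoint. This assumes your workspace allows personal access tokens; the host, token, and file paths below are placeholders, and the JSON form of put only suits small files (larger uploads go through the create/add-block/close endpoints):

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Base64;

public class DbfsUpload {
    public static void main(String[] args) throws Exception {
        // Placeholders -- substitute your own workspace URL and token.
        String host = "https://community.cloud.databricks.com";
        String token = "<your personal access token>";

        // The JSON form of /api/2.0/dbfs/put takes the file contents
        // base64-encoded and is only suitable for small files (roughly
        // 1 MB); larger files need the streaming create/add-block/close calls.
        byte[] csv = Files.readAllBytes(Path.of("data.csv")); // hypothetical local file
        String contents = Base64.getEncoder().encodeToString(csv);

        // Minimal hand-built JSON body; a real application would use a
        // JSON library rather than string concatenation.
        String body = "{\"path\": \"/FileStore/tables/data.csv\","
                + " \"contents\": \"" + contents + "\","
                + " \"overwrite\": true}";

        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create(host + "/api/2.0/dbfs/put"))
                .header("Authorization", "Bearer " + token)
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(body))
                .build();

        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.statusCode() + " " + response.body());
    }
}
```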
If a Java solution is not required, then you could also take a look at the databricks-cli. After installing it with pip (`pip install databricks-cli`), you simply have to:

- Configure the CLI by running `databricks configure` and entering:
  - Host: https://community.cloud.databricks.com/?o=7823909094774610
  - Username: <your username>
  - Password: <your password>
- Copy the file to DBFS by running `databricks fs cp <source> dbfs:/<target>`
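
For example, with a local file named data.csv (the file name and target path here are just placeholders), the copy step could look like `databricks fs cp data.csv dbfs:/FileStore/tables/data.csv`.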