Do we need to install Spark on YARN to read data from HDFS into PySpark?

Question
I have a Hadoop 3.1.1 multi-node cluster, and I want to make use of PySpark to read files from my HDFS into PySpark for ETL operations and then load the results into target MySQL databases.

Given below are my questions:

- Can I install Spark in standalone mode?
- Do I need to install Spark on YARN first?
- If not, how can I install Spark separately?
Answer 1
Score: 1
You can use any mode for communicating with HDFS and MySQL, including Kubernetes. Or you can just use `--master="local[*]"` and you don't need a scheduler at all. This is useful, for example, from a Jupyter notebook.
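As a minimal sketch of that "no scheduler" approach, a local-mode session can still read from HDFS directly; the namenode address and file path below are hypothetical placeholders for your cluster:

```python
from pyspark.sql import SparkSession

# Local mode: executors run in-process, so no YARN (or any other
# cluster manager) is required.
spark = (
    SparkSession.builder
    .master("local[*]")
    .appName("hdfs-local-read")
    .getOrCreate()
)

# Spark reads from HDFS as long as it can reach the namenode;
# "namenode" and the path are placeholders.
df = spark.read.csv("hdfs://namenode:8020/data/input.csv", header=True)
df.show(5)
```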
YARN would be recommended since you already have HDFS, and therefore you already have the scripts to start the YARN processes as well.
You don't really "install Spark on YARN". Applications from clients get submitted to the YARN cluster, and the `spark.yarn.archive` HDFS path gets unpacked into the classes necessary to run the job.

Refer to https://spark.apache.org/docs/latest/running-on-yarn.html
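To address the original ask end to end, here is a rough sketch of reading from HDFS and loading into MySQL over JDBC, run against YARN. It assumes HADOOP_CONF_DIR points at your cluster configuration and that the MySQL JDBC driver jar is supplied at submit time; all hosts, paths, table names, and credentials below are placeholders:

```python
# Typically launched from a client machine with something like:
#   spark-submit --master yarn --jars mysql-connector-j.jar etl.py
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .master("yarn")  # the client locates the cluster via HADOOP_CONF_DIR
    .appName("hdfs-to-mysql-etl")
    .getOrCreate()
)

# Extract: read the source files from HDFS (placeholder path).
src = spark.read.parquet("hdfs:///warehouse/events/")

# Transform: whatever DataFrame operations you need (placeholder filter).
out = src.filter("event_date >= '2023-01-01'")

# Load: write to the target MySQL database over JDBC (placeholder
# connection details; the MySQL JDBC driver must be on the classpath).
(out.write
    .format("jdbc")
    .option("url", "jdbc:mysql://mysql-host:3306/target_db")
    .option("dbtable", "events")
    .option("user", "etl_user")
    .option("password", "etl_password")
    .mode("append")
    .save())

spark.stop()
```

The same script runs unchanged under `local[*]` or a standalone master; only the `.master(...)` value (or the `--master` flag) changes.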