Do we need to install Spark on YARN to read data from HDFS into PySpark?

Question

I have a Hadoop 3.1.1 multi-node cluster, and I want to use PySpark to read files from HDFS, run ETL operations on them, and load the results into a target MySQL database.

My questions are:

  • Can I install Spark in standalone mode?
  • Do I need to install Spark on YARN first?
  • If not, how can I install Spark separately?

Answer 1

Score: 1

You can use any deployment mode to communicate with HDFS and MySQL, including standalone, YARN, and Kubernetes. Alternatively, just pass --master="local[*]" and you don't need a cluster manager at all. This is useful, for example, when working from a Jupyter Notebook.
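As a rough illustration of the local-mode option, here is a minimal sketch of an HDFS-to-MySQL job. The NameNode host and port, the input path, the database host, table name, and credentials are all placeholders invented for this example, and the helper names (hdfs_uri, mysql_jdbc_options, run_etl) are hypothetical. It also assumes the MySQL Connector/J JAR is available to Spark (for example via --jars), since Spark does not bundle it.

```python
def hdfs_uri(namenode: str, port: int, path: str) -> str:
    """Build a fully qualified hdfs:// URI for a file or directory."""
    return f"hdfs://{namenode}:{port}/{path.lstrip('/')}"


def mysql_jdbc_options(host, database, table, user, password, port=3306):
    """Collect the options Spark's JDBC data source expects for MySQL."""
    return {
        "url": f"jdbc:mysql://{host}:{port}/{database}",
        "dbtable": table,
        "user": user,
        "password": password,
        # Driver class name used by MySQL Connector/J 8.x.
        "driver": "com.mysql.cj.jdbc.Driver",
    }


def run_etl(master: str = "local[*]") -> None:
    """Read a CSV from HDFS, drop incomplete rows, append to a MySQL table."""
    # Imported here so the helpers above can be used without Spark installed.
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .master(master)  # "local[*]" runs Spark inside this process
        .appName("hdfs-to-mysql-etl")
        .getOrCreate()
    )

    # Placeholder NameNode address and input path (assumptions).
    source = hdfs_uri("namenode.example.com", 8020, "/data/input/events.csv")
    df = spark.read.csv(source, header=True, inferSchema=True)

    # Stand-in for the real transformations: drop rows with missing values.
    cleaned = df.dropna()

    # Placeholder MySQL connection details (assumptions).
    options = mysql_jdbc_options(
        "mysql.example.com", "warehouse", "events", "etl_user", "change-me"
    )
    cleaned.write.format("jdbc").options(**options).mode("append").save()

    spark.stop()
```

Calling run_etl() from a notebook cell or a script runs the whole job on the local machine. Passing a different master value is the only change needed in this sketch to target a cluster manager instead.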

YARN is the recommended choice, since you already run HDFS and therefore already have the scripts for starting the YARN processes.

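You don't really "install Spark on YARN". Applications are submitted from a client to the YARN cluster, so only the client machine needs a Spark distribution, with HADOOP_CONF_DIR or YARN_CONF_DIR pointing at the cluster's configuration files. The archive at the HDFS path set in spark.yarn.archive is distributed to the YARN containers and provides the classes needed to run the job.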
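To make the submission step more concrete, the sketch below assembles a spark-submit command line for YARN. The flags (--master, --deploy-mode, --conf, --jars) and the spark.yarn.archive property come from the Spark documentation; the script name, archive location, and the helper names build_yarn_submit_command and check_hadoop_conf are hypothetical and only serve the illustration.

```python
import os


def build_yarn_submit_command(app_path, deploy_mode="client",
                              spark_archive=None, extra_jars=()):
    """Return the argv list for submitting a PySpark script to YARN."""
    if deploy_mode not in ("client", "cluster"):
        raise ValueError("deploy_mode must be 'client' or 'cluster'")

    cmd = ["spark-submit", "--master", "yarn", "--deploy-mode", deploy_mode]

    if spark_archive:
        # Archive of Spark JARs already uploaded to HDFS; YARN caches it
        # so the runtime is not re-uploaded for every application.
        cmd += ["--conf", f"spark.yarn.archive={spark_archive}"]

    if extra_jars:
        # Extra JARs such as the MySQL JDBC driver, comma separated.
        cmd += ["--jars", ",".join(extra_jars)]

    cmd.append(app_path)
    return cmd


def check_hadoop_conf(env=None):
    """Return True if the client can locate the Hadoop/YARN configuration."""
    env = os.environ if env is None else env
    return bool(env.get("HADOOP_CONF_DIR") or env.get("YARN_CONF_DIR"))
```

For example, build_yarn_submit_command("etl_job.py", spark_archive="hdfs:///spark/spark-libs.zip") returns a list that can be passed to subprocess.run on the client machine, provided check_hadoop_conf() reports that the configuration directory is set. Both the script name and the archive path here are placeholders.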

For details, see https://spark.apache.org/docs/latest/running-on-yarn.html

huangapple
  • Posted on January 6, 2023 at 11:58:48
  • When reposting, please keep the link to this article: https://go.coder-hub.com/75026796.html