Do we need to install Spark on YARN to read data from HDFS into PySpark?

Question
I have a Hadoop 3.1.1 multi-node cluster, and I want to make use of PySpark to read files from my HDFS into PySpark for ETL operations and then load the results into target MySQL databases.

Given below are my questions:

- Can I install Spark in standalone mode?
- Do I need to install Spark on YARN first?
- If not, how can I install Spark separately?
Answer 1
Score: 1
You can use any mode for communicating with HDFS and MySQL, including Kubernetes. Or you can just use `--master="local[*]"` and you don't need a scheduler at all. This is useful, for example, from a Jupyter notebook.
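As a minimal sketch of that "no scheduler" approach, a local-mode session can still read from HDFS directly; the namenode address and file path below are hypothetical placeholders for your cluster:

```python
from pyspark.sql import SparkSession

# Local mode: executors run in-process, so no YARN (or any other
# cluster manager) is required.
spark = (
    SparkSession.builder
    .master("local[*]")
    .appName("hdfs-local-read")
    .getOrCreate()
)

# Spark reads from HDFS as long as it can reach the namenode;
# "namenode" and the path are placeholders.
df = spark.read.csv("hdfs://namenode:8020/data/input.csv", header=True)
df.show(5)
```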
YARN would be recommended since you already have HDFS, and therefore you already have the scripts to start the YARN processes as well.
You don't really "install Spark on YARN". Applications from clients get submitted to the YARN cluster, and the `spark.yarn.archive` HDFS path gets unpacked into the classes necessary to run the job.

Refer to https://spark.apache.org/docs/latest/running-on-yarn.html
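To address the original ask end to end, here is a rough sketch of reading from HDFS and loading into MySQL over JDBC, run against YARN. It assumes HADOOP_CONF_DIR points at your cluster configuration and that the MySQL JDBC driver jar is supplied at submit time; all hosts, paths, table names, and credentials below are placeholders:

```python
# Typically launched from a client machine with something like:
#   spark-submit --master yarn --jars mysql-connector-j.jar etl.py
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .master("yarn")  # the client locates the cluster via HADOOP_CONF_DIR
    .appName("hdfs-to-mysql-etl")
    .getOrCreate()
)

# Extract: read the source files from HDFS (placeholder path).
src = spark.read.parquet("hdfs:///warehouse/events/")

# Transform: whatever DataFrame operations you need (placeholder filter).
out = src.filter("event_date >= '2023-01-01'")

# Load: write to the target MySQL database over JDBC (placeholder
# connection details; the MySQL JDBC driver must be on the classpath).
(out.write
    .format("jdbc")
    .option("url", "jdbc:mysql://mysql-host:3306/target_db")
    .option("dbtable", "events")
    .option("user", "etl_user")
    .option("password", "etl_password")
    .mode("append")
    .save())

spark.stop()
```

The same script runs unchanged under `local[*]` or a standalone master; only the `.master(...)` value (or the `--master` flag) changes.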