Is it possible to read hdfs files from within executor
Question
I have a simple Spark application to illustrate my question. I would like to read HDFS files within the mapPartitions operator, using SparkContext.textFile, so that I could read the files in every partition and use them to work with partitionIter.

It looks like I can't use SparkContext inside mapPartitions. What could I do, then, to achieve my purpose of making the HDFS files work with partitionIter?
import org.apache.spark.{SparkConf, SparkContext}

object SparkTest2 {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("SparkTest")
    val sc = new SparkContext(conf)
    val rdd = sc.textFile("test1")
    rdd.mapPartitions { partitionIter =>
      // Read from HDFS for each partition
      // Is it possible to read hdfs files from within executor?
      Seq("a").toIterator
    }.collect()
  }
}
Answer 1
Score: 2
IMHO: Usually, using the standard way (read on the driver and pass to executors using Spark functions) is much easier operationally than doing things in a non-standard way. So in this case (with limited details), read the files on the driver as a DataFrame and join with it.
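A minimal sketch of that standard approach could look like the following; the paths hdfs:///data/main.csv and hdfs:///data/lookup.csv, the join column id, and the object name are assumptions for illustration, not from the question:

import org.apache.spark.sql.SparkSession

object DriverSideJoin {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("DriverSideJoin").getOrCreate()

    // Main dataset of the job (placeholder path and schema).
    val main = spark.read.option("header", "true").csv("hdfs:///data/main.csv")

    // The extra HDFS file is read as a DataFrame using Spark itself...
    val lookup = spark.read.option("header", "true").csv("hdfs:///data/lookup.csv")

    // ...and combined with the main data via a join, so Spark takes care of
    // getting the lookup rows to the executors.
    main.join(lookup, Seq("id"), "left").show()

    spark.stop()
  }
}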
That said, have you tried using the --files option for your spark-submit (or pyspark)?

--files FILES           Comma-separated list of files to be placed in the working
                        directory of each executor. File paths of these files
                        in executors can be accessed via SparkFiles.get(fileName).
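As a sketch of how a file shipped this way could then be used inside mapPartitions, assuming a hypothetical lookup file my_lookup.txt and filtering logic chosen purely for illustration:

spark-submit --files hdfs:///data/my_lookup.txt --class SparkFilesTest app.jar

import scala.io.Source
import org.apache.spark.{SparkConf, SparkContext, SparkFiles}

object SparkFilesTest {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("SparkFilesTest"))
    val rdd = sc.textFile("test1")

    val result = rdd.mapPartitions { partitionIter =>
      // The file passed with --files is copied to every executor's working
      // directory; SparkFiles.get resolves its local path there.
      val localPath = SparkFiles.get("my_lookup.txt")
      val lookup = Source.fromFile(localPath).getLines().toSet
      partitionIter.filter(lookup.contains)
    }.collect()

    result.foreach(println)
    sc.stop()
  }
}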
Comments