Is it possible to read HDFS files from within an executor?

Question

I have a simple Spark application to illustrate my question. I would like to read HDFS files within the mapPartitions operator using SparkContext.textFile, so that I can read the files in every partition and use them to work with partitionIter.

It looks like I can't use SparkContext inside the executor. What could I do instead to achieve my goal of using HDFS files together with partitionIter?


import org.apache.spark.{SparkConf, SparkContext}

object SparkTest2 {

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("SparkTest")
    val sc = new SparkContext(conf)
    val rdd = sc.textFile("test1")
    rdd.mapPartitions {
      partitionIter => {
        // Read from HDFS for each partition.
        // Is it possible to read HDFS files from within the executor here?
        // sc.textFile cannot be called at this point: SparkContext lives only
        // on the driver and is not serializable into executor closures.
        Seq("a").toIterator
      }
    }.collect()

  }
}

Answer 1

Score: 2

IMHO: using the standard way (read on the driver and pass to the executors through Spark functions) is usually much easier operationally than doing things in a non-standard way. So in this case (with the limited details given), read the files on the driver as a DataFrame and join with it.
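
A minimal sketch of that standard approach, assuming a hypothetical side file hdfs:///data/lookup.csv that shares an id column with the main data set (the paths, column names, and header option are illustrative, not taken from the question):

import org.apache.spark.sql.SparkSession

object DriverReadAndJoin {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("DriverReadAndJoin").getOrCreate()

    // Both reads are declared on the driver; Spark still performs the actual
    // I/O on the executors, so no SparkContext is needed inside any closure.
    val main = spark.read.option("header", "true").csv("hdfs:///data/main.csv")     // hypothetical path
    val lookup = spark.read.option("header", "true").csv("hdfs:///data/lookup.csv") // hypothetical path

    // Join instead of re-reading the side file inside mapPartitions.
    val joined = main.join(lookup, Seq("id"), "left")
    joined.show()

    spark.stop()
  }
}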

That said, have you tried using the --files option of spark-submit (or pyspark)?

--files FILES     Comma-separated list of files to be placed in the working
                  directory of each executor. File paths of these files
                  in executors can be accessed via SparkFiles.get(fileName).
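
A sketch of how that could look with the original RDD code, assuming the job is submitted with a hypothetical side file, e.g. spark-submit --class SparkFilesTest --files hdfs:///data/lookup.txt app.jar (the file name and its use as a filter set are illustrative):

import scala.io.Source
import org.apache.spark.{SparkConf, SparkContext, SparkFiles}

object SparkFilesTest {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("SparkFilesTest"))
    val rdd = sc.textFile("test1")

    val result = rdd.mapPartitions { partitionIter =>
      // SparkFiles.get resolves the local path of a file shipped with --files.
      // It works inside executor code because it does not touch SparkContext.
      val localPath = SparkFiles.get("lookup.txt")
      val source = Source.fromFile(localPath)
      val lookup = try source.getLines().toSet finally source.close()
      // Use the side data together with partitionIter, e.g. as a filter.
      partitionIter.filter(lookup.contains)
    }.collect()

    result.foreach(println)
    sc.stop()
  }
}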
