Is it possible to read hdfs files from within executor
Question
I have a simple Spark application to illustrate my question. I would like to read HDFS files within the mapPartitions operator, using SparkContext.textFile, so that I could read the files in every partition and use them to work with partitionIter.

It looks like I can't use SparkContext inside mapPartitions. What could I do, then, to achieve my purpose of making the HDFS files work with partitionIter?
import org.apache.spark.{SparkConf, SparkContext}

object SparkTest2 {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("SparkTest")
    val sc = new SparkContext(conf)
    val rdd = sc.textFile("test1")
    rdd.mapPartitions { partitionIter =>
      // Read from HDFS for each partition
      // Is it possible to read hdfs files from within executor?
      Seq("a").toIterator
    }.collect()
  }
}
Answer 1
Score: 2
IMHO: Usually, using the standard way (read on the driver and pass to executors using Spark functions) is much easier operationally than doing things in a non-standard way. So in this case (with limited details), read the files on the driver as a DataFrame and join with it.
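A minimal sketch of that standard approach could look like the following; the paths hdfs:///data/main.csv and hdfs:///data/lookup.csv, the join column id, and the object name are assumptions for illustration, not from the question:

import org.apache.spark.sql.SparkSession

object DriverSideJoin {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("DriverSideJoin").getOrCreate()

    // Main dataset of the job (placeholder path and schema).
    val main = spark.read.option("header", "true").csv("hdfs:///data/main.csv")

    // The extra HDFS file is read as a DataFrame using Spark itself...
    val lookup = spark.read.option("header", "true").csv("hdfs:///data/lookup.csv")

    // ...and combined with the main data via a join, so Spark takes care of
    // getting the lookup rows to the executors.
    main.join(lookup, Seq("id"), "left").show()

    spark.stop()
  }
}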
That said, have you tried using the --files option for your spark-submit (or pyspark)?

--files FILES           Comma-separated list of files to be placed in the working
                        directory of each executor. File paths of these files
                        in executors can be accessed via SparkFiles.get(fileName).
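As a sketch of how a file shipped this way could then be used inside mapPartitions, assuming a hypothetical lookup file my_lookup.txt and filtering logic chosen purely for illustration:

spark-submit --files hdfs:///data/my_lookup.txt --class SparkFilesTest app.jar

import scala.io.Source
import org.apache.spark.{SparkConf, SparkContext, SparkFiles}

object SparkFilesTest {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("SparkFilesTest"))
    val rdd = sc.textFile("test1")

    val result = rdd.mapPartitions { partitionIter =>
      // The file passed with --files is copied to every executor's working
      // directory; SparkFiles.get resolves its local path there.
      val localPath = SparkFiles.get("my_lookup.txt")
      val lookup = Source.fromFile(localPath).getLines().toSet
      partitionIter.filter(lookup.contains)
    }.collect()

    result.foreach(println)
    sc.stop()
  }
}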
Comments