Spark saveAsTextFile is taking a lot of time – 1.6.3


Question

I extract data from Mongo, process it, and then store the result in HDFS.

Extracting and processing 1M records completes in under 1.1 minutes.

Extraction code

JavaRDD<Document> rdd = MongoSpark.load(jsc);

Processing code

JavaRDD<String> fullFile = rdd.map(new Function<Document, String>() {
    public String call(Document s) {
        // Flatten the BSON document to JSON and keep only the requested keys
        return JsonParsing.returnKeyJson(
                JsonParsing.returnFlattenMapJson(s.toJson()),
                args[3].split(","),
                extractionDetails);
    }
});
System.out.println("Records Downloaded - " + fullFile.count());

Everything up to this point completes in under 1.1 minutes; I know this because count() is an action, so fetching the count of the RDD at that point forces the extraction and the map to run.

After that I have a save command, as follows:

fullFile
 .coalesce(1)
 .saveAsTextFile(args[4], GzipCodec.class);

This takes at least 15 to 20 minutes to save to HDFS.

I am not sure why it takes so much time.
Please let me know if anything can be done to speed up the process.

I am running it with the following options:
--num-executors 4 --executor-memory 4g --executor-cores 4

Increasing the number of executors or the memory still makes no difference.
I have set the number of partitions to 70; I am not sure whether increasing it would improve performance.
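
(Editorial note: the partition count that actually reaches the write stage can be checked directly on the RDD. A minimal sketch, assuming the same fullFile RDD as above; JavaRDD.getNumPartitions() is available from Spark 1.6 onward.)

    // Partitions of the processed RDD before any coalesce
    System.out.println("Partitions before save: " + fullFile.getNumPartitions());
    // coalesce(1) collapses everything into a single partition, i.e. a single write task
    System.out.println("Partitions after coalesce(1): " + fullFile.coalesce(1).getNumPartitions());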

Any suggestion to reduce the save time would be helpful.

Thanks in advance


Answer 1

Score: 1

fullFile
   .coalesce(1)
   .saveAsTextFile(args[4], GzipCodec.class);

Here you are using coalesce(1), which means you are reducing the number of partitions to just 1, and that is why it takes so much time: with only one partition at write time, a single task/executor writes the entire data set to the target location. If you want the write to be faster, increase the value passed to coalesce.
Simply remove the coalesce, or increase its value. You can check the number of partitions used while writing the data in the Spark UI.
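
A minimal sketch of both options, assuming the same fullFile RDD and output path args[4] as above (the value 16 is purely illustrative, not a tuned recommendation):

    // Option 1: drop coalesce and let every partition write its own part file in parallel.
    // If a single output file is really needed, merge afterwards outside Spark, e.g.
    //   hdfs dfs -getmerge <outputDir> merged.gz
    // (concatenated gzip members still form a valid gzip stream).
    fullFile.saveAsTextFile(args[4], GzipCodec.class);

    // Option 2: keep a modest number of partitions so several executors still write in parallel.
    fullFile.coalesce(16).saveAsTextFile(args[4], GzipCodec.class);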





huangapple · Posted on 2020-09-25 10:37:36 · Original link: https://go.coder-hub.com/64057012.html