spark saveAsTextFile is taking a lot of time - 1.6.3
Question
I extract data from Mongo, process the data, and then store it in HDFS.
Extraction and processing of 1M records completes in less than 1.1 minutes.
Extraction Code
JavaRDD<Document> rdd = MongoSpark.load(jsc);
Processing Code
JavaRDD<String> fullFile = rdd.map(new Function<Document, String>() {
public String call(Document s) {
// System.out.println(" About to Transform Json ----- " + s.toJson());
return JsonParsing.returnKeyJson(JsonParsing.returnFlattenMapJson(s.toJson()),args[3].split(","),extractionDetails);
}
});
System.out.println("Records Downloaded - " + fullFile.count());
This completes in less than 1.1 minutes, since I fetch the count of the RDD at that point.
After that I have a save command, as follows:
```java
fullFile
    .coalesce(1)
    .saveAsTextFile(args[4], GzipCodec.class);
```
This takes at least 15 to 20 minutes to save into HDFS.
Not sure why it takes so much time.
Let me know if anything can be done to speed up the process.
I am using the following options to run it,
```
--num-executors 4 --executor-memory 4g --executor-cores 4
```
If I increase the number of executors or memory, it still does not make any difference.
I have set the number of partitions to 70; not sure if increasing this would improve performance?
Any suggestion to reduce the save time would be helpful.
Thanks in advance.
Answer 1
Score: 1
```java
fullFile
    .coalesce(1)
    .saveAsTextFile(args[4], GzipCodec.class);
```
Here you're using `coalesce(1)`, which means you're reducing the number of partitions to just 1; that's why it is taking more time. Since there is only one partition at the time of writing, only one task/executor writes the whole data set to the desired location. If you want to write faster, increase the partition value in `coalesce`.

Simply remove `coalesce`, or increase the value passed to it. You can check the number of partitions used while writing the data in the Spark UI.
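
For illustration, a minimal sketch of the suggested change, building on the question's own `fullFile` RDD, output path `args[4]`, and codec; the partition count of 16 is an arbitrary example and should be tuned to the data size and cluster:

```java
// Keep several partitions instead of collapsing to one, so multiple
// tasks write part files in parallel. 16 is only an example value.
fullFile
    .coalesce(16)
    .saveAsTextFile(args[4], GzipCodec.class);
```

If a single output file is really needed, writing several part files and then merging them outside Spark (for example with `hdfs dfs -getmerge`) is usually faster than forcing the whole write through one task with `coalesce(1)`.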
</details>