Apache Spark: best way to join 2 Hive tables
Question
What is the best way to join 2 Hive tables (external) with Spark, in terms of performance?

a) With Spark SQL:

spark.sql("select * from tableA join tableB on a=b")

b) Reading the Parquet files directly:
val df1 = spark.read.parquet("/hdfs/location1")
val df2 = spark.read.parquet("/hdfs/location2")
val joined = df1.join(df2, Seq("id"), "inner")
Are there any differences? Would it make a difference if the tables were managed?

I noticed that the plans are the same, but when joining the tables from Hive, the executors always end up downloading the whole file size.
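For reference, this is roughly how the two plans can be compared side by side — a minimal sketch, assuming the Hive tables and Parquet locations from the question exist:

// Compare the physical plans of both approaches with explain().
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("join-plan-comparison")
  .enableHiveSupport() // needed to read the Hive tables
  .getOrCreate()

// a) Joining the Hive tables through Spark SQL
val viaSql = spark.sql("select * from tableA join tableB on a=b")
viaSql.explain(true) // prints parsed, analyzed, optimized and physical plans

// b) Joining the Parquet files read directly
val df1 = spark.read.parquet("/hdfs/location1")
val df2 = spark.read.parquet("/hdfs/location2")
val viaParquet = df1.join(df2, Seq("id"), "inner")
viaParquet.explain(true)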
Answer 1

Score: 1
I would say it rather depends on the developer's comfort with Spark SQL versus the DataFrame API, whichever he or she wants to use. It really does not matter in terms of performance, since the underlying Spark built-in functions mirror each other (DataFrame and Spark SQL).

A while ago, when I had this thought, I was going through a blog post by Confessions of a Data Guy: dataframes-vs-sparksql.

He mentions in the last section of the blog:
> I think the difference between SparkSQL and DataFrames in pipelines is probably more theoretical and emotional than anything else.
I definitely agree with that.
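To make that concrete, here is a minimal sketch — the tiny datasets and column names are made up purely for illustration — that lets you print and compare the physical plans of the same join expressed through both front ends:

// Both front ends go through the same Catalyst optimizer;
// printing the physical plans lets you verify they come out the same.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("sql-vs-dataframe")
  .master("local[*]")
  .getOrCreate()
import spark.implicits._

// Hypothetical toy data, just for the demonstration
val a = Seq((1, "x"), (2, "y")).toDF("id", "va")
val b = Seq((1, "p"), (2, "q")).toDF("id", "vb")
a.createOrReplaceTempView("ta")
b.createOrReplaceTempView("tb")

// The same inner join, once as SQL, once through the DataFrame API
val viaSql = spark.sql("select * from ta join tb on ta.id = tb.id")
val viaDf  = a.join(b, a("id") === b("id"), "inner")

println(viaSql.queryExecution.executedPlan)
println(viaDf.queryExecution.executedPlan)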