Apache Spark: best way to join two Hive tables

Question

What is the best way to join two external Hive tables with Spark, in terms of performance?

a) With Spark SQL:

spark.sql("select * from tableA join tableB on a=b")

b) Reading the Parquet files directly:

val df1 = spark.read.parquet("/hdfs/location1")
val df2 = spark.read.parquet("/hdfs/location2")
val joined = df1.join(df2, Seq("id"), "inner")

Is there any difference? Would it make a difference if the tables were managed?

I noticed that the plans are the same, but when joining the tables through Hive, the executors always download the whole file size.
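For reference, one way to compare the two variants is to print both physical plans and look at the scan nodes. This is only a rough sketch: it assumes a Hive-enabled SparkSession and that both tables share an `id` column (the snippets above join on `a=b` in one case and on `id` in the other, so the column name here is an assumption). The object name `JoinPlanCheck` and its helpers are made up for illustration. `spark.sql.hive.convertMetastoreParquet` is the Spark setting that decides whether metastore Parquet tables are read with Spark's built-in Parquet reader instead of the Hive SerDe; it is on by default.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object JoinPlanCheck {
  // Physical plan Spark would run for `df`, as plain text.
  def physicalPlan(df: DataFrame): String =
    df.queryExecution.executedPlan.toString

  // (a) Join through the metastore tables. Joining on `id` is an assumption.
  def viaSql(spark: SparkSession): DataFrame =
    spark.sql("select * from tableA a join tableB b on a.id = b.id")

  // (b) Join after reading the Parquet files directly, as in the question.
  def viaFiles(spark: SparkSession): DataFrame = {
    val df1 = spark.read.parquet("/hdfs/location1")
    val df2 = spark.read.parquet("/hdfs/location2")
    df1.join(df2, Seq("id"), "inner")
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("join-plan-check")
      .enableHiveSupport()
      .getOrCreate()

    // Falls back to "true" (the documented default) if the key is not set.
    println(spark.conf.get("spark.sql.hive.convertMetastoreParquet", "true"))

    println(physicalPlan(viaSql(spark)))
    println(physicalPlan(viaFiles(spark)))

    spark.stop()
  }
}
```

When reading the output, the interesting part is the scan node of each plan: which reader is used, and which columns and filters are pushed down to it.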

Answer 1

Score: 1

I would say it mostly depends on which one the developer is more comfortable with, Spark SQL or the DataFrame API. In terms of performance it does not really matter, since the underlying Spark built-in functions mirror each other (DataFrame and Spark SQL).
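To illustrate the point, here is a small local sketch, not taken from the blog or from the question's tables: the same inner join written once as a SQL string and once with the DataFrame API. Both forms go through the same Catalyst optimizer, so you can print the plans side by side and check that the results match. The view names `left_t` and `right_t`, the columns, and the sample rows are all made up for the example.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object SqlVsDataFrameJoin {
  // Join written as a SQL string over two temp views (hypothetical names).
  def joinWithSql(spark: SparkSession): DataFrame =
    spark.sql(
      "select l.id, l.name, r.amount from left_t l join right_t r on l.id = r.id")

  // The same join written with the DataFrame API.
  def joinWithApi(left: DataFrame, right: DataFrame): DataFrame =
    left.join(right, Seq("id"), "inner").select("id", "name", "amount")

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("sql-vs-dataframe")
      .getOrCreate()
    import spark.implicits._

    // Tiny in-memory inputs standing in for the real tables.
    val left = Seq((1, "a"), (2, "b"), (3, "c")).toDF("id", "name")
    val right = Seq((1, 10.0), (3, 30.0), (4, 40.0)).toDF("id", "amount")
    left.createOrReplaceTempView("left_t")
    right.createOrReplaceTempView("right_t")

    val bySql = joinWithSql(spark)
    val byApi = joinWithApi(left, right)

    // Print both physical plans for a side-by-side comparison.
    bySql.explain()
    byApi.explain()

    // Check that neither result contains a row the other one lacks.
    val sameRows =
      bySql.except(byApi).count() == 0 && byApi.except(bySql).count() == 0
    println(s"same rows: $sameRows")

    spark.stop()
  }
}
```

The row comparison only shows that the two forms are equivalent in result; for the performance question, the printed plans are what to look at.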

A while ago, when I had the same question, I was reading a post on the Confessions of a Data Guy blog, dataframes-vs-sparksql.

In the last section of the post, he writes:

> I think the difference between SparkSQL and DataFrames in pipelines is probably more theoretical and emotional than anything else.

I definitely agree with that.

  • Posted by huangapple on May 29, 2023, 18:54:09
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76356722.html