May 29, 2023, 18:54:09

Apache Spark: best way to join 2 Hive tables

Question

What is the best way to join two external Hive tables with Spark, in terms of performance?

a) With Spark SQL:

spark.sql("select * from tableA join tableB on a=b")

b) Reading the Parquet files directly:

val df1 = spark.read.parquet("/hdfs/location1")
val df2 = spark.read.parquet("/hdfs/location2")
val joined = df1.join(df2, Seq("id"), "inner")

Is there any difference between the two? Would it make a difference if the tables were managed?

I noticed that the plans are the same, but when the tables are joined through Hive, the executors always download the whole file size.

Answer 1

Score: 1

I would say it depends on which of the two, Spark SQL or the DataFrame API, the developer is more comfortable with. It really does not matter in terms of performance, since the underlying Spark built-in functions mirror each other (DataFrame and Spark SQL).

A while ago, when I had the same question, I came across a blog post by Confessions of a Data Guy: dataframes-vs-sparksql

He mentions in the last section of the post:

> I think the difference between SparkSQL and DataFrames in pipelines is probably more theoretical and emotional than anything else.

I definitely agree with that.
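
One way to check this claim on your own data is to express the same join both ways and compare what `explain()` prints. The following is a minimal sketch, not a benchmark. It assumes a local Spark session, and two tiny Parquet datasets registered as temp views stand in for the Hive tables, so it does not reproduce any Hive-specific read behavior. The object name `JoinPlanComparison`, the helpers `prepare`, `sqlJoin` and `dataFrameJoin`, and the columns `left_value` and `right_value` are all hypothetical. The SQL uses `USING (id)` so that it matches the `Seq("id")` form of the DataFrame join.

```scala
import java.nio.file.Files
import org.apache.spark.sql.{DataFrame, SparkSession}

// Sketch only: small local Parquet datasets stand in for the Hive tables.
object JoinPlanComparison {

  // Writes two tiny Parquet datasets to a temp directory and registers
  // them as the temp views tableA and tableB. Returns the two paths.
  def prepare(spark: SparkSession): (String, String) = {
    import spark.implicits._
    val dir = Files.createTempDirectory("join-demo").toString
    val (pathA, pathB) = (s"$dir/a", s"$dir/b")
    Seq((1, "x"), (2, "y"), (3, "z")).toDF("id", "left_value").write.parquet(pathA)
    Seq((2, 20.0), (3, 30.0), (4, 40.0)).toDF("id", "right_value").write.parquet(pathB)
    spark.read.parquet(pathA).createOrReplaceTempView("tableA")
    spark.read.parquet(pathB).createOrReplaceTempView("tableB")
    (pathA, pathB)
  }

  // The join expressed in Spark SQL against the registered views.
  def sqlJoin(spark: SparkSession): DataFrame =
    spark.sql("SELECT * FROM tableA JOIN tableB USING (id)")

  // The same join expressed with the DataFrame API on the Parquet paths.
  def dataFrameJoin(spark: SparkSession, pathA: String, pathB: String): DataFrame =
    spark.read.parquet(pathA).join(spark.read.parquet(pathB), Seq("id"), "inner")

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("join-plan-comparison")
      .getOrCreate()
    val (pathA, pathB) = prepare(spark)
    // Print both physical plans so they can be compared by eye.
    sqlJoin(spark).explain()
    dataFrameJoin(spark, pathA, pathB).explain()
    spark.stop()
  }
}
```

To run the same comparison against real tables, replace the temp views with the actual table names and the temp paths with the table locations, then look at the scan nodes in the two plans.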
