Question

I am using spark-sql 2.4.x and the datastax-spark-cassandra-connector for Cassandra 3.x, along with Kafka.

I have a scenario where some finance data arrives from a Kafka topic. The data (the base dataset) contains the fields companyId, year, and prev_year.

If year === prev_year, I need to join with a different table, exchange_rates.

If year =!= prev_year, I need to return the base dataset itself.

How can I do this in spark-sql?

Answer 1

Score: 1

You can use the approach below for your case.

scala> Input_df.show
+---------+----+---------+----+
|companyId|year|prev_year|rate|
+---------+----+---------+----+
|        1|2016|     2017|  12|
|        1|2017|     2017|21.4|
|        2|2018|     2017|11.7|
|        2|2018|     2018|44.6|
|        3|2016|     2017|34.5|
|        4|2017|     2017|  56|
+---------+----+---------+----+

scala> exch_rates.show
+---------+----+
|companyId|rate|
+---------+----+
|        1|12.3|
|        2|12.5|
|        3|22.3|
|        4|34.6|
|        5|45.2|
+---------+----+

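scala> import org.apache.spark.sql.functions.col

scala> // Rows where year equals prev_year keep their original rate.
scala> val equaldf = Input_df.filter(col("year") === col("prev_year"))

scala> // Rows where year differs from prev_year.
scala> val notequaldf = Input_df.filter(col("year") =!= col("prev_year"))

scala> // Drop the original rate and take it from exch_rates instead.
scala> val joindf = notequaldf.alias("n").drop("rate").join(exch_rates.alias("e"), List("companyId"), "left")

scala> // union matches columns by position, so both sides need the same column order.
scala> val finalDF = equaldf.union(joindf)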

scala> finalDF.show()
+---------+----+---------+----+
|companyId|year|prev_year|rate|
+---------+----+---------+----+
|        1|2017|     2017|21.4|
|        2|2018|     2018|44.6|
|        4|2017|     2017|  56|
|        1|2016|     2017|12.3|
|        2|2018|     2017|12.5|
|        3|2016|     2017|22.3|
+---------+----+---------+----+
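
Note that the transcript above takes the rate from exch_rates for the rows where year differs from prev_year, while the question asks for the lookup when the two columns are equal. Swapping the two filters gives the question's version. As an alternative, the same result can probably be reached without splitting and re-unioning the data, by doing one left join and picking the rate per row with `when`/`otherwise`. The following is only a minimal sketch under some assumptions: the object name `ConditionalRateLookup`, the helper `applyExchangeRates`, the temporary column `exch_rate`, and the local SparkSession are all made up for illustration, and exch_rates is assumed to hold at most one row per companyId (otherwise the join would duplicate base rows).

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{coalesce, col, when}

object ConditionalRateLookup {

  // Hypothetical helper: returns the base rows, with rate taken from the
  // lookup table only for rows where year equals prev_year.
  def applyExchangeRates(base: DataFrame, rates: DataFrame): DataFrame = {
    // Rename the lookup column so it does not clash with the base rate.
    val lookup = rates.withColumnRenamed("rate", "exch_rate")

    base
      .join(lookup, Seq("companyId"), "left")
      .withColumn(
        "rate",
        // Design choice (an assumption): fall back to the base rate when
        // the company has no entry in the lookup table.
        when(col("year") === col("prev_year"),
             coalesce(col("exch_rate"), col("rate")))
          .otherwise(col("rate")))
      .drop("exch_rate")
  }

  def main(args: Array[String]): Unit = {
    // Local session for illustration only.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("conditional-rate-lookup")
      .getOrCreate()
    import spark.implicits._

    // Same sample rows as in the transcript above (rates kept as strings).
    val inputDf = Seq(
      (1, 2016, 2017, "12"),
      (1, 2017, 2017, "21.4"),
      (2, 2018, 2017, "11.7"),
      (2, 2018, 2018, "44.6"),
      (3, 2016, 2017, "34.5"),
      (4, 2017, 2017, "56")
    ).toDF("companyId", "year", "prev_year", "rate")

    val exchRates = Seq(
      (1, "12.3"), (2, "12.5"), (3, "22.3"), (4, "34.6"), (5, "45.2")
    ).toDF("companyId", "rate")

    applyExchangeRates(inputDf, exchRates).show()
    spark.stop()
  }
}
```

To reproduce the behavior of the transcript instead, change `===` to `=!=` inside `when`. Because `withColumn` overwrites the existing rate column in place, the output keeps the column order companyId, year, prev_year, rate.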

