August 9, 2023 11:47:28

How to convert JSON object as a value in a column in SPARK AZURE-DATABRICKS using SCALA as per requirement

Question

Can someone help me resolve the problem below?

I have a Spark DataFrame in which one column's values arrive in JSON format. Please see the screenshot below for reference:

My requirements are as follows:

From the 3 objects in a single row, I have to select the "average_rating" attribute that is not null. If there is none, I take null. This value goes into the column "Average_Rating". I also have to select the "STATUS" from the object where "average_rating" is not null and put it into a column "average_rating_status".
From the 3 objects in a single row, I have to select the "number_of_recent_voters" attribute that is not null. If there is none, I take null. This value goes into the column "number_of_recent_voters". I also have to select the "STATUS" from the object where "number_of_recent_voters" is not null and put it into a column "number_of_recent_voters_status".
From the 3 objects in a single row, I have to select the "number_of_voters" attribute that is not null. If there is none, I take null. This value goes into the column "number_of_voters". I also have to select the "STATUS" from the object where "number_of_voters" is not null and put it into a column "number_of_voters_status".

I have to write the code in Scala in my Azure Databricks notebook. Can anyone please help with the code?

Thank you

Edit:

+------------+----------------+-------------------------+------------------+-----------------------+--------------------------------+-------------------------+
| compliance | average_rating | number_of_recent_voters | number_of_voters | average_rating_status | number_of_recent_voters_status | number_of_voters_status |
+------------+----------------+-------------------------+------------------+-----------------------+--------------------------------+-------------------------+
| true       | 4.7            | 254                     | 254              | PASS                  | PASS                           | PASS                    |
+------------+----------------+-------------------------+------------------+-----------------------+--------------------------------+-------------------------+

The output should look like the above.

Answer 1

Score: 1

Code

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.row_number
// toDF needs spark.implicits._, which Databricks notebooks import automatically.

// Sample data: each rating value is a JSON array of three objects.
val data = List(
  (true, """[{"average_rating":4.7,"number_of_recent_voters":null,"number_of_voters":null,"status":"PASS"},{"average_rating":null,"number_of_recent_voters":null,"number_of_voters":254,"status":"FAIL"},{"average_rating":null,"number_of_recent_voters":254,"number_of_voters":null,"status":"PASS"}]"""),
  (true, """[{"average_rating":2.7,"number_of_recent_voters":null,"number_of_voters":null,"status":"PASS"},{"average_rating":null,"number_of_recent_voters":null,"number_of_voters":123,"status":"PASS"},{"average_rating":null,"number_of_recent_voters":324,"number_of_voters":null,"status":"PASS"}]""")
).toDF("compliance", "rating")
data.show(false)

// Number the source rows so the exploded objects can be grouped back together.
// A window without partitionBy moves all rows into a single partition.
val windowSpec = Window.orderBy("rating")
val data1 = data.withColumn("row_number", row_number().over(windowSpec))

// Parse the JSON string and explode the array into one row per object.
val data2 = data1.selectExpr("compliance", "row_number", """inline_outer(from_json(rating, 'ARRAY<STRUCT<average_rating DOUBLE, number_of_recent_voters DOUBLE, number_of_voters DOUBLE, status STRING>>'))""")
data2.show()

// max ignores nulls, so this keeps the non-null value of each attribute per source row.
val data3 = data2.groupBy("row_number")
  .max("average_rating", "number_of_recent_voters", "number_of_voters")
  .withColumnRenamed("max(average_rating)", "average_rating")
  .withColumnRenamed("max(number_of_recent_voters)", "number_of_recent_voters")
  .withColumnRenamed("max(number_of_voters)", "number_of_voters")
data3.show(false)

// Join back to the exploded rows to pick up the status that belongs to each value.
val data4 = data3.join(data2, Seq("row_number", "average_rating"), "inner")
  .select(data3.col("*"), data2.col("status"))
  .withColumnRenamed("status", "average_rating_status")
display(data4)

val data5 = data4.join(data2, Seq("row_number", "number_of_recent_voters"), "inner")
  .select(data4.col("*"), data2.col("status"))
  .withColumnRenamed("status", "number_of_recent_voters_status")
display(data5)

val data6 = data5.join(data2, Seq("row_number", "number_of_voters"), "inner")
  .select(data5.col("*"), data2.col("status"))
  .withColumnRenamed("status", "number_of_voters_status")
display(data6)

This code adds a row_number column to the dataframe using the row_number function over a window ordered by the rating field. It uses from_json to parse the rating column as an array of structs, then inline_outer to explode the array into one row per object. It groups the exploded dataframe by row_number and takes the maximum of average_rating, number_of_recent_voters, and number_of_voters in each group; max ignores nulls, so this yields the non-null value of each attribute. It then joins the aggregated dataframe back to the exploded dataframe (data2) on row_number and average_rating, selects the status column from data2, and renames it to average_rating_status.
The same is done for number_of_recent_voters and number_of_voters, with the status columns renamed to number_of_recent_voters_status and number_of_voters_status, respectively.

The resulting dataframe data6 has the required results. Note that the joins are inner joins and null keys never match, so a source row in which some attribute is null in every object would be dropped rather than returned with nulls.

Answer 2

Score: 0

You can use from_json with a schema; below is a sample solution.

scala> val data = List((true, """[{"average_rating":4.7,"number_of_recent_voters":null,"number_of_voters":null,"status":"PASS"},{"average_rating":null,"number_of_recent_voters":null,"number_of_voters":254,"status":"PASS"},{"average_rating":null,"number_of_recent_voters":254,"number_of_voters":null,"status":"PASS"}]""")).toDF("compliance","rating")
data: org.apache.spark.sql.DataFrame = [compliance: boolean, rating: string]

scala> data.show(false)
+----------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|compliance|rating                                                                                                                                                                                                                                                                                     |
+----------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|true      |[{"average_rating":4.7,"number_of_recent_voters":null,"number_of_voters":null,"status":"PASS"},{"average_rating":null,"number_of_recent_voters":null,"number_of_voters":254,"status":"PASS"},{"average_rating":null,"number_of_recent_voters":254,"number_of_voters":null,"status":"PASS"}]|
+----------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

scala> data.printSchema
root
 |-- compliance: boolean (nullable = false)
 |-- rating: string (nullable = true)

scala> val finalDF = data.selectExpr("compliance", """inline_outer(from_json(rating, 'ARRAY<STRUCT<average_rating DOUBLE, number_of_recent_voters DOUBLE, number_of_voters DOUBLE, status STRING>>'))""")
finalDF: org.apache.spark.sql.DataFrame = [compliance: boolean, average_rating: double ... 3 more fields]

scala> finalDF.printSchema
root
 |-- compliance: boolean (nullable = false)
 |-- average_rating: double (nullable = true)
 |-- number_of_recent_voters: double (nullable = true)
 |-- number_of_voters: double (nullable = true)
 |-- status: string (nullable = true)

scala> finalDF.show(false)
+----------+--------------+-----------------------+----------------+------+
|compliance|average_rating|number_of_recent_voters|number_of_voters|status|
+----------+--------------+-----------------------+----------------+------+
|true      |4.7           |null                   |null            |PASS  |
|true      |null          |null                   |254.0           |PASS  |
|true      |null          |254.0                  |null            |PASS  |
+----------+--------------+-----------------------+----------------+------+
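
The exploded rows above still have to be collapsed back to one row per source record to match the output requested in the question. One possible way to do that, offered only as a sketch, is conditional aggregation: take max of each attribute (max skips nulls) and max of the status restricted to the objects where that attribute is non-null. The names row_id, exploded, attributes, aggregations, and result are illustrative and do not come from either answer, and the sketch assumes that at most one object per row carries a non-null value for each attribute; otherwise the value and the status could come from different objects.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, max, monotonically_increasing_id, when}

// In a Databricks notebook `spark` already exists; getOrCreate() simply returns it.
val spark = SparkSession.builder().getOrCreate()
import spark.implicits._

val data = List(
  (true, """[{"average_rating":4.7,"number_of_recent_voters":null,"number_of_voters":null,"status":"PASS"},{"average_rating":null,"number_of_recent_voters":null,"number_of_voters":254,"status":"FAIL"},{"average_rating":null,"number_of_recent_voters":254,"number_of_voters":null,"status":"PASS"}]""")
).toDF("compliance", "rating")

val schema = "ARRAY<STRUCT<average_rating DOUBLE, number_of_recent_voters DOUBLE, number_of_voters DOUBLE, status STRING>>"

// One row per JSON object, tagged with an id for the source row (row_id is a hypothetical name).
val exploded = data
  .withColumn("row_id", monotonically_increasing_id())
  .selectExpr("row_id", "compliance", s"inline_outer(from_json(rating, '$schema'))")

val attributes = Seq("average_rating", "number_of_recent_voters", "number_of_voters")

// For each attribute: its non-null value, plus the status of the object that carried it.
// Both are null when no object in the row has a value for that attribute.
val aggregations = attributes.flatMap { a =>
  Seq(
    max(col(a)).as(a),
    max(when(col(a).isNotNull, col("status"))).as(s"${a}_status")
  )
}

val result = exploded
  .groupBy("row_id", "compliance")
  .agg(aggregations.head, aggregations.tail: _*)
  .select("compliance", (attributes ++ attributes.map(a => s"${a}_status")): _*)

result.show(false)
```

Because the schema declares the voter counts as DOUBLE, as in the answers, they are shown as 254.0 rather than 254; cast them if integers are needed. Unlike a chain of inner joins, this keeps rows in which an attribute is null in every object.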
