Scala Spark - Split JSON column to multiple columns

Question

Scala noob, using Spark 2.3.0.
I'm creating a DataFrame using a UDF that produces a JSON string column:

val result: DataFrame = df.withColumn("decrypted_json", instance.decryptJsonUdf(df("encrypted_data")))

It outputs as follows:

+----------------+---------------------------------------+
| encrypted_data | decrypted_json                        |
+----------------+---------------------------------------+
|eyJleHAiOjE1 ...| {"a":547.65 , "b":"Some Data"}        |
+----------------+---------------------------------------+

The UDF is external code that I can't change. I would like to split the decrypted_json column into individual columns so that the output DataFrame looks like this:

+----------------+--------+-------------+
| encrypted_data | a      | b           |
+----------------+--------+-------------+
|eyJleHAiOjE1 ...| 547.65 | "Some Data" |
+----------------+--------+-------------+
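
For a JSON string column with a couple of known keys like this, one quick option is json_tuple. This is only a hedged sketch, not part of the original post: the result DataFrame and the column/key names come from the question above, everything else is assumed.

    // Sketch: split the decrypted_json string column with json_tuple.
    // Assumes `result` is the DataFrame from above and spark.implicits._ is in scope.
    import org.apache.spark.sql.functions.json_tuple

    val split = result.select(
      $"encrypted_data",
      // json_tuple is a generator: it returns one string column per requested key
      json_tuple($"decrypted_json", "a", "b").as(Seq("a", "b"))
    ).withColumn("a", $"a".cast("double")) // json_tuple yields strings; cast where needed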

Answer 1

Score: 2

The solution below is inspired by one of the solutions given by @Jacek Laskowski:

import org.apache.spark.sql.types._
import spark.implicits._ // for $-notation and toDF; `spark` is the SparkSession (already imported in spark-shell)
val JsonSchema = new StructType()
  .add($"a".string)
  .add($"b".string)
val schema = new StructType()
  .add($"encrypted_data".string)
  .add($"decrypted_json".array(JsonSchema))

val schemaAsJson = schema.json

import org.apache.spark.sql.types.DataType
val dt = DataType.fromJson(schemaAsJson)

import org.apache.spark.sql.functions._

val rawJsons = Seq("""
  {
    "encrypted_data" : "eyJleHAiOjE1",
    "decrypted_json" : [
      {
        "a" : "547.65",
        "b" : "Some Data"
      }
    ]
  }
""").toDF("rawjson")

val people = rawJsons
  .select(from_json($"rawjson", schemaAsJson, Map.empty[String, String]) as "json")
  .select("json.*") // <-- flatten the struct field
  .withColumn("address", explode($"decrypted_json")) // <-- explode the array field
  .drop("decrypted_json")  // <-- no longer needed
  .select("encrypted_data", "address.*") // <-- flatten the struct field

Please go through the link for the original solution and its explanation.
I hope that helps.

Answer 2

Score: 0

Using from_json you can parse the JSON into a struct type and then select columns from that DataFrame. You will need to know the schema of the JSON. Here is how:

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.from_json
    import org.apache.spark.sql.types.{DoubleType, StringType, StructField, StructType}

    val sparkSession: SparkSession = ??? // create or obtain the Spark session here
    import sparkSession.implicits._

    val jsonData = """{"a":547.65 , "b":"Some Data"}"""
    val schema = StructType(
      List(
        StructField("a", DoubleType, nullable = false),
        StructField("b", StringType, nullable = false)
      )
    )

    val df = sparkSession.createDataset(Seq(("dummy data", jsonData))).toDF("string_column", "json_column")
    val dfWithParsedJson = df.withColumn("json_data", from_json($"json_column", schema))

    dfWithParsedJson.select($"string_column", $"json_column", $"json_data.a", $"json_data.b").show()

Result

+-------------+------------------------------+------+---------+
|string_column|json_column                   |a     |b        |
+-------------+------------------------------+------+---------+
|dummy data   |{"a":547.65 , "b":"Some Data"}|547.65|Some Data|
+-------------+------------------------------+------+---------+
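
If the schema is not known up front, one common workaround (not part of the original answer; a hedged sketch reusing df and the column names above) is to let Spark infer the schema from the JSON strings and pass the inferred schema to from_json:

    import sparkSession.implicits._
    import org.apache.spark.sql.functions.from_json

    // sparkSession.read.json accepts a Dataset[String] (Spark 2.2+) and infers the schema
    val inferredSchema = sparkSession.read.json(df.select($"json_column").as[String]).schema

    val parsed = df
      .withColumn("json_data", from_json($"json_column", inferredSchema))
      .select($"string_column", $"json_data.*") // expand every parsed key into its own column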

huangapple
  • Posted on 2020-01-06 21:19:21
  • Please keep this link when reposting: https://go.coder-hub.com/59612868.html