Scala Spark - Split JSON column to multiple columns

Question

Scala noob, using Spark 2.3.0.
I'm creating a DataFrame using a UDF that produces a JSON string column:

val result: DataFrame = df.withColumn("decrypted_json", instance.decryptJsonUdf(df("encrypted_data")))

It outputs as follows:

+----------------+---------------------------------------+
| encrypted_data | decrypted_json                        |
+----------------+---------------------------------------+
|eyJleHAiOjE1 ...| {"a":547.65 , "b":"Some Data"}        |
+----------------+---------------------------------------+

The UDF is external code that I can't change. I would like to split the decrypted_json column into individual columns so that the output DataFrame looks like this:

+----------------+--------+-------------+
| encrypted_data | a      | b           |
+----------------+--------+-------------+
|eyJleHAiOjE1 ...| 547.65 | "Some Data" |
+----------------+--------+-------------+
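
For a JSON string column with a couple of known keys like this, one quick option is json_tuple. This is only a hedged sketch, not part of the original post: the result DataFrame and the column/key names come from the question above, everything else is assumed.

    // Sketch: split the decrypted_json string column with json_tuple.
    // Assumes `result` is the DataFrame from above and spark.implicits._ is in scope.
    import org.apache.spark.sql.functions.json_tuple

    val split = result.select(
      $"encrypted_data",
      // json_tuple is a generator: it returns one string column per requested key
      json_tuple($"decrypted_json", "a", "b").as(Seq("a", "b"))
    ).withColumn("a", $"a".cast("double")) // json_tuple yields strings; cast where needed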

Answer 1

Score: 2

The solution below is inspired by one of the solutions given by @Jacek Laskowski:

import org.apache.spark.sql.types._
import spark.implicits._ // for $-notation and toDF; `spark` is the SparkSession (already imported in spark-shell)
val JsonSchema = new StructType()
  .add($"a".string)
  .add($"b".string)
val schema = new StructType()
  .add($"encrypted_data".string)
  .add($"decrypted_json".array(JsonSchema))

val schemaAsJson = schema.json

import org.apache.spark.sql.types.DataType
val dt = DataType.fromJson(schemaAsJson)

import org.apache.spark.sql.functions._

val rawJsons = Seq("""
  {
    "encrypted_data" : "eyJleHAiOjE1",
    "decrypted_json" : [
      {
        "a" : "547.65",
        "b" : "Some Data"
      }
    ]
  }
""").toDF("rawjson")

val people = rawJsons
  .select(from_json($"rawjson", schemaAsJson, Map.empty[String, String]) as "json")
  .select("json.*") // <-- flatten the struct field
  .withColumn("address", explode($"decrypted_json")) // <-- explode the array field
  .drop("decrypted_json")  // <-- no longer needed
  .select("encrypted_data", "address.*") // <-- flatten the struct field

Please go through the link for the original solution and its explanation.
I hope that helps.

Answer 2

Score: 0

Using from_json you can parse the JSON into a struct type and then select columns from that DataFrame. You will need to know the schema of the JSON. Here is how:

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.from_json
    import org.apache.spark.sql.types.{DoubleType, StringType, StructField, StructType}

    val sparkSession: SparkSession = ??? // create or obtain the Spark session here
    import sparkSession.implicits._

    val jsonData = """{"a":547.65 , "b":"Some Data"}"""
    val schema = StructType(
      List(
        StructField("a", DoubleType, nullable = false),
        StructField("b", StringType, nullable = false)
      )
    )

    val df = sparkSession.createDataset(Seq(("dummy data", jsonData))).toDF("string_column", "json_column")
    val dfWithParsedJson = df.withColumn("json_data", from_json($"json_column", schema))

    dfWithParsedJson.select($"string_column", $"json_column", $"json_data.a", $"json_data.b").show()

Result

+-------------+------------------------------+------+---------+
|string_column|json_column                   |a     |b        |
+-------------+------------------------------+------+---------+
|dummy data   |{"a":547.65 , "b":"Some Data"}|547.65|Some Data|
+-------------+------------------------------+------+---------+
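
If the schema is not known up front, one common workaround (not part of the original answer; a hedged sketch reusing df and the column names above) is to let Spark infer the schema from the JSON strings and pass the inferred schema to from_json:

    import sparkSession.implicits._
    import org.apache.spark.sql.functions.from_json

    // sparkSession.read.json accepts a Dataset[String] (Spark 2.2+) and infers the schema
    val inferredSchema = sparkSession.read.json(df.select($"json_column").as[String]).schema

    val parsed = df
      .withColumn("json_data", from_json($"json_column", inferredSchema))
      .select($"string_column", $"json_data.*") // expand every parsed key into its own column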

huangapple
  • Posted on 2020-01-06 21:19:21
  • Please keep this link when reposting: https://go.coder-hub.com/59612868.html