Scala Spark – Split JSON column to multiple columns

Question

Scala noob, using Spark 2.3.0.
I'm creating a DataFrame using a udf that creates a JSON String column:

```scala
val result: DataFrame = df.withColumn("decrypted_json", instance.decryptJsonUdf(df("encrypted_data")))
```

It outputs as follows:

```
+----------------+--------------------------------+
| encrypted_data | decrypted_json                 |
+----------------+--------------------------------+
|eyJleHAiOjE1 ...| {"a":547.65 , "b":"Some Data"} |
+----------------+--------------------------------+
```

The UDF is external code that I can't change. I would like to split the decrypted_json column into individual columns, so the output DataFrame looks like this:

```
+----------------+--------+-------------+
| encrypted_data |      a |           b |
+----------------+--------+-------------+
|eyJleHAiOjE1 ...| 547.65 | "Some Data" |
+----------------+--------+-------------+
```

Answer 1

Score: 2

The solution below is inspired by one of the solutions given by @Jacek Laskowski:

```scala
import org.apache.spark.sql.types._
import org.apache.spark.sql.functions._
import spark.implicits._ // assuming a SparkSession named spark; needed for $"..." and toDF

val JsonSchema = new StructType()
  .add($"a".string)
  .add($"b".string)

val schema = new StructType()
  .add($"encrypted_data".string)
  .add($"decrypted_json".array(JsonSchema))

val schemaAsJson = schema.json
val dt = DataType.fromJson(schemaAsJson)

val rawJsons = Seq("""
  {
    "encrypted_data" : "eyJleHAiOjE1",
    "decrypted_json" : [
      {
        "a" : "547.65",
        "b" : "Some Data"
      }
    ]
  }
""").toDF("rawjson")

val people = rawJsons
  .select(from_json($"rawjson", schemaAsJson, Map.empty[String, String]) as "json")
  .select("json.*")                                  // <-- flatten the struct field
  .withColumn("address", explode($"decrypted_json")) // <-- explode the array field
  .drop("decrypted_json")                            // <-- no longer needed
  .select("encrypted_data", "address.*")             // <-- flatten the struct field
```


Please go through the link for the original solution with the explanation.
I hope that helps.
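As a side note (an addition, not part of the original answer): in newer Spark releases, from_json also accepts the schema as a DDL-formatted string, which reads more compactly than the JSON representation produced by schema.json; this is worth verifying on 2.3.0 specifically. A minimal sketch, reusing the rawJsons DataFrame and imports from the code above:

```scala
// Sketch, not from the original answer: the same schema expressed as a DDL string.
// Assumes the rawJsons DataFrame and the imports from the code above are in scope.
val ddlSchema = "encrypted_data STRING, decrypted_json ARRAY<STRUCT<a: STRING, b: STRING>>"
val parsedDdl = rawJsons
  .select(from_json($"rawjson", ddlSchema, Map.empty[String, String]) as "json")
  .select("json.*")
```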

Answer 2

Score: 0

Using from_json you can parse the JSON into a struct type, then select columns from that DataFrame. You will need to know the schema of the JSON. Here is how:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.from_json
import org.apache.spark.sql.types.{DoubleType, StringType, StructField, StructType}

val sparkSession = SparkSession.builder().getOrCreate() // create a Spark session
import sparkSession.implicits._

val jsonData = """{"a":547.65 , "b":"Some Data"}"""
val schema = StructType(
  List(
    StructField("a", DoubleType, nullable = false),
    StructField("b", StringType, nullable = false)
  )
)
val df = sparkSession.createDataset(Seq(("dummy data", jsonData))).toDF("string_column", "json_column")
val dfWithParsedJson = df.withColumn("json_data", from_json($"json_column", schema))
dfWithParsedJson.select($"string_column", $"json_column", $"json_data.a", $"json_data.b").show()
```

Result:

```
+-------------+------------------------------+------+---------+
|string_column|json_column                   |a     |b        |
+-------------+------------------------------+------+---------+
|dummy data   |{"a":547.65 , "b":"Some Data"}|547.65|Some Data|
+-------------+------------------------------+------+---------+
```
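Mapping this back to the question, the same technique applies directly to the asker's DataFrame. A hedged sketch: result, encrypted_data, and decrypted_json are the names from the question, and schema is the struct defined in the answer above.

```scala
// Sketch only: column names are taken from the question, and `schema`
// is the StructType defined in the answer above (matches the sample JSON).
val splitDf = result
  .withColumn("json_data", from_json($"decrypted_json", schema))
  .select($"encrypted_data", $"json_data.a", $"json_data.b")
splitDf.show()
```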

Posted by huangapple on 2020-01-06 21:19:21.
Reprints must retain this link: https://go.coder-hub.com/59612868.html