Scala Spark - Split JSON column to multiple columns
Question
Scala noob, using Spark 2.3.0.
I'm creating a DataFrame using a udf that creates a JSON String column:
val result: DataFrame = df.withColumn("decrypted_json", instance.decryptJsonUdf(df("encrypted_data")))
It outputs as follows:
+----------------+---------------------------------------+
| encrypted_data | decrypted_json                        |
+----------------+---------------------------------------+
|eyJleHAiOjE1 ...| {"a":547.65 , "b":"Some Data"}        |
+----------------+---------------------------------------+
The UDF is external code that I can't change. I would like to split the decrypted_json column into individual columns, so that the output DataFrame looks like this:
+----------------+--------+-------------+
| encrypted_data | a      | b           |
+----------------+--------+-------------+
|eyJleHAiOjE1 ...| 547.65 | "Some Data" |
+----------------+--------+-------------+
Answer 1
Score: 2
The solution below is inspired by one of the solutions given by @Jacek Laskowski:
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._
// Needed for the $"..." syntax and toDF; assumes a SparkSession named `spark`, as in spark-shell
import spark.implicits._

// Schema of each element inside the decrypted_json array
val JsonSchema = new StructType()
  .add($"a".string)
  .add($"b".string)
// Schema of the whole document
val schema = new StructType()
  .add($"encrypted_data".string)
  .add($"decrypted_json".array(JsonSchema))
val schemaAsJson = schema.json
// Optional: shows that the JSON schema string parses back into a DataType; not used below
val dt = DataType.fromJson(schemaAsJson)

val rawJsons = Seq("""
  {
    "encrypted_data" : "eyJleHAiOjE1",
    "decrypted_json" : [
      {
        "a" : "547.65",
        "b" : "Some Data"
      }
    ]
  }
""").toDF("rawjson")
// The names `people` and `address` are carried over from the original answer this is based on
val people = rawJsons
  .select(from_json($"rawjson", schemaAsJson, Map.empty[String, String]) as "json")
  .select("json.*") // <-- flatten the struct field
  .withColumn("address", explode($"decrypted_json")) // <-- explode the array field
  .drop("decrypted_json") // <-- no longer needed
  .select("encrypted_data", "address.*") // <-- flatten the struct field
Please see the link for the original solution and its explanation.
I hope that helps.
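In the question, decrypted_json holds a single flat object rather than an array, so the explode step above should not be needed. Here is a minimal sketch of the same parse-then-flatten idea applied to the question's columns. The local session, the hard-coded stand-in for `result`, and the names `jsonSchema`, `parsed` and `flattened` are assumptions made for illustration only:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.from_json
import org.apache.spark.sql.types.{DoubleType, StringType, StructType}

// Assumption: a local session for illustration; in spark-shell `spark` already exists.
val spark = SparkSession.builder().master("local[*]").appName("flat-json").getOrCreate()
import spark.implicits._

// Stand-in for the question's `result` DataFrame (the UDF output is hard-coded here).
val result = Seq(("eyJleHAiOjE1", """{"a":547.65 , "b":"Some Data"}"""))
  .toDF("encrypted_data", "decrypted_json")

// Schema of the flat JSON object: there is no array wrapper, so nothing to explode.
val jsonSchema = new StructType()
  .add("a", DoubleType)
  .add("b", StringType)

val flattened = result
  .withColumn("parsed", from_json($"decrypted_json", jsonSchema)) // JSON string -> struct
  .select($"encrypted_data", $"parsed.*")                          // struct fields -> columns

flattened.show(false)
```

The resulting DataFrame has the columns encrypted_data, a and b. Rows whose JSON cannot be parsed end up with nulls instead of failing the job, so it may be worth checking for those.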
Answer 2
Score: 0
Using from_json you can parse the JSON into a struct type and then select columns from the resulting DataFrame. You will need to know the schema of the JSON. Here is how:
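import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.from_json
import org.apache.spark.sql.types._

// Assumption: a local session for illustration; reuse your existing session if you have one
val sparkSession = SparkSession.builder().master("local[*]").appName("from_json example").getOrCreate()
import sparkSession.implicits._

val jsonData = """{"a":547.65 , "b":"Some Data"}"""
// Schema of the JSON payload
val schema = StructType(
  List(
    StructField("a", DoubleType, nullable = false),
    StructField("b", StringType, nullable = false)
  ))
val df = sparkSession.createDataset(Seq(("dummy data", jsonData))).toDF("string_column", "json_column")
// Parse the JSON string into a struct column
val dfWithParsedJson = df.withColumn("json_data", from_json($"json_column", schema))
// Pull the struct fields out as top-level columns; show(false) keeps the JSON string from being truncated
dfWithParsedJson.select($"string_column", $"json_column", $"json_data.a", $"json_data.b").show(false)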
Result
+-------------+------------------------------+------+---------+
|string_column|json_column                   |a     |b        |
+-------------+------------------------------+------+---------+
|dummy data   |{"a":547.65 , "b":"Some Data"}|547.65|Some Data|
+-------------+------------------------------+------+---------+
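If you would rather not declare a full schema and only need a few known fields, get_json_object with a JSON path is one possible alternative. This is a sketch under assumptions, not part of the answer above: the local session, the hard-coded stand-in for the question's `result` DataFrame, and the name `extracted` are made up for illustration.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.get_json_object

// Assumption: a local session for illustration; in spark-shell `spark` already exists.
val spark = SparkSession.builder().master("local[*]").appName("json-path").getOrCreate()
import spark.implicits._

// Stand-in for the question's `result` DataFrame (the UDF output is hard-coded here).
val result = Seq(("eyJleHAiOjE1", """{"a":547.65 , "b":"Some Data"}"""))
  .toDF("encrypted_data", "decrypted_json")

val extracted = result.select(
  $"encrypted_data",
  // get_json_object returns strings, so numeric fields need an explicit cast
  get_json_object($"decrypted_json", "$.a").cast("double").as("a"),
  get_json_object($"decrypted_json", "$.b").as("b")
)

extracted.show(false)
```

The trade-off is that each field is extracted with its own call over the JSON string and comes back as a string, so from_json with a schema is probably the better fit when many fields or nested types are involved.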