PySpark: creating a nested JSON file

Question

I want to create a nested JSON file in PySpark from the data shown below.

The file should have the following structure:

{
  "NewData": [
    {"id": "1", "number": "smith", "name": "uber", "age": 12},
    {"id": "2", "number": "jon", "name": "lunch", "age": 13},
    {"id": "3", "number": "jocelyn", "name": "rental", "age": 15},
    {"id": "4", "number": "megan", "name": "sds", "age": 15}
  ]
}

How do I write the correct output to a JSON file?

Can you help me achieve this?

from pyspark.sql import functions as F
from pyspark.sql.types import StructType, StructField, IntegerType, StringType

# Assumes an existing SparkSession named `spark` (e.g. in a notebook or the pyspark shell)
data = [(1, 12, "smith", "uber"),
        (2, 13, "jon", "lunch"),
        (3, 15, "jocelyn", "rental"),
        (4, 15, "megan", "sds")]

schema = StructType([
    StructField('id', IntegerType(), True),
    StructField('age', IntegerType(), True),
    StructField('number', StringType(), True),
    StructField('name', StringType(), True)
])

df = spark.createDataFrame(data, schema)

df.show(truncate=False)

df = df.withColumn("NewData", F.lit("NewData"))

df2 = df.groupBy('NewData').agg(
    F.collect_list(
        F.to_json(F.struct('id', 'number', 'name', 'age'))
    ).alias('values')
)

df2.show(truncate=False)

Answer 1

Score: 0


You don't need the to_json function here. It serializes each struct to a JSON string, so collect_list gives you an array of strings instead of an array of nested objects.
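
To see why to_json gets in the way, here is a small plain-Python sketch of the same effect (no Spark required). It uses only the standard json module, and the variable names are illustrative rather than taken from the original post. Serializing each record to a string and then serializing the list again yields an array of escaped strings, whereas keeping the records structured yields the nested objects the question asks for.

```python
import json

records = [
    {"id": 1, "number": "smith", "name": "uber", "age": 12},
    {"id": 2, "number": "jon", "name": "lunch", "age": 13},
]

# Analogous to collect_list(to_json(struct(...))): each element is already a
# JSON string, so the outer serialization escapes its quotes.
as_strings = json.dumps({"NewData": [json.dumps(r) for r in records]})

# Analogous to collect_list(struct(...)): elements stay structured and are
# serialized only once, as nested objects.
as_objects = json.dumps({"NewData": records})

print(as_strings)
print(as_objects)
```

The first line printed contains backslash-escaped quotes inside each array element; the second contains plain nested objects.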

  • groupBy -> on a constant value, so all rows fall into one group
  • agg() -> collect_list over a struct of the columns, aliased as NewData

Example:

from pyspark.sql.functions import collect_list, lit, struct
from pyspark.sql.types import StructType, StructField, IntegerType, StringType

data = [(1, 12, "smith", "uber"),
        (2, 13, "jon", "lunch"),
        (3, 15, "jocelyn", "rental"),
        (4, 15, "megan", "sds")]

schema = StructType([
    StructField('id', IntegerType(), True),
    StructField('age', IntegerType(), True),
    StructField('number', StringType(), True),
    StructField('name', StringType(), True)
])

df = spark.createDataFrame(data, schema)

df.show(truncate=False)

df2 = (
    df.groupBy(lit(1))
      .agg(collect_list(struct('id', 'number', 'name', 'age')).alias('NewData'))
      .drop("1")  # drop the constant grouping column
)

# <directory_path> is a placeholder for the output directory
df2.write.mode("overwrite").format("json").save("<directory_path>")

# dbutils is Databricks-specific; <file_path> is a placeholder for the written file
print(dbutils.fs.head("<file_path>"))
# {"NewData":[{"id":1,"number":"smith","name":"uber","age":12},{"id":2,"number":"jon","name":"lunch","age":13},{"id":3,"number":"jocelyn","name":"rental","age":15},{"id":4,"number":"megan","name":"sds","age":15}]}
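
Two details of the result above are worth noting. Spark's JSON writer produces a directory of part files rather than one named file, and the ids come out as integers ("id":1), while the question asks for strings ("id":"1"). For a dataset this small, one possible workaround is to build the document on the driver with the standard json module. The sketch below is an illustration under stated assumptions, not part of the original answer: write_nested_json is a hypothetical helper, and in PySpark its rows argument could come from [row.asDict() for row in df.collect()].

```python
import json
import os
import tempfile


def write_nested_json(rows, path, key="NewData"):
    """Write rows (a list of dicts) to one JSON file under a single top-level key.

    Hypothetical helper: ids are converted to strings and the fields are
    ordered to match the structure requested in the question.
    """
    records = [
        {
            "id": str(row["id"]),
            "number": row["number"],
            "name": row["name"],
            "age": row["age"],
        }
        for row in rows
    ]
    with open(path, "w", encoding="utf-8") as f:
        json.dump({key: records}, f, indent=2)


# Stand-in for [row.asDict() for row in df.collect()], so the sketch runs without Spark.
rows = [
    {"id": 1, "age": 12, "number": "smith", "name": "uber"},
    {"id": 2, "age": 13, "number": "jon", "name": "lunch"},
    {"id": 3, "age": 15, "number": "jocelyn", "name": "rental"},
    {"id": 4, "age": 15, "number": "megan", "name": "sds"},
]

# Temporary location chosen only for this illustration.
out_path = os.path.join(tempfile.mkdtemp(), "new_data.json")
write_nested_json(rows, out_path)

with open(out_path, encoding="utf-8") as f:
    print(f.read())
```

Collecting to the driver only makes sense when the data fits comfortably in memory. To stay inside Spark, casting the column before building the struct, for example col('id').cast('string').alias('id'), should also give string ids.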

huangapple
  • Posted on July 27, 2023, 21:52:26
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76780444.html