PySpark: creating a nested JSON file
Question
I want to use PySpark to create a nested JSON file from my data. The output should have the following structure:
{
    "NewData": [
        {"id": "1", "number": "smith", "name": "uber", "age": 12},
        {"id": "2", "number": "jon", "name": "lunch", "age": 13},
        {"id": "3", "number": "jocelyn", "name": "rental", "age": 15},
        {"id": "4", "number": "megan", "name": "sds", "age": 15}
    ]
}
How do I write this output to a JSON file correctly? Can you help me achieve this? Here is my attempt:
from pyspark.sql import functions as F
from pyspark.sql.types import StructType, StructField, IntegerType, StringType

data = [(1, 12, "smith", "uber"),
        (2, 13, "jon", "lunch"),
        (3, 15, "jocelyn", "rental"),
        (4, 15, "megan", "sds")]

schema = StructType([
    StructField('id', IntegerType(), True),
    StructField('age', IntegerType(), True),
    StructField('number', StringType(), True),
    StructField('name', StringType(), True)
])

df = spark.createDataFrame(data, schema)
df.show(truncate=False)

# Add a constant column to group on, then collect each row as a JSON string.
df = df.withColumn("NewData", F.lit("NewData"))
df2 = df.groupBy('NewData').agg(
    F.collect_list(F.to_json(F.struct('id', 'number', 'name', 'age'))).alias('values')
)

df2.show(truncate=False)
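A quick way to see why this attempt does not produce nested objects: to_json converts each struct into a string, so values ends up as an array of strings that get escaped when written out. Inspecting the schema makes this visible:

# values is array<string> here, so each element is a quoted, escaped JSON
# string rather than a nested object.
df2.printSchema()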
Answer 1
Score: 0
You don't have to use the to_json function, because it returns a JSON string rather than a nested struct.
Instead, groupBy on a constant value, then agg() with the alias NewData.
Example:
from pyspark.sql.functions import *
from pyspark.sql.types import *

data = [(1, 12, "smith", "uber"),
        (2, 13, "jon", "lunch"),
        (3, 15, "jocelyn", "rental"),
        (4, 15, "megan", "sds")]

schema = StructType([
    StructField('id', IntegerType(), True),
    StructField('age', IntegerType(), True),
    StructField('number', StringType(), True),
    StructField('name', StringType(), True)
])

df = spark.createDataFrame(data, schema)
df.show(truncate=False)

# Group every row under a constant key, collect the rows as an array of
# structs named NewData, then drop the helper grouping column.
df2 = df.groupBy(lit(1)).agg(
    collect_list(struct('id', 'number', 'name', 'age')).alias('NewData')
).drop("1")

df2.write.mode("overwrite").format("json").save("<directory_path>")
print(dbutils.fs.head("<file_path>"))
# {"NewData":[{"id":1,"number":"smith","name":"uber","age":12},{"id":2,"number":"jon","name":"lunch","age":13},{"id":3,"number":"jocelyn","name":"rental","age":15},{"id":4,"number":"megan","name":"sds","age":15}]}
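Note that dbutils is only available on Databricks, and Spark writes one part-file per partition rather than a single named file. If you need one human-readable file, a minimal sketch is to collect the one-row result to the driver and write it with plain Python, assuming the aggregated frame fits in driver memory (the output path newdata.json is just an illustration):

import json

# df2 holds a single row whose only column, NewData, is the array of structs.
row_json = df2.toJSON().first()
with open("newdata.json", "w") as f:
    json.dump(json.loads(row_json), f, indent=4)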