PySpark: creating a nested JSON file
Question
I want to use PySpark to create a nested JSON file from my data. The output should have the following structure:
{
    "NewData": [
        {"id": "1", "number": "smith", "name": "uber", "age": 12},
        {"id": "2", "number": "jon", "name": "lunch", "age": 13},
        {"id": "3", "number": "jocelyn", "name": "rental", "age": 15},
        {"id": "4", "number": "megan", "name": "sds", "age": 15}
    ]
}
How do I write this output to a JSON file correctly? Can you help me achieve this? Here is my attempt:
from pyspark.sql import functions as F
from pyspark.sql.types import StructType, StructField, IntegerType, StringType

data = [(1, 12, "smith", "uber"),
        (2, 13, "jon", "lunch"),
        (3, 15, "jocelyn", "rental"),
        (4, 15, "megan", "sds")]

schema = StructType([
    StructField('id', IntegerType(), True),
    StructField('age', IntegerType(), True),
    StructField('number', StringType(), True),
    StructField('name', StringType(), True)
])

df = spark.createDataFrame(data, schema)
df.show(truncate=False)

# Add a constant column to group on, then collect each row as a JSON string.
df = df.withColumn("NewData", F.lit("NewData"))
df2 = df.groupBy('NewData').agg(
    F.collect_list(F.to_json(F.struct('id', 'number', 'name', 'age'))).alias('values')
)

df2.show(truncate=False)
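A quick way to see why this attempt does not produce nested objects: to_json converts each struct into a string, so values ends up as an array of strings that get escaped when written out. Inspecting the schema makes this visible:

# values is array<string> here, so each element is a quoted, escaped JSON
# string rather than a nested object.
df2.printSchema()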
Answer 1
Score: 0
You don't have to use the to_json function, because it returns a JSON string rather than a nested struct.
Instead, groupBy on a constant value, then agg() with the alias NewData.
Example:
from pyspark.sql.functions import *
from pyspark.sql.types import *

data = [(1, 12, "smith", "uber"),
        (2, 13, "jon", "lunch"),
        (3, 15, "jocelyn", "rental"),
        (4, 15, "megan", "sds")]

schema = StructType([
    StructField('id', IntegerType(), True),
    StructField('age', IntegerType(), True),
    StructField('number', StringType(), True),
    StructField('name', StringType(), True)
])

df = spark.createDataFrame(data, schema)
df.show(truncate=False)

# Group every row under a constant key, collect the rows as an array of
# structs named NewData, then drop the helper grouping column.
df2 = df.groupBy(lit(1)).agg(
    collect_list(struct('id', 'number', 'name', 'age')).alias('NewData')
).drop("1")

df2.write.mode("overwrite").format("json").save("<directory_path>")
print(dbutils.fs.head("<file_path>"))
# {"NewData":[{"id":1,"number":"smith","name":"uber","age":12},{"id":2,"number":"jon","name":"lunch","age":13},{"id":3,"number":"jocelyn","name":"rental","age":15},{"id":4,"number":"megan","name":"sds","age":15}]}
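Note that dbutils is only available on Databricks, and Spark writes one part-file per partition rather than a single named file. If you need one human-readable file, a minimal sketch is to collect the one-row result to the driver and write it with plain Python, assuming the aggregated frame fits in driver memory (the output path newdata.json is just an illustration):

import json

# df2 holds a single row whose only column, NewData, is the array of structs.
row_json = df2.toJSON().first()
with open("newdata.json", "w") as f:
    json.dump(json.loads(row_json), f, indent=4)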