PySpark: parsing nested JSON ignores all the keys


Question

I have a single-line JSON file and am trying to parse and store it using PySpark.

Raw file content of 'path.json':

{"number": 34, "tool": {"name": "temp", "guid": null, "version": "2.13:1"}}

code in pyspark:

df = spark.read.options(multiline=True, dropFieldIfAllNull=False).json('path.json')
df.show(1, False)
+------+--------------------+
|number|tool                |
+------+--------------------+
|34    |{null, temp, 2.13:1}|
+------+--------------------+

df.select('tool').show(1, False)
+--------------------+
|tool                |
+--------------------+
|{null, temp, 2.13:1}|
+--------------------+

As you can see, the key "tool" is rendered as {null, temp, 2.13:1}, and when I store this to the database it is stored the same way. However, I want to keep it in the proper key-value format, like {"name": "temp", "guid": null, "version": "2.13:1"}.
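For reference, the target format is ordinary JSON. A minimal plain-Python sketch of the key-value form the database row should contain (using the values from the raw file above):

```python
import json

# The struct rendering Spark shows keeps only the values: {null, temp, 2.13:1}
# The desired form keeps the keys; json.dumps serializes Python's
# None as JSON null, matching the raw file content.
tool = {"name": "temp", "guid": None, "version": "2.13:1"}
desired = json.dumps(tool)
print(desired)  # {"name": "temp", "guid": null, "version": "2.13:1"}
```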

Storing to database:

from pyspark.sql.functions import col

df2 = df_with_ts.withColumn('tool', col('tool').cast('string'))

df2.write \
    .format("") \
    .option("url", url) \
    .option("dbtable", "temp") \
    .option("tempdir", "path") \
    .mode("append") \
    .save()

I want this stored as:

{"name": "temp", "guid": null, "version": "2.13:1"}
Answer 1

Score: 0

AFAIU, what you need is to stringify the tool column; you can do that using the to_json or concat functions.

The tool column is already a struct with keys and values, but casting the struct to a string keeps only the list of values.

A more generic option is to dump it as JSON using to_json:

from pyspark.sql.functions import to_json

df.withColumn('toolStr', to_json('tool', options={"ignoreNullFields":False})).show(10, False)
+------+--------------------+----------------------------------------------+
|number|tool                |toolStr                                       |
+------+--------------------+----------------------------------------------+
|34    |{null, temp, 2.13:1}|{"guid":null,"name":"temp","version":"2.13:1"}|
+------+--------------------+----------------------------------------------+
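Unlike the cast, the toolStr value produced by to_json round-trips as real JSON. A quick plain-Python sanity check, using the string from the row above:

```python
import json

# The to_json output is valid JSON, so it parses back
# into a key-value mapping with the null preserved.
tool_str = '{"guid":null,"name":"temp","version":"2.13:1"}'
parsed = json.loads(tool_str)
print(parsed["version"])  # 2.13:1
print(parsed["guid"])     # None
```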

If you need a custom format while serializing the struct, you can use the concat function:

from pyspark.sql.functions import concat, lit, when, isnull, col

df = spark.read.options(multiline=True, dropFieldIfAllNull=False).json('path.json')

df.withColumn('toolStr', concat(
    lit('{"name": '), when(isnull('tool.name'), lit('""')).otherwise(col('tool.name')),
    lit(', "guid": '), when(isnull('tool.guid'), lit('""')).otherwise(col('tool.guid')),
    lit(', "version": '), when(isnull('tool.version'), lit('""')).otherwise(col('tool.version')), lit('}')
)).show(1, False)
+------+--------------------+---------------------------------------------+
|number|tool                |toolStr                                      |
+------+--------------------+---------------------------------------------+
|34    |{null, temp, 2.13:1}|{"name": temp, "guid": "", "version": 2.13:1}|
+------+--------------------+---------------------------------------------+
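One caveat: the concatenated string above is not strictly valid JSON, because the string values (temp, 2.13:1) are left unquoted. A quick plain-Python check confirms this, so prefer to_json if downstream consumers will parse the column:

```python
import json

# The custom-concatenated string from the row above: values like
# temp and 2.13:1 are not quoted, so a JSON parser rejects it.
concat_str = '{"name": temp, "guid": "", "version": 2.13:1}'
try:
    json.loads(concat_str)
    is_valid = True
except json.JSONDecodeError:
    is_valid = False
print(is_valid)  # False
```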

huangapple — published 2023-06-16 15:54:39. Please keep the original link when reposting: https://go.coder-hub.com/76488075.html