PySpark parsing nested JSON ignores all the keys

Question
I have a single-line JSON file and am trying to parse and store it using PySpark.

Raw file content of 'path.json':
{"number": 34, "tool": {"name": "temp", "guid": null, "version": "2.13:1"}}
Code in PySpark:
df = spark.read.options(multiline=True, dropFieldIfAllNull=False).json('path.json')
df.show(1, False)
+------+--------------------+
|number|tool |
+------+--------------------+
|34    |{null, temp, 2.13:1}|
+------+--------------------+
df.select('tool').show(1, False)
+--------------------+
|tool |
+--------------------+
|{null, temp, 2.13:1}|
+--------------------+
As you can see, the structure of the key "tool" is {null, temp, 2.13:1}, and when I store this to the database it is stored the same way. However, I want to keep it in proper key-value format, like {"name": "temp", "guid": null, "version": "2.13:1"}.
Storing to the database:
df2 = df_with_ts.withColumn('tool', col('tool').cast('string'))
df2.write \
.format("") \
.option("url", url) \
.option("dbtable", "temp") \
.option("tempdir", "path") \
.mode("append") \
.save()
I want this to be stored like:
{"name": "temp", "guid": null, "version": "2.13:1"}
Answer 1

Score: 0
AFAIU, what you need is to stringify the tool column; you can do that using the to_json or concat functions. The tool column is already a Struct with keys and values, but packing the Struct into a string keeps only the list of values.

A more generic option, dumping it as JSON, is to use to_json:
from pyspark.sql.functions import to_json
df.withColumn('toolStr', to_json('tool', options={"ignoreNullFields":False})).show(10, False)
+------+--------------------+----------------------------------------------+
|number|tool |toolStr |
+------+--------------------+----------------------------------------------+
|34 |{null, temp, 2.13:1}|{"guid":null,"name":"temp","version":"2.13:1"}|
+------+--------------------+----------------------------------------------+
If you need a custom format while dumping the Struct, you can use the concat function:
from pyspark.sql.functions import concat, lit, when, isnull, col

df = spark.read.options(multiline=True, dropFieldIfAllNull=False).json('path.json')
df.withColumn('toolStr', concat(
    lit('{"name": '), when(isnull('tool.name'), lit('""')).otherwise(col('tool.name')),
    lit(', "guid": '), when(isnull('tool.guid'), lit('""')).otherwise(col('tool.guid')),
    lit(', "version": '), when(isnull('tool.version'), lit('""')).otherwise(col('tool.version')),
    lit('}'),
)).show(1, False)
+------+--------------------+---------------------------------------------+
|number|tool |toolStr |
+------+--------------------+---------------------------------------------+
|34 |{null, temp, 2.13:1}|{"name": temp, "guid": "", "version": 2.13:1}|
+------+--------------------+---------------------------------------------+