2023-06-16 15:54:39

PySpark parsing nested JSON ignores all the keys

Question

I have a single-line JSON file that I am trying to parse and store using PySpark.

Raw file content of 'path.json':

{"number": 34, "tool": {"name": "temp", "guid": null, "version": "2.13:1"}}

Code in PySpark:

df = spark.read.options(multiline=True, dropFieldIfAllNull=False).json('path.json')
df.show(1, False)
+------+--------------------+
|number|tool                |
+------+--------------------+
|34    |{null, temp, 2.13:1}|
+------+--------------------+

df.select('tool').show(1, False)
+--------------------+
|tool                |
+--------------------+
|{null, temp, 2.13:1}|
+--------------------+

As you can see, the "tool" column is rendered as {null, temp, 2.13:1}, and when I store it in the database it is stored the same way. However, I want to keep it in proper key-value format, like {"name": "temp", "guid": null, "version": "2.13:1"}.

Storing to database:

df2 = df_with_ts.withColumn('tool', col('tool').cast('string'))

df2.write \
    .format("") \
    .option("url", url) \
    .option("dbtable", "temp") \
    .option("tempdir", "path") \
    .mode("append") \
    .save()

I want it to be stored like this:

{"name": "temp", "guid": null, "version": "2.13:1"}

Answer 1

Score: 0

AFAIU, what you need is to stringify the tool column; you can do that with the to_json or concat functions.

The tool column is already a Struct with keys and values, but casting a Struct to a string keeps only the list of values.

A more generic option is to dump it as JSON with to_json:

from pyspark.sql.functions import to_json

df.withColumn('toolStr', to_json('tool', options={"ignoreNullFields":False})).show(10, False)

+------+--------------------+----------------------------------------------+
|number|tool                |toolStr                                       |
+------+--------------------+----------------------------------------------+
|34    |{null, temp, 2.13:1}|{"guid":null,"name":"temp","version":"2.13:1"}|
+------+--------------------+----------------------------------------------+

If you need a custom format when dumping the Struct, you can use the concat function:

from pyspark.sql.functions import concat, lit, when, isnull, col

df = spark.read.options(multiline=True, dropFieldIfAllNull=False).json('path.json')

# concat returns null if any argument is null, so null fields are replaced with '""'.
# Values are inserted unquoted, so the result is a custom format, not valid JSON.
df.withColumn('toolStr', concat(
    lit('{"name": '), when(isnull('tool.name'), lit('""')).otherwise(col('tool.name')),
    lit(', "guid": '), when(isnull('tool.guid'), lit('""')).otherwise(col('tool.guid')),
    lit(', "version": '), when(isnull('tool.version'), lit('""')).otherwise(col('tool.version')),
    lit('}')
)).show(1, False)

+------+--------------------+---------------------------------------------+
|number|tool                |toolStr                                      |
+------+--------------------+---------------------------------------------+
|34    |{null, temp, 2.13:1}|{"name": temp, "guid": "", "version": 2.13:1}|
+------+--------------------+---------------------------------------------+
