Parse additional fields (struct) from JSON into separate columns in PySpark


Question

I have a JSON file with a field named "additionalFields", as shown below:

"additionalFields":
[
   {
      "fieldName":"customer_name",
      "fieldValue":"ABC"
   },
   {
      "fieldName":"deviceid",
      "fieldValue":"1234"
   },
   {
      "fieldName":"txn_id",
      "fieldValue":"2"
   },
   {
      "fieldName":"txn_date",
      "fieldValue":"2017-08-14T18:17:37"
   },
   {
      "fieldName":"orderid",
      "fieldValue":"I126101"
   }
]

How can I parse this into separate columns? For example, "customer_name" should become a column and "ABC" its value.

I tried parsing it as an ArrayType, but I get multiple rows under two columns, "fieldName" and "fieldValue".
I want each item under fieldName to become a column, with the corresponding fieldValue as that column's value.


Answer 1

Score: 1

Depending on the size of your JSON file, you can also open it with the json library and build the DataFrame data by working on the dictionaries:

# Assuming the data can be loaded with Python's json library.
data = [
    {"fieldName": "customer_name", "fieldValue": "ABC"},
    {"fieldName": "deviceid", "fieldValue": "1234"},
    {"fieldName": "txn_id", "fieldValue": "2"},
    {"fieldName": "txn_date", "fieldValue": "2017-08-14T18:17:37"},
    {"fieldName": "orderid", "fieldValue": "I126101"},
]

# Merge every fieldName/fieldValue pair into a single dict, so the
# DataFrame gets one row with one column per field.
df_data = [{d["fieldName"]: d["fieldValue"] for d in data}]

df = spark.createDataFrame(df_data)

I hope this helps.
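A minimal sketch of the loading step the answer's comment assumes, with the JSON inlined here instead of read from a file (in practice you would use `json.load` on your actual file; the path is yours to fill in):

```python
import json

# Inlined sample; normally: payload = json.load(open("path/to/sample.json"))
raw = '''
{"additionalFields": [
    {"fieldName": "customer_name", "fieldValue": "ABC"},
    {"fieldName": "deviceid", "fieldValue": "1234"},
    {"fieldName": "txn_id", "fieldValue": "2"},
    {"fieldName": "txn_date", "fieldValue": "2017-08-14T18:17:37"},
    {"fieldName": "orderid", "fieldValue": "I126101"}
]}
'''
payload = json.loads(raw)

# Collapse the fieldName/fieldValue pairs into one mapping -> one row.
row = {f["fieldName"]: f["fieldValue"] for f in payload["additionalFields"]}
```

Passing `[row]` to `spark.createDataFrame` then yields a single-row DataFrame with one column per field.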

英文:

Depending on the size of your JSON, you can also open it using the json library and create the DataFrame data by working on the dictionaries:

# Assuming you can load the data using the json python library.
data = [
    {"fieldName": "customer_name", "fieldValue": "ABC"},
    {"fieldName": "deviceid", "fieldValue": "1234"},
    {"fieldName": "txn_id", "fieldValue": "2"},
    {"fieldName": "txn_date", "fieldValue": "2017-08-14T18:17:37"},
    {"fieldName": "orderid", "fieldValue": "I126101"},
]

df_data = [{d["fieldName"]: d["fieldValue"]} for d in data]

df = spark.createDataFrame(df_data)

Answer 2

Score: 0

I hope this is how your sample data looks:

[
{"fieldName":"customer_name","fieldValue":"ABC"},
{"fieldName":"deviceid","fieldValue":"1234"},
{"fieldName":"txn_id","fieldValue":"2"},
{"fieldName":"txn_date","fieldValue":"2017-08-14T18:17:37"},
{"fieldName":"orderid","fieldValue":"I126101"}
]

1. Read the sample JSON file using Spark's read method:

df = spark.read.options(multiLine=True).json("path/to/sample.json")

2. This function extracts the data column-wise:

def return_result(df, column):
    return df.select(column).rdd.map(lambda row: row[column]).collect()

3. Apply the function to all the columns:

records = [return_result(df, field) for field in df.columns]

This is what records looks like:

[['customer_name', 'deviceid', 'txn_id', 'txn_date', 'orderid'], ['ABC', '1234', '2', '2017-08-14T18:17:37', 'I126101']]

4. The first record is the header; the rest are data:

columns, data = records[0], records[1:]

5. Create a DataFrame from the collected data:

converted_df = sc.parallelize(data).toDF(columns)
converted_df.show()

Output:

+-------------+--------+------+-------------------+-------+
|customer_name|deviceid|txn_id|           txn_date|orderid|
+-------------+--------+------+-------------------+-------+
|          ABC|    1234|     2|2017-08-14T18:17:37|I126101|
+-------------+--------+------+-------------------+-------+
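The collect-per-column trick above is essentially a transpose of the two original columns. A plain-Python sketch of the header/data split, with the values hard-coded here rather than collected from Spark:

```python
# What `records` looks like after collecting each column
# (hard-coded here instead of pulled from a DataFrame).
records = [
    ["customer_name", "deviceid", "txn_id", "txn_date", "orderid"],
    ["ABC", "1234", "2", "2017-08-14T18:17:37", "I126101"],
]

# First record is the header, the rest are data rows.
columns, data = records[0], records[1:]

# Each data row pairs up positionally with the header,
# which is exactly what toDF(columns) relies on.
first_row = dict(zip(columns, data[0]))
```

This makes explicit why the order of `df.columns` matters: the values line up with the header purely by position.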

huangapple
  • Published 2023-03-03 19:13:28
  • Please keep this link when reposting: https://go.coder-hub.com/75626365.html