Parse additional fields (struct) from JSON into separate columns in PySpark


Question

I have a JSON file with a field named "additionalFields", as shown below:

"additionalFields":
[
   {
      "fieldName":"customer_name",
      "fieldValue":"ABC"
   },
   {
      "fieldName":"deviceid",
      "fieldValue":"1234"
   },
   {
      "fieldName":"txn_id",
      "fieldValue":"2"
   },
   {
      "fieldName":"txn_date",
      "fieldValue":"2017-08-14T18:17:37"
   },
   {
      "fieldName":"orderid",
      "fieldValue":"I126101"
   }
]

How can I parse this into separate columns? For example, "customer_name" should become a column and "ABC" its value.

I tried parsing it as an ArrayType, but I get multiple rows under two columns, "fieldName" and "fieldValue".
I want each item under fieldName to become a column, with the corresponding fieldValue as that column's value.


Answer 1

Score: 1

Depending on the size of your JSON file, you can also open it with the json library and build the DataFrame data by working on the dictionaries:

# Assuming the data can be loaded with Python's json library.
data = [
    {"fieldName": "customer_name", "fieldValue": "ABC"},
    {"fieldName": "deviceid", "fieldValue": "1234"},
    {"fieldName": "txn_id", "fieldValue": "2"},
    {"fieldName": "txn_date", "fieldValue": "2017-08-14T18:17:37"},
    {"fieldName": "orderid", "fieldValue": "I126101"},
]

# Merge every fieldName/fieldValue pair into a single dict, so the
# DataFrame gets one row with one column per field.
df_data = [{d["fieldName"]: d["fieldValue"] for d in data}]

df = spark.createDataFrame(df_data)

I hope this helps.
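A minimal sketch of the loading step the answer's comment assumes, with the JSON inlined here instead of read from a file (in practice you would use `json.load` on your actual file; the path is yours to fill in):

```python
import json

# Inlined sample; normally: payload = json.load(open("path/to/sample.json"))
raw = '''
{"additionalFields": [
    {"fieldName": "customer_name", "fieldValue": "ABC"},
    {"fieldName": "deviceid", "fieldValue": "1234"},
    {"fieldName": "txn_id", "fieldValue": "2"},
    {"fieldName": "txn_date", "fieldValue": "2017-08-14T18:17:37"},
    {"fieldName": "orderid", "fieldValue": "I126101"}
]}
'''
payload = json.loads(raw)

# Collapse the fieldName/fieldValue pairs into one mapping -> one row.
row = {f["fieldName"]: f["fieldValue"] for f in payload["additionalFields"]}
```

Passing `[row]` to `spark.createDataFrame` then yields a single-row DataFrame with one column per field.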

英文:

Depending on the size of your JSON, you can also open it using the json library and create the DataFrame data by working on the dictionaries:

# Assuming you can load the data using the json python library.
data = [
    {"fieldName": "customer_name", "fieldValue": "ABC"},
    {"fieldName": "deviceid", "fieldValue": "1234"},
    {"fieldName": "txn_id", "fieldValue": "2"},
    {"fieldName": "txn_date", "fieldValue": "2017-08-14T18:17:37"},
    {"fieldName": "orderid", "fieldValue": "I126101"},
]

df_data = [{d["fieldName"]: d["fieldValue"]} for d in data]

df = spark.createDataFrame(df_data)

Answer 2

Score: 0

I hope this is how your sample data looks:

[
{"fieldName":"customer_name","fieldValue":"ABC"},
{"fieldName":"deviceid","fieldValue":"1234"},
{"fieldName":"txn_id","fieldValue":"2"},
{"fieldName":"txn_date","fieldValue":"2017-08-14T18:17:37"},
{"fieldName":"orderid","fieldValue":"I126101"}
]

1. Read the sample JSON file using Spark's read method:

df = spark.read.options(multiLine=True).json("path/to/sample.json")

2. This function extracts the data column-wise:

def return_result(df, column):
    return df.select(column).rdd.map(lambda row: row[column]).collect()

3. Apply the function to all the columns:

records = [return_result(df, field) for field in df.columns]

This is what records looks like:

[['customer_name', 'deviceid', 'txn_id', 'txn_date', 'orderid'], ['ABC', '1234', '2', '2017-08-14T18:17:37', 'I126101']]

4. The first record is the header; the rest are data:

columns, data = records[0], records[1:]

5. Create a DataFrame from the collected data:

converted_df = sc.parallelize(data).toDF(columns)
converted_df.show()

Output:

+-------------+--------+------+-------------------+-------+
|customer_name|deviceid|txn_id|           txn_date|orderid|
+-------------+--------+------+-------------------+-------+
|          ABC|    1234|     2|2017-08-14T18:17:37|I126101|
+-------------+--------+------+-------------------+-------+
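The collect-per-column trick above is essentially a transpose of the two original columns. A plain-Python sketch of the header/data split, with the values hard-coded here rather than collected from Spark:

```python
# What `records` looks like after collecting each column
# (hard-coded here instead of pulled from a DataFrame).
records = [
    ["customer_name", "deviceid", "txn_id", "txn_date", "orderid"],
    ["ABC", "1234", "2", "2017-08-14T18:17:37", "I126101"],
]

# First record is the header, the rest are data rows.
columns, data = records[0], records[1:]

# Each data row pairs up positionally with the header,
# which is exactly what toDF(columns) relies on.
first_row = dict(zip(columns, data[0]))
```

This makes explicit why the order of `df.columns` matters: the values line up with the header purely by position.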

huangapple
  • Published 2023-03-03 19:13:28
  • Please keep this link when reposting: https://go.coder-hub.com/75626365.html