Parse additional fields struct from JSON into separate columns in PySpark
Question
I have a JSON file with a field named "additionalFields" as below:
"additionalFields":
[
    {
        "fieldName": "customer_name",
        "fieldValue": "ABC"
    },
    {
        "fieldName": "deviceid",
        "fieldValue": "1234"
    },
    {
        "fieldName": "txn_id",
        "fieldValue": "2"
    },
    {
        "fieldName": "txn_date",
        "fieldValue": "2017-08-14T18:17:37"
    },
    {
        "fieldName": "orderid",
        "fieldValue": "I126101"
    }
]
How can I parse this into separate columns? For example, customer_name should become a column and ABC should be its value.
I tried parsing it as an ArrayType but got multiple rows with the columns "fieldName" and "fieldValue".
I want each item under fieldName to become a column, with the corresponding fieldValue as that column's value.
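For reference, a minimal sketch of the kind of read-and-explode that produces that two-column, multi-row shape, assuming the file is a single JSON object containing the additionalFields array shown above (the path is hypothetical):

from pyspark.sql import functions as F

# Hypothetical path; multiLine handles pretty-printed JSON.
df = spark.read.options(multiLine=True).json("path/to/input.json")

# Exploding the array yields one row per {fieldName, fieldValue} struct,
# i.e. the two columns and multiple rows described above.
exploded = (
    df.select(F.explode("additionalFields").alias("f"))
      .select("f.fieldName", "f.fieldValue")
)
exploded.show()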
Answer 1
Score: 1
Depending on the size of your JSON, you can also open it using the json
library and create the DataFrame data by working on the dictionaries:
# Assuming you can load the data using the json python library.
data = [
    {"fieldName": "customer_name", "fieldValue": "ABC"},
    {"fieldName": "deviceid", "fieldValue": "1234"},
    {"fieldName": "txn_id", "fieldValue": "2"},
    {"fieldName": "txn_date", "fieldValue": "2017-08-14T18:17:37"},
    {"fieldName": "orderid", "fieldValue": "I126101"},
]

# Merge the name/value pairs into one dict so each fieldName becomes a column,
# then wrap it in a list to build a single-row DataFrame.
df_data = [{d["fieldName"]: d["fieldValue"] for d in data}]
df = spark.createDataFrame(df_data)
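For reference, a minimal sketch of the loading step the comment above assumes (the file name is hypothetical; "additionalFields" is the array from the question):

import json

# Hypothetical file name; adjust to your actual file.
with open("input.json") as f:
    data = json.load(f)["additionalFields"]

With the pairs merged into one dict, df.show() returns a single row with one column per field name.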
Answer 2
Score: 0
I hope this is what your sample data looks like:
[
{"fieldName":"customer_name","fieldValue":"ABC"},
{"fieldName":"deviceid","fieldValue":"1234"},
{"fieldName":"txn_id","fieldValue":"2"},
{"fieldName":"txn_date","fieldValue":"2017-08-14T18:17:37"},
{"fieldName":"orderid","fieldValue":"I126101"}
]
- Read the sample JSON file using Spark read method
df = spark.read.options(multiLine=True).json("path/to/sample.json")
- This function extracts data column wise
def return_result(df, column):
    # Collect the values of a single column as a Python list.
    return df.select(column).rdd.map(lambda row: row[column]).collect()
- Using the above function on all the columns
records = [return_result(df, field) for field in df.columns]
This is what it looks like:
[['customer_name', 'deviceid', 'txn_id', 'txn_date', 'orderid'], ['ABC', '1234', '2', '2017-08-14T18:17:37', 'I126101']]
- The first record is the header, and the rest of them are data
columns, data = records[0], records[1:]
- Use the collected data and create a DataFrame out of it
converted_df = sc.parallelize(data).toDF(columns)
converted_df.show()
Output:
+-------------+--------+------+-------------------+-------+
|customer_name|deviceid|txn_id| txn_date|orderid|
+-------------+--------+------+-------------------+-------+
| ABC| 1234| 2|2017-08-14T18:17:37|I126101|
+-------------+--------+------+-------------------+-------+
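As an alternative to collecting column by column, the same reshaping can be done inside Spark with a pivot; a minimal sketch against the sample file above (all values stay strings, as in the output shown):

from pyspark.sql import functions as F

df = spark.read.options(multiLine=True).json("path/to/sample.json")

# Group all rows together, then turn each fieldName into a column
# holding its fieldValue.
pivoted = (
    df.groupBy(F.lit(1).alias("grp"))
      .pivot("fieldName")
      .agg(F.first("fieldValue"))
      .drop("grp")
)
pivoted.show()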