Add selected columns from complex exploding dataframe to another dataframe in pyspark

Question
As sample data I have:
{
  "GeneralInformation": {
    "ID": "00001",
    "WebLinksInfo": {
      "LastUpdated": "2019-10-27",
      "WebSite": {
        "Type": "Home Page",
        "text": "https://www.aaaa.com/"
      }
    },
    "TextInfo": {
      "Text": [
        {
          "Type": "Business",
          "updated_at": "2018-09-14",
          "unused_field": "en-US",
          "Description": "Lorem ipsum dolor sit amet, consectetur adipiscing elit, laborum."
        },
        {
          "Type": "Financial",
          "updated_at": "2022-08-26",
          "unused_field": "en-US",
          "Description": "Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat."
        }
      ]
    },
    "Advisors": {
      "Auditor": {
        "Code": "AAA",
        "Name": "Aristotle"
      }
    }
  }
}
I have this schema dataframe:
root
|-- GeneralInformation: struct (nullable = true)
| |-- Advisors: struct (nullable = true)
| | |-- Auditor: struct (nullable = true)
| | | |-- Code: string (nullable = true)
| | | |-- Name: string (nullable = true)
| |-- ID: string (nullable = true)
| |-- TextInfo: struct (nullable = true)
| | |-- Text: array (nullable = true)
| | | |-- element: struct (containsNull = true)
| | | | |-- Description: string (nullable = true)
| | | | |-- updated_at: string (nullable = true)
| | | | |-- Type: string (nullable = true)
| | | | |-- unused_field: string (nullable = true)
| |-- WebLinksInfo: struct (nullable = true)
| | |-- LastUpdated: string (nullable = true)
| | |-- WebSite: struct (nullable = true)
| | | |-- text: string (nullable = true)
| | | |-- Type: string (nullable = true)
I am just interested in extracting ID, TextInfo.Text.Description, and updated_at.
Right now I extract the ID like this:
from pyspark.sql.functions import col

df = spark.read.json('my_file_path')
new_df = df.select(col('GeneralInformation.ID'))
new_df = new_df.join(df.select(col('GeneralInformation.TextInfo.Text')))
I end up with this schema, but I want just ID, Description, and Updated_at at the root level, and only one record no matter how many descriptions exist in the Text array, chosen by the Type value: if Type is 'Business', get that element's Description and updated_at for the ID.
root
|-- ID: string (nullable = true)
|-- Text: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- Description: string (nullable = true)
| | |-- updated_at: string (nullable = true)
| | |-- Type: string (nullable = true)
| | |-- unused_field: string (nullable = true)
Expected output:
{
  "ID": "00001",
  "Description": "Lorem ipsum dolor sit amet, consectetur adipiscing elit, laborum.",
  "Updated_at": "2018-09-14"
}
For example, if I use:
import pyspark.sql.functions as F
from pyspark.sql.functions import lit

df = df.withColumn('description', F.when(
    df['Text'][0]['Type'] == 'Business', lit(df['Text'][0]['Description'])))
That gives me the description as a new column on the dataframe, which I can keep while dropping TextInfo, but it does not guarantee that the other elements of the Text array are checked.
Answer 1

Score: 1
Try accessing the array by index.
Example:
from pyspark.sql.functions import *
js_str = """{
"GeneralInformation":{
"ID":"00001",
"WebLinksInfo":{
"LastUpdated":"2019-10-27",
"WebSite":{
"Type":"Home Page",
"text":"https://www.aaaa.com/"
}
},
"TextInfo":{
"Text":[
{
"Type":"Business",
"updated_at":"2018-09-14",
"unused_field":"en-US",
"Description":"Lorem ipsum dolor sit amet, consectetur adipiscing elit, laborum."
},
{
"Type":"Financial",
"updated_at":"2022-08-26",
"unused_field":"en-US",
"Description":"Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat."
}
]
},
"Advisors":{
"Auditor":{
"Code":"AAA",
"Name":"Aristotle"
}
}
}
}"""
df = spark.read.json(sc.parallelize([js_str]), multiLine=True)
df.select(col('GeneralInformation.ID'),
          col('GeneralInformation.TextInfo.Text.Description')[0].alias("Description"),
          col('GeneralInformation.TextInfo.Text.updated_at')[0].alias("updated_at")).\
    show(10, False)
#+-----+-----------------------------------------------------------------+----------+
#|ID |Description |updated_at|
#+-----+-----------------------------------------------------------------+----------+
#|00001|Lorem ipsum dolor sit amet, consectetur adipiscing elit, laborum.|2018-09-14|
#+-----+-----------------------------------------------------------------+----------+