Add selected columns from complex exploding dataframe to another dataframe in pyspark

Question

As sample data I have:

{
    "GeneralInformation":{
        "ID":"00001",
        "WebLinksInfo":{
            "LastUpdated":"2019-10-27",
            "WebSite":{
                "Type":"Home Page",
                "text":"https://www.aaaa.com/"
            }
        },
        "TextInfo":{
            "Text":[
                {
                    "Type":"Business",
                    "updated_at":"2018-09-14",
                    "unused_field":"en-US",
                    "Description":"Lorem ipsum dolor sit amet, consectetur adipiscing elit, laborum."
                },
                {
                    "Type":"Financial",
                    "updated_at":"2022-08-26",
                    "unused_field":"en-US",
                    "Description":"Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat."
                }
            ]
        },
        "Advisors":{
            "Auditor":{
                "Code":"AAA",
                "Name":"Aristotle"
            }
        }
    }
}

My dataframe has this schema:

root
 |-- GeneralInformation: struct (nullable = true)
 |    |-- Advisors: struct (nullable = true)
 |    |    |-- Auditor: struct (nullable = true)
 |    |    |    |-- Code: string (nullable = true)
 |    |    |    |-- Name: string (nullable = true)
 |    |-- ID: string (nullable = true)
 |    |-- TextInfo: struct (nullable = true)
 |    |    |-- Text: array (nullable = true)
 |    |    |    |-- element: struct (containsNull = true)
 |    |    |    |    |-- Description: string (nullable = true)
 |    |    |    |    |-- updated_at: string (nullable = true)
 |    |    |    |    |-- Type: string (nullable = true)
 |    |    |    |    |-- unused_field: string (nullable = true)
 |    |-- WebLinksInfo: struct (nullable = true)
 |    |    |-- LastUpdated: string (nullable = true)
 |    |    |-- WebSite: struct (nullable = true)
 |    |    |    |-- text: string (nullable = true)
 |    |    |    |-- Type: string (nullable = true)

I am only interested in extracting ID, TextInfo.Text.Description, and updated_at.

Right now I extract the ID like this:

    from pyspark.sql.functions import col

    df = spark.read.json('my_file_path')

    new_df = df.select(col('GeneralInformation.ID'))
    new_df = new_df.join(df.select(col('GeneralInformation.TextInfo.Text')))

I end up with the schema below, but I want only ID, Description, and Updated_at at the root level, and just one record per ID no matter how many entries exist in the Text array. The record to keep should be chosen by the Type value: if an element's Type is 'Business', take its Description and updated_at for that ID.

root
 |-- ID: string (nullable = true)
 |-- Text: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- Description: string (nullable = true)
 |    |    |-- updated_at: string (nullable = true)
 |    |    |-- Type: string (nullable = true)
 |    |    |-- unused_field: string (nullable = true)

Expected output:

{
    "ID": "00001",
    "Description": "Lorem ipsum dolor sit amet, consectetur adipiscing elit, laborum.",
    "Updated_at": "2018-09-14"
}

For example, if I use:

from pyspark.sql import functions as F

df = df.withColumn('description', F.when(
    df['Text'][0]['Type'] == 'Business', df['Text'][0]['Description']))

I get the description as a new column on the dataframe, which I can keep while dropping TextInfo, but this only inspects index 0, so it gives no assurance that the other elements of the Text array will be checked.

Answer 1

Score: 1

Try accessing the array elements by index.

Example:

from pyspark.sql.functions import *
js_str = """{
    "GeneralInformation":{
        "ID":"00001",
        "WebLinksInfo":{
            "LastUpdated":"2019-10-27",
            "WebSite":{
                "Type":"Home Page",
                "text":"https://www.aaaa.com/"
            }
        },
        "TextInfo":{
            "Text":[
                {
                    "Type":"Business",
                    "updated_at":"2018-09-14",
                    "unused_field":"en-US",
                    "Description":"Lorem ipsum dolor sit amet, consectetur adipiscing elit, laborum."
                },
                {
                    "Type":"Financial",
                    "updated_at":"2022-08-26",
                    "unused_field":"en-US",
                    "Description":"Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat."
                }
            ]
        },
        "Advisors":{
            "Auditor":{
                "Code":"AAA",
                "Name":"Aristotle"
            }
        }
    }
}"""

# parallelize the JSON string into an RDD so spark.read.json can parse it
# (sc is the SparkContext, i.e. spark.sparkContext)
df = spark.read.json(sc.parallelize([js_str]), multiLine=True)

# Extracting a field from an array of structs yields an array of that field's
# values, so [0] picks the value from the first Text element.
df.select(col('GeneralInformation.ID'),
          col('GeneralInformation.TextInfo.Text.Description')[0].alias("Description"),
          col('GeneralInformation.TextInfo.Text.updated_at')[0].alias("updated_at")).\
            show(10, False)
#+-----+-----------------------------------------------------------------+----------+
#|ID   |Description                                                      |updated_at|
#+-----+-----------------------------------------------------------------+----------+
#|00001|Lorem ipsum dolor sit amet, consectetur adipiscing elit, laborum.|2018-09-14|
#+-----+-----------------------------------------------------------------+----------+
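
Indexing with [0] only works when the 'Business' entry always comes first in the array, which the question says is not guaranteed. Below is a minimal sketch of a position-independent variant, assuming Spark 3.1+ (where pyspark.sql.functions.filter and element_at are available) and the same df as above; it selects the element by its Type value rather than by position:

from pyspark.sql import functions as F

# keep only the Text elements whose Type is 'Business'; filter scans the
# whole array, so the matching entry can sit at any position
business = F.element_at(
    F.filter('GeneralInformation.TextInfo.Text',
             lambda t: t['Type'] == 'Business'),
    1)  # element_at is 1-based; with default (non-ANSI) settings it
        # returns NULL when no element matched

df.select(F.col('GeneralInformation.ID').alias('ID'),
          business['Description'].alias('Description'),
          business['updated_at'].alias('Updated_at')).show(10, False)

On Spark 2.4, where the Python wrapper for filter does not exist yet, the same predicate can be written in SQL form with F.expr("filter(GeneralInformation.TextInfo.Text, t -> t.Type = 'Business')").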
