Add selected columns from complex exploding dataframe to another dataframe in pyspark

Question

As sample data I have:

    {
      "GeneralInformation":{
        "ID":"00001",
        "WebLinksInfo":{
          "LastUpdated":"2019-10-27",
          "WebSite":{
            "Type":"Home Page",
            "text":"https://www.aaaa.com/"
          }
        },
        "TextInfo":{
          "Text":[
            {
              "Type":"Business",
              "updated_at":"2018-09-14",
              "unused_field":"en-US",
              "Description":"Lorem ipsum dolor sit amet, consectetur adipiscing elit, laborum."
            },
            {
              "Type":"Financial",
              "updated_at":"2022-08-26",
              "unused_field":"en-US",
              "Description":"Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat."
            }
          ]
        },
        "Advisors":{
          "Auditor":{
            "Code":"AAA",
            "Name":"Aristotle"
          }
        }
      }
    }

I have a dataframe with this schema:

    root
     |-- GeneralInformation: struct (nullable = true)
     |    |-- Advisors: struct (nullable = true)
     |    |    |-- Auditor: struct (nullable = true)
     |    |    |    |-- Code: string (nullable = true)
     |    |    |    |-- Name: string (nullable = true)
     |    |-- ID: string (nullable = true)
     |    |-- TextInfo: struct (nullable = true)
     |    |    |-- Text: array (nullable = true)
     |    |    |    |-- element: struct (containsNull = true)
     |    |    |    |    |-- Description: string (nullable = true)
     |    |    |    |    |-- updated_at: string (nullable = true)
     |    |    |    |    |-- Type: string (nullable = true)
     |    |    |    |    |-- unused_field: string (nullable = true)
     |    |-- WebLinksInfo: struct (nullable = true)
     |    |    |-- LastUpdated: string (nullable = true)
     |    |    |-- WebSite: struct (nullable = true)
     |    |    |    |-- text: string (nullable = true)
     |    |    |    |-- Type: string (nullable = true)

I am only interested in extracting ID, TextInfo.Text.Description, and updated_at.

Right now I extract the ID like this:

    from pyspark.sql.functions import col

    df = spark.read.json('my_file_path')
    new_df = df.select(col('GeneralInformation.ID'))
    # Note: join() with no join condition produces a cross join (Cartesian product)
    new_df = new_df.join(df.select(col('GeneralInformation.TextInfo.Text')))

I end up with the schema below, but I want just ID, Description, and Updated_at at the root level, and only one record per ID no matter how many descriptions exist in the Text array. The record should be selected by the Type value: if Type is 'Business', take that entry's Description and updated_at for the ID.

    root
     |-- ID: string (nullable = true)
     |-- Text: array (nullable = true)
     |    |-- element: struct (containsNull = true)
     |    |    |-- Description: string (nullable = true)
     |    |    |-- updated_at: string (nullable = true)
     |    |    |-- Type: string (nullable = true)
     |    |    |-- unused_field: string (nullable = true)

Expected output:

    {
      "ID": "00001",
      "Description": "Lorem ipsum dolor sit amet, consectetur adipiscing elit, laborum.",
      "Updated_at": "2018-09-14"
    }

For example, if I use:

    from pyspark.sql import functions as F
    from pyspark.sql.functions import lit

    df = df.withColumn('description', F.when(
        df['Text'][0]['Type'] == 'Business', lit(df['Text'][0]['Description'])))

This gets the description out as a new column on the dataframe, which I can keep while dropping TextInfo, but it does not guarantee that the other elements of the Text array are checked.
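
To make that concrete, here is a minimal sketch (hypothetical, simplified data rather than the original schema) of how a hard-coded index misses a matching entry that is not first in the array:

    from pyspark.sql import functions as F

    # Hypothetical rows: in the second row 'Business' sits at index 1,
    # so the index-0 condition is false and the result is null.
    demo = spark.createDataFrame(
        [(['Business', 'Financial'],), (['Financial', 'Business'],)],
        ['Types'],
    )
    demo.select(
        F.when(F.col('Types')[0] == 'Business', 'found').alias('hit')
    ).show()
    # First row -> 'found'; second row -> null, even though 'Business' is at index 1.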

Answer 1

Score: 1

Try accessing the array by index.

Example:

    from pyspark.sql.functions import *

    js_str = """{
      "GeneralInformation":{
        "ID":"00001",
        "WebLinksInfo":{
          "LastUpdated":"2019-10-27",
          "WebSite":{
            "Type":"Home Page",
            "text":"https://www.aaaa.com/"
          }
        },
        "TextInfo":{
          "Text":[
            {
              "Type":"Business",
              "updated_at":"2018-09-14",
              "unused_field":"en-US",
              "Description":"Lorem ipsum dolor sit amet, consectetur adipiscing elit, laborum."
            },
            {
              "Type":"Financial",
              "updated_at":"2022-08-26",
              "unused_field":"en-US",
              "Description":"Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat."
            }
          ]
        },
        "Advisors":{
          "Auditor":{
            "Code":"AAA",
            "Name":"Aristotle"
          }
        }
      }
    }"""

    df = spark.read.json(sc.parallelize([js_str]), multiLine=True)

    # [0] takes the first element of the Text array
    df.select(col('GeneralInformation.ID'),
              col('GeneralInformation.TextInfo.Text.Description')[0].alias("Description"),
              col('GeneralInformation.TextInfo.Text.updated_at')[0].alias("updated_at")).\
        show(10, False)

    #+-----+-----------------------------------------------------------------+----------+
    #|ID   |Description                                                      |updated_at|
    #+-----+-----------------------------------------------------------------+----------+
    #|00001|Lorem ipsum dolor sit amet, consectetur adipiscing elit, laborum.|2018-09-14|
    #+-----+-----------------------------------------------------------------+----------+
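
Note that indexing with [0] assumes the 'Business' entry always comes first in the array. If that ordering is not guaranteed, a sketch along these lines (assuming Spark 3.1+, where pyspark.sql.functions.filter works on array columns) selects the element by its Type instead:

    from pyspark.sql import functions as F

    # Keep only the Text entries whose Type is 'Business', then take the
    # first match; [0] yields null when no element matches.
    business = F.filter(
        F.col('GeneralInformation.TextInfo.Text'),
        lambda x: x['Type'] == 'Business',
    )[0]

    df.select(
        F.col('GeneralInformation.ID').alias('ID'),
        business['Description'].alias('Description'),
        business['updated_at'].alias('Updated_at'),
    ).show(truncate=False)

Using the F namespace also avoids shadowing Python's built-in filter, which the star import above can otherwise do.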
