February 24, 2023, 02:08:03

Remove a column element inside an array in pyspark

Question

I have a schema in PySpark, and I need to remove the item_platform_id field from the structs inside the ITEMS array whenever it is present.
I tried using drop, but it didn't work.

Current schema:

root
 |-- MISSION_ID: string (nullable = true)
 |-- COUNTRY: string (nullable = true)
 |-- SPONSORED_MISSION: string (nullable = true)
 |-- MISSION_TYPE: string (nullable = true)
 |-- SPONSORED_SEGMENTATION: string (nullable = true)
 |-- START_DATE: timestamp (nullable = true)
 |-- END_DATE: timestamp (nullable = true)
 |-- CREATE_DATE: timestamp (nullable = true)
 |-- UPDATE_DATE: timestamp (nullable = true)
 |-- SPONSOR_PARTNER_ID: string (nullable = true)
 |-- CONSIDER_DELIVERY_WINDOW: boolean (nullable = true)
 |-- CONSIDER_BLOCK_LIST: boolean (nullable = true)
 |-- DIGITALIZATION_LEVEL: string (nullable = true)
 |-- ITEMS: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- _id: string (nullable = true)
 |    |    |-- quantity: integer (nullable = true)
 |    |    |-- item_platform_id: string (nullable = true)
 |-- COMBOS: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- _id: string (nullable = true)
 |    |    |-- comboId: integer (nullable = true)
 |    |    |-- quantity: integer (nullable = true)
 |-- ENABLED: boolean (nullable = true)

Expected:

root
 |-- MISSION_ID: string (nullable = true)
 |-- COUNTRY: string (nullable = true)
 |-- SPONSORED_MISSION: string (nullable = true)
 |-- MISSION_TYPE: string (nullable = true)
 |-- SPONSORED_SEGMENTATION: string (nullable = true)
 |-- START_DATE: timestamp (nullable = true)
 |-- END_DATE: timestamp (nullable = true)
 |-- CREATE_DATE: timestamp (nullable = true)
 |-- UPDATE_DATE: timestamp (nullable = true)
 |-- SPONSOR_PARTNER_ID: string (nullable = true)
 |-- CONSIDER_DELIVERY_WINDOW: boolean (nullable = true)
 |-- CONSIDER_BLOCK_LIST: boolean (nullable = true)
 |-- DIGITALIZATION_LEVEL: string (nullable = true)
 |-- ITEMS: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- _id: string (nullable = true)
 |    |    |-- quantity: integer (nullable = true)
 |-- COMBOS: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- _id: string (nullable = true)
 |    |    |-- comboId: integer (nullable = true)
 |    |    |-- quantity: integer (nullable = true)
 |-- ENABLED: boolean (nullable = true)

Answer 1

Score: 1

You can check whether the field exists in the structs of the array, and then use dropFields to remove it from each struct (Column.dropFields is available since Spark 3.1).

Example:

from pyspark.sql import SparkSession
from pyspark.sql import functions as func

spark = SparkSession.builder.getOrCreate()

# sample data
data_ls = [
    (1, [(1, 2, 3), (4, 5, 6)])
]
data_sdf = spark.createDataFrame(data_ls, 'id int, items array<struct<_id: int, quantity: int, item_platform_id: int>>')
# +---+----------------------+
# |id |items                 |
# +---+----------------------+
# |1  |[{1, 2, 3}, {4, 5, 6}]|
# +---+----------------------+
# root
#  |-- id: integer (nullable = true)
#  |-- items: array (nullable = true)
#  |    |-- element: struct (containsNull = true)
#  |    |    |-- _id: integer (nullable = true)
#  |    |    |-- quantity: integer (nullable = true)
#  |    |    |-- item_platform_id: integer (nullable = true)
# check existence and remove if exists
new_data_sdf = data_sdf  # keep the original DataFrame when the field is absent
if 'item_platform_id' in data_sdf.withColumn('field', func.col('items')[0]).select('field.*').columns:
    # transform applies the lambda to every struct in the array
    new_data_sdf = data_sdf. \
        withColumn('items', func.transform('items', lambda x: x.dropFields('item_platform_id')))
new_data_sdf.show(truncate=False)
# +---+----------------+
# |id |items           |
# +---+----------------+
# |1  |[{1, 2}, {4, 5}]|
# +---+----------------+
