Remove a column element inside an array in PySpark
Question
I have a DataFrame in PySpark and I need to remove the item_platform_id field from the structs inside the ITEMS array whenever it is present.
I tried using drop, but it didn't work.
Current schema:
root
|-- MISSION_ID: string (nullable = true)
|-- COUNTRY: string (nullable = true)
|-- SPONSORED_MISSION: string (nullable = true)
|-- MISSION_TYPE: string (nullable = true)
|-- SPONSORED_SEGMENTATION: string (nullable = true)
|-- START_DATE: timestamp (nullable = true)
|-- END_DATE: timestamp (nullable = true)
|-- CREATE_DATE: timestamp (nullable = true)
|-- UPDATE_DATE: timestamp (nullable = true)
|-- SPONSOR_PARTNER_ID: string (nullable = true)
|-- CONSIDER_DELIVERY_WINDOW: boolean (nullable = true)
|-- CONSIDER_BLOCK_LIST: boolean (nullable = true)
|-- DIGITALIZATION_LEVEL: string (nullable = true)
|-- ITEMS: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- _id: string (nullable = true)
| | |-- quantity: integer (nullable = true)
| | |-- item_platform_id: string (nullable = true)
|-- COMBOS: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- _id: string (nullable = true)
| | |-- comboId: integer (nullable = true)
| | |-- quantity: integer (nullable = true)
|-- ENABLED: boolean (nullable = true)
Expected:
root
|-- MISSION_ID: string (nullable = true)
|-- COUNTRY: string (nullable = true)
|-- SPONSORED_MISSION: string (nullable = true)
|-- MISSION_TYPE: string (nullable = true)
|-- SPONSORED_SEGMENTATION: string (nullable = true)
|-- START_DATE: timestamp (nullable = true)
|-- END_DATE: timestamp (nullable = true)
|-- CREATE_DATE: timestamp (nullable = true)
|-- UPDATE_DATE: timestamp (nullable = true)
|-- SPONSOR_PARTNER_ID: string (nullable = true)
|-- CONSIDER_DELIVERY_WINDOW: boolean (nullable = true)
|-- CONSIDER_BLOCK_LIST: boolean (nullable = true)
|-- DIGITALIZATION_LEVEL: string (nullable = true)
|-- ITEMS: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- _id: string (nullable = true)
| | |-- quantity: integer (nullable = true)
|-- COMBOS: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- _id: string (nullable = true)
| | |-- comboId: integer (nullable = true)
| | |-- quantity: integer (nullable = true)
|-- ENABLED: boolean (nullable = true)
Answer 1
Score: 1
You can check whether the field exists within the structs of the array, and then use dropFields inside transform to remove it from each struct (both are available in the PySpark API since Spark 3.1).
Example:
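from pyspark.sql import SparkSession, functions as func

# reuse the active session if there is one, otherwise create it
spark = SparkSession.builder.getOrCreate()

# sample data
data_ls = [
    (1, [(1, 2, 3), (4, 5, 6)])
]
data_sdf = spark.createDataFrame(data_ls, 'id int, items array<struct<_id: int, quantity: int, item_platform_id: int>>')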
# +---+----------------------+
# |id |items |
# +---+----------------------+
# |1 |[{1, 2, 3}, {4, 5, 6}]|
# +---+----------------------+
# root
# |-- id: integer (nullable = true)
# |-- items: array (nullable = true)
# | |-- element: struct (containsNull = true)
# | | |-- _id: integer (nullable = true)
# | | |-- quantity: integer (nullable = true)
# | | |-- item_platform_id: integer (nullable = true)
# check existence and remove if exists
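if 'item_platform_id' in data_sdf.withColumn('field', func.col('items')[0]).select('field.*').columns:
    # rebuild every struct in the array without the unwanted field
    new_data_sdf = data_sdf. \
        withColumn('items', func.transform('items', lambda x: x.dropFields('item_platform_id')))
else:
    # field is absent, keep the DataFrame unchanged
    new_data_sdf = data_sdf
new_data_sdf.show(truncate=False)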
# +---+----------------+
# |id |items |
# +---+----------------+
# |1 |[{1, 2}, {4, 5}]|
# +---+----------------+