Remove a struct field inside an array column in PySpark

Question

I have a PySpark DataFrame with the schema below, and I need to remove the item_platform_id field from the structs in the ITEMS array whenever it is present.
I tried using drop, but it didn't work.

root
 |-- MISSION_ID: string (nullable = true)
 |-- COUNTRY: string (nullable = true)
 |-- SPONSORED_MISSION: string (nullable = true)
 |-- MISSION_TYPE: string (nullable = true)
 |-- SPONSORED_SEGMENTATION: string (nullable = true)
 |-- START_DATE: timestamp (nullable = true)
 |-- END_DATE: timestamp (nullable = true)
 |-- CREATE_DATE: timestamp (nullable = true)
 |-- UPDATE_DATE: timestamp (nullable = true)
 |-- SPONSOR_PARTNER_ID: string (nullable = true)
 |-- CONSIDER_DELIVERY_WINDOW: boolean (nullable = true)
 |-- CONSIDER_BLOCK_LIST: boolean (nullable = true)
 |-- DIGITALIZATION_LEVEL: string (nullable = true)
 |-- ITEMS: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- _id: string (nullable = true)
 |    |    |-- quantity: integer (nullable = true)
 |    |    |-- item_platform_id: string (nullable = true)
 |-- COMBOS: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- _id: string (nullable = true)
 |    |    |-- comboId: integer (nullable = true)
 |    |    |-- quantity: integer (nullable = true)
 |-- ENABLED: boolean (nullable = true)

Expected:

root
 |-- MISSION_ID: string (nullable = true)
 |-- COUNTRY: string (nullable = true)
 |-- SPONSORED_MISSION: string (nullable = true)
 |-- MISSION_TYPE: string (nullable = true)
 |-- SPONSORED_SEGMENTATION: string (nullable = true)
 |-- START_DATE: timestamp (nullable = true)
 |-- END_DATE: timestamp (nullable = true)
 |-- CREATE_DATE: timestamp (nullable = true)
 |-- UPDATE_DATE: timestamp (nullable = true)
 |-- SPONSOR_PARTNER_ID: string (nullable = true)
 |-- CONSIDER_DELIVERY_WINDOW: boolean (nullable = true)
 |-- CONSIDER_BLOCK_LIST: boolean (nullable = true)
 |-- DIGITALIZATION_LEVEL: string (nullable = true)
 |-- ITEMS: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- _id: string (nullable = true)
 |    |    |-- quantity: integer (nullable = true)
 |-- COMBOS: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- _id: string (nullable = true)
 |    |    |-- comboId: integer (nullable = true)
 |    |    |-- quantity: integer (nullable = true)
 |-- ENABLED: boolean (nullable = true)

Answer 1

Score: 1

You can check whether the field exists in the structs of the array and, if it does, use dropFields inside transform to remove it from each struct (both are available in the Python API since Spark 3.1).

Example:

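from pyspark.sql import SparkSession
from pyspark.sql import functions as func

spark = SparkSession.builder.getOrCreate()

# sample data
data_ls = [
    (1, [(1, 2, 3), (4, 5, 6)])
]

data_sdf = spark.createDataFrame(data_ls, 'id int, items array<struct<_id: int, quantity: int, item_platform_id: int>>')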

# +---+----------------------+
# |id |items                 |
# +---+----------------------+
# |1  |[{1, 2, 3}, {4, 5, 6}]|
# +---+----------------------+

# root
#  |-- id: integer (nullable = true)
#  |-- items: array (nullable = true)
#  |    |-- element: struct (containsNull = true)
#  |    |    |-- _id: integer (nullable = true)
#  |    |    |-- quantity: integer (nullable = true)
#  |    |    |-- item_platform_id: integer (nullable = true)

# check existence and remove if exists; otherwise keep the DataFrame as-is
item_fields = data_sdf.schema['items'].dataType.elementType.fieldNames()
if 'item_platform_id' in item_fields:
    new_data_sdf = data_sdf.withColumn(
        'items', func.transform('items', lambda x: x.dropFields('item_platform_id'))
    )
else:
    new_data_sdf = data_sdf

new_data_sdf.show(truncate=False)

# +---+----------------+
# |id |items           |
# +---+----------------+
# |1  |[{1, 2}, {4, 5}]|
# +---+----------------+

huangapple
  • Published on February 24, 2023, 02:08:03
  • When reposting, please keep the link to this article: https://go.coder-hub.com/75548706.html