How can I unpivot two sets of columns in Spark?

Question

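I'm looking to unpivot two sets of columns so that this:

[image: input table, one row per date, with columns date, sales_day, sales_night, customers_day, customers_night]

is transformed into this:

[image: output table, one row per date and time of day, with columns date, timeOfDay, sales, customers]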

I have tried using the stack function to unpivot the two sets of columns, but I can only unpivot one set. Whenever I use two stack functions, an exception is raised saying I can only use one generator function:

unpivotedDf = df.selectExpr(
    "date",
    "stack(2, 'day', sales_day, 'night', sales_night) as (timeOfDay, sales)",
    "stack(2, 'day', customers_day, 'night', customers_night) as (timeOfDay, customers)"
)

I can also unpivot both sets of columns separately and then join the separate tables on date and on timeOfDay, but this is really inefficient for large datasets. Is there an alternative?

Answer 1

Score: 1

One way is to unpivot the whole dataset and then pivot the required columns back.

Here's an example:

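from pyspark.sql import functions as func

# data_sdf has columns: date, sales_day, sales_night, cust_day, cust_night
(
    data_sdf
    # Unpivot all four value columns into (key, val) pairs with a single stack() call.
    .selectExpr(
        'date',
        '''stack(4,
                 "sales_day", sales_day,
                 "sales_night", sales_night,
                 "cust_day", cust_day,
                 "cust_night", cust_night
           ) as (key, val)'''
    )
    # Split e.g. "sales_day" into attr = "sales" and time_of_day = "day".
    .withColumn('key_split', func.split('key', '_'))
    .selectExpr('date', 'key_split[0] as attr', 'key_split[1] as time_of_day', 'val')
    # Pivot attr back out so each attribute gets its own column.
    .groupBy('date', 'time_of_day')
    .pivot('attr', values=['cust', 'sales'])
    .agg(func.first('val'))
    .orderBy('date', 'time_of_day')
    .show()
)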

# +-------+-----------+----+-----+
# |   date|time_of_day|cust|sales|
# +-------+-----------+----+-----+
# |june 15|        day|  11|    1|
# |june 15|      night|  12|    2|
# |june 16|        day|  13|    3|
# |june 16|      night|  14|    4|
# |june 17|        day|  15|    5|
# |june 17|      night|  15|    5|
# +-------+-----------+----+-----+
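
A variation on the same idea, offered as a sketch rather than as part of the original answer: stack can emit more than one value column per output row, so a single call with one row per time of day should produce the target shape directly, without the split and pivot steps. The helper below only builds the SQL string for selectExpr. Its name, build_stack_expr, is hypothetical, and it assumes every source column follows the <attr>_<time> naming used above.

```python
def build_stack_expr(attrs, times, label="time_of_day"):
    """Build one stack() expression that yields one row per time of day.

    Assumes source columns are named "<attr>_<time>", e.g. sales_day.
    build_stack_expr is an illustrative helper, not a Spark API.
    """
    rows = []
    for t in times:
        # One output row: the time-of-day literal followed by each attribute column.
        cols = ", ".join(f"{a}_{t}" for a in attrs)
        rows.append(f"'{t}', {cols}")
    out_cols = ", ".join([label, *attrs])
    return f"stack({len(times)}, {', '.join(rows)}) as ({out_cols})"


expr = build_stack_expr(["sales", "cust"], ["day", "night"])
print(expr)

# Intended usage with the DataFrame from the answer above (not executed here):
# data_sdf.selectExpr("date", expr).show()
```

Because each output column of stack takes its values from several source columns, the columns that land in the same position (for example sales_day and sales_night) need compatible types.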

Answer 2

Score: 1

Let's create a map from each time of day to the corresponding columns. Then we can explode the map and select the required columns.

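from itertools import chain

from pyspark.sql import functions as F

# Build alternating key/value arguments for create_map:
# 'day' -> [sales_day, customers_day], 'night' -> [sales_night, customers_night]
map_args = chain(*[
    (F.lit(t), F.array(f'sales_{t}', f'customers_{t}'))
    for t in ('day', 'night')
])

result = (
    df
    # explode() on a map yields one row per entry, in columns named key and value
    .select('date', F.explode(F.create_map(*map_args)))
    .selectExpr('date', 'key as time_of_day', 'value[0] as sales', 'value[1] as customers')
)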

Result

+-------+-----------+-----+---------+
|   date|time_of_day|sales|customers|
+-------+-----------+-----+---------+
|June 15|        day|    1|       11|
|June 15|      night|    2|       12|
|June 16|        day|    3|       13|
|June 16|      night|    4|       14|
|June 17|        day|    5|       15|
|June 17|      night|    6|       16|
+-------+-----------+-----+---------+
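
To make the map-and-explode step concrete, here is a plain-Python model of what happens to a single input row. It does not use Spark at all; the function name explode_row is hypothetical, and the sample values are taken from the first two rows of the result table above.

```python
def explode_row(row, times=("day", "night")):
    """Mimic create_map + explode for one input row given as a plain dict."""
    # Counterpart of create_map: time of day -> [sales, customers]
    mapping = {t: [row[f"sales_{t}"], row[f"customers_{t}"]] for t in times}
    # Counterpart of explode + selectExpr: one output row per map entry
    return [
        {
            "date": row["date"],
            "time_of_day": key,
            "sales": value[0],
            "customers": value[1],
        }
        for key, value in mapping.items()
    ]


row = {
    "date": "June 15",
    "sales_day": 1,
    "sales_night": 2,
    "customers_day": 11,
    "customers_night": 12,
}
result_rows = explode_row(row)
for r in result_rows:
    print(r)
```

One consequence of this design is that sales and customers travel together in a single array, so in Spark the two columns should share a type or be castable to a common one.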

Posted on June 22, 2023 at 13:20:51
Source: https://go.coder-hub.com/76528790.html