How can I unpivot two sets of columns in Spark?
Question
I'm looking to unpivot two sets of columns so that this:

+-------+---------+-----------+-------------+---------------+
|   date|sales_day|sales_night|customers_day|customers_night|
+-------+---------+-----------+-------------+---------------+
|June 15|        1|          2|           11|             12|
|June 16|        3|          4|           13|             14|
|June 17|        5|          6|           15|             16|
+-------+---------+-----------+-------------+---------------+

Is transformed into this:

+-------+---------+-----+---------+
|   date|timeOfDay|sales|customers|
+-------+---------+-----+---------+
|June 15|      day|    1|       11|
|June 15|    night|    2|       12|
|June 16|      day|    3|       13|
|June 16|    night|    4|       14|
|June 17|      day|    5|       15|
|June 17|    night|    6|       16|
+-------+---------+-----+---------+
I have tried using the stack function to unpivot the two sets of columns, but I can only unpivot one set. Whenever I use two stack functions, an exception is raised saying I can only use one generator function:
unpivotedDf = df.selectExpr(
    "date",
    "stack(2, 'day', sales_day, 'night', sales_night) as (timeOfDay, sales)",
    "stack(2, 'day', customers_day, 'night', customers_night) as (timeOfDay, customers)"
)
I can also unpivot both sets of columns separately and then join the separate tables on date and on timeOfDay, but this is really inefficient for large datasets. Is there an alternative?
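For reference, a minimal sketch of the input DataFrame, with values matching the tables above (assumes an active SparkSession named spark):

# minimal sketch of the input data; spark is assumed to be an
# active SparkSession
df = spark.createDataFrame(
    [
        ("June 15", 1, 2, 11, 12),
        ("June 16", 3, 4, 13, 14),
        ("June 17", 5, 6, 15, 16),
    ],
    ["date", "sales_day", "sales_night", "customers_day", "customers_night"],
)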
Answer 1
Score: 1
One way is to unpivot the whole data set into key/value pairs, and then pivot the required columns back out.

Here's an example:
from pyspark.sql import functions as func

# unpivot all four columns into (key, val) pairs, split each key into its
# attribute and time-of-day parts, then pivot the attribute back out
data_sdf. \
    selectExpr('date',
               'stack(4, \
                   "sales_day", sales_day, \
                   "sales_night", sales_night, \
                   "cust_day", cust_day, \
                   "cust_night", cust_night \
               ) as (key, val)'
               ). \
    withColumn('key_split', func.split('key', '_')). \
    selectExpr('date', 'key_split[0] as attr', 'key_split[1] as time_of_day', 'val'). \
    groupBy('date', 'time_of_day'). \
    pivot('attr', values=['cust', 'sales']). \
    agg(func.first('val')). \
    orderBy('date', 'time_of_day'). \
    show()
# +-------+-----------+----+-----+
# | date|time_of_day|cust|sales|
# +-------+-----------+----+-----+
# |june 15| day| 11| 1|
# |june 15| night| 12| 2|
# |june 16| day| 13| 3|
# |june 16| night| 14| 4|
# |june 17| day| 15| 5|
# |june 17| night| 15| 5|
# +-------+-----------+----+-----+
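To make the two-step logic concrete, here is a sketch of the intermediate long format produced by the stack() call on its own, before the key is split and pivoted back (same expression and data_sdf as above):

# sketch: inspect the unpivoted (key, val) pairs before the pivot step
data_sdf.selectExpr(
    'date',
    'stack(4, "sales_day", sales_day, "sales_night", sales_night, '
    '"cust_day", cust_day, "cust_night", cust_night) as (key, val)'
).show()
# one row per (date, key) pair, e.g. ("june 15", "sales_day", 1)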
Answer 2
Score: 1
Let's create a map from time of day to the corresponding columns. Then we can explode the map and select the required columns.
from itertools import chain
from pyspark.sql import functions as F

# build (key, value) pairs for create_map:
# 'day' -> [sales_day, customers_day], 'night' -> [sales_night, customers_night]
c = chain(*[(F.lit(c), F.array(f'sales_{c}', f'customers_{c}')) for c in ('day', 'night')])

result = (
    df
    .select('date', F.explode(F.create_map(*c)))
    .selectExpr('date', 'key as time_of_day', 'value[0] as sales', 'value[1] as customers')
)
Result
+-------+-----------+-----+---------+
| date|time_of_day|sales|customers|
+-------+-----------+-----+---------+
|June 15| day| 1| 11|
|June 15| night| 2| 12|
|June 16| day| 3| 13|
|June 16| night| 4| 14|
|June 17| day| 5| 15|
|June 17| night| 6| 16|
+-------+-----------+-----+---------+
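For illustration, a sketch that inspects the exploded map before the final column renaming, under the same assumptions (the df above and the same imports):

from itertools import chain
from pyspark.sql import functions as F

# rebuild the (key, value) pairs: each time of day maps to an array
# holding [sales, customers] for that time of day
pairs = chain(*[(F.lit(c), F.array(f'sales_{c}', f'customers_{c}')) for c in ('day', 'night')])

# explode() turns each map entry into a row with 'key' and 'value' columns;
# 'value' is the two-element array indexed as value[0] / value[1] above
df.select('date', F.explode(F.create_map(*pairs))).show(truncate=False)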
Comments